Apache Spark Learning Pathway

Follow the pathway & improve your Spark knowledge!

Learn Spark with Kodey

Learn To Code In Spark

Learn to ingest & transform data in Spark DataFrames using Python in our video course.

Understand How Spark Works

Take our video course to understand the Spark architecture & the inner workings of Spark, so you can tune performance & overcome issues.

Optimize Your Spark Job Performance

There are a few simple tips and tricks you can use to significantly improve PySpark performance and optimize your Apache Spark scripts. In this blog post, we’ll look at 9 ways to improve the performance of your Apache Spark scripts in Python.
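
As a taster, here is a minimal sketch of two of the most common quick wins: caching a DataFrame that is reused, and broadcasting a small lookup table into a join to avoid a shuffle. The file paths and column names are hypothetical, purely for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("perf-tips").getOrCreate()

# Hypothetical inputs, for illustration only.
orders = spark.read.parquet("/data/orders")        # large fact table
countries = spark.read.parquet("/data/countries")  # small lookup table

# Tip: cache a DataFrame you reference more than once,
# so Spark doesn't recompute it for every action.
orders.cache()

# Tip: broadcast the small dimension table so the join avoids a shuffle.
enriched = orders.join(F.broadcast(countries), on="country_code", how="left")

enriched.groupBy("country_name").agg(F.sum("amount").alias("total_sales")).show()
```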

Further Improve Performance with Serialization

By using the Kryo serializer, you can reduce the overhead associated with serializing and deserializing objects in PySpark, allowing you to maximize performance and scalability.
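
A minimal sketch of switching to Kryo when building the Spark session; the buffer setting is optional and only needed if you serialize particularly large objects.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("kryo-example")
    # Replace the default Java serializer with Kryo.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # Optional: raise the buffer limit for large serialized objects.
    .config("spark.kryoserializer.buffer.max", "512m")
    .getOrCreate()
)
```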

Window Functions in Spark

Window functions are incredibly useful. Within a single query, you can calculate rankings, running totals and row-to-row comparisons that would otherwise require awkward self-joins. In this article, I will cover all of the key window functions in PySpark.
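
As a flavour of what’s to come, here is a minimal sketch using a made-up sales dataset: a running total, the previous row’s value, and a row number, all computed per seller.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("window-example").getOrCreate()

# Hypothetical sales data, for illustration only.
df = spark.createDataFrame(
    [("alice", "2023-01-01", 100), ("alice", "2023-01-02", 150),
     ("bob", "2023-01-01", 200), ("bob", "2023-01-03", 50)],
    ["seller", "sale_date", "amount"],
)

# One partition per seller, ordered by date.
w = Window.partitionBy("seller").orderBy("sale_date")

df = (
    df.withColumn("running_total", F.sum("amount").over(w))    # cumulative sum
      .withColumn("previous_amount", F.lag("amount").over(w))  # prior row's value
      .withColumn("sale_number", F.row_number().over(w))       # position in partition
)
df.show()
```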

Converting Pandas to Spark

Koalas is an open source library for Apache Spark that makes it easier to use Spark with popular Python data science libraries. It is designed to provide a pandas-like API on top of Spark DataFrames, allowing data scientists to move between pandas and Spark DataFrames without having to learn a new API.
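
A minimal sketch of round-tripping between pandas, Koalas and native Spark, using a small made-up dataset. Note that from Spark 3.2 the same API ships inside Spark as pyspark.pandas, but the databricks.koalas package shown here works the same way.

```python
import pandas as pd
import databricks.koalas as ks

# A small pandas DataFrame, for illustration only.
pdf = pd.DataFrame({"name": ["alice", "bob"], "score": [0.9, 0.7]})

# pandas -> Koalas: the same operations now run on Spark under the hood.
kdf = ks.from_pandas(pdf)
kdf["score_pct"] = kdf["score"] * 100   # familiar pandas-style syntax

# Drop down to a native PySpark DataFrame when you need the Spark API...
sdf = kdf.to_spark()

# ...or come back to plain pandas for local, in-memory work.
back_to_pandas = kdf.to_pandas()
```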

Handling Corrupt Files

Corrupt files refer to data files that contain errors or inconsistencies which prevent them from being read or parsed correctly. Spark provides read options that detect these corrupt records and let you keep them for inspection, drop them, or fail the job.
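
A minimal sketch of the three parser modes when reading JSON; the input path is hypothetical. PERMISSIVE (the default) keeps bad rows and routes the raw record into a designated column, DROPMALFORMED discards them, and FAILFAST aborts on the first corrupt record.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("corrupt-files").getOrCreate()

# PERMISSIVE keeps malformed rows and puts the raw text in _corrupt_record.
df = (
    spark.read
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    .json("/data/events/")   # hypothetical path
)
df.filter(df["_corrupt_record"].isNotNull()).show(truncate=False)

# DROPMALFORMED silently discards bad rows; FAILFAST fails the query instead.
clean_df = spark.read.option("mode", "DROPMALFORMED").json("/data/events/")

# For files that cannot be read at all (not just bad rows),
# this setting tells Spark to skip them rather than fail the query.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
```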

Handling Small Files

Small files in Apache Spark and Hadoop can cause a huge performance bottleneck, as processing them requires extra computational resources and time. Small files are typically those smaller than the default block size of 128 MB. While they do not take up much space individually on a distributed file system, they cause other problems such as longer execution times and increased I/O requests.
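
One common fix is to compact the data by rewriting it as fewer, larger files. A minimal sketch, with hypothetical paths and a made-up target file count:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

# Hypothetical source folder containing thousands of tiny part files.
df = spark.read.parquet("/data/raw/events/")

# Rewrite the data as a much smaller number of larger files.
# coalesce() avoids a full shuffle; use repartition() instead if the
# data also needs to be evenly redistributed across the new files.
df.coalesce(16).write.mode("overwrite").parquet("/data/compacted/events/")
```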

Data Cleaning in Spark

Most of our time as data engineers is spent cleaning up data. We have to deal with null values, duplicated rows, weird date formats, inconsistently capitalised words, accents (like the Spanish tilde), outliers and other nasty stuff. So, in this post we’re going to cover the following: