Follow the pathway & improve your Spark knowledge!
There are a few simple tips and tricks you can use to significantly improve PySpark performance and optimize your Apache Spark scripts. In this blog post, we’ll look at 9 ways to speed up your Apache Spark scripts in Python.
By using the Kryo serializer, you can reduce the overhead of serializing and deserializing objects in PySpark, improving both performance and scalability.
Window functions are incredibly useful: within a single query, you can compute results that would otherwise be awkward to express. In this article, I will cover the key window functions in PySpark.
Koalas is an open source library for Apache Spark that makes it easier to use Spark with popular Python data science libraries. It is designed to provide a pandas-like API that is compatible with the built-in PySpark DataFrame API, allowing data scientists to move between pandas and Spark DataFrames without having to learn a new API. (Since Spark 3.2, Koalas has been merged into Spark itself as `pyspark.pandas`.)
Corrupt files refer to data files that contain errors or inconsistencies which prevent them from being processed correctly. Spark can be configured to detect these corrupt records and either skip them or flag them, for example via `spark.sql.files.ignoreCorruptFiles` or a permissive read mode.
Small files in Apache Spark and Hadoop can cause a huge performance bottleneck, as processing them requires extra computational resources and processing time. Small files are typically those smaller than the default HDFS block size of 128 MB. While these small files do not take up much space on a distributed system, they cause other problems, such as longer execution times and increased I/O requests.
Most of our time as data engineers is spent cleaning up data. We have to deal with null values, duplicated rows, weird date formats, inconsistently capitalised words, accents (like the Spanish tilde in words), outliers and other nasty stuff. So, in this post we’re going to cover the following: