Spark

Solving the UDF performance problem with Pandas

I mentioned in a previous article that for performance reasons, you should avoid the use of UDFs wherever possible. And while that statement still stands, if you absolutely must use a UDF, you should consider a Pandas UDF rather than those that come out of the box with Spark. The standard UDFs in Spark operate […]

Read more
Spark

Using Kryo Serialization to boost Spark performance by 20%

Data serialisation plays a critical role in the performance of our data analytics scripts. In this article, we’ll discuss serialisation as a whole and then dive into Kryo serialisation in PySpark. What is serialisation? Often, we store our data in arrays, tables, trees, classes and other data structures, which are inefficient to transport. So, we […]

Read more
Spark

Repartition vs Coalesce

When we ingest data into Spark, it’s automatically partitioned, and those partitions are distributed across different nodes. So, imagine we have brought some data into Spark; it might be split across partitions like this:

Partition 1: 1, 7, 5, 9, 12
Partition 2: 15, 21, 12, 17, 19
Partition 3: 45, 32, 25, 99

When […]

Read more