Spark

Deploying a log monitor with Spark Structured Streaming

In this article I am going to cover how to create a simple streaming log monitor using Spark Structured Streaming. In subsequent posts, I’ll go deeper on some of these aspects. The first thing we need to do, as always, is to create a Spark Session. There isn’t much new here, except that we import the […]
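A minimal sketch of that first step (the app name and log path here are illustrative, not the article’s):

    from pyspark.sql import SparkSession

    # Create (or reuse) a Spark session
    spark = SparkSession.builder.appName("log-monitor").getOrCreate()

    # A streaming read over a directory of log files (path is hypothetical)
    logs = spark.readStream.format("text").load("/var/logs/app/")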

Read more
Spark

Handling corrupted input files during ETL

Often, we develop ETL pipelines and they just work: they do exactly what we expect them to do and produce the correct outputs. One day, though, your script stops working and the error points to a corrupted input file. That’s pretty annoying – it took you away from doing something you enjoy and now […]
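One hedged way to keep a pipeline alive past a bad file – not necessarily the fix this article lands on – is Spark’s built-in tolerance settings:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Skip files Spark cannot read instead of failing the whole job
    spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

    # For text-based sources, malformed records can be dropped per read
    df = spark.read.option("mode", "DROPMALFORMED").json("/data/input/")  # path hypothetical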

Read more
Spark

Melting our Spark dataframes with Koalas

Melting dataframes is the process of taking a short, wide table and making it into a long, thin one, using column headings as categorical data within the resulting dataframe. In the example below we have a dataframe which shows the total kilometres walked and cycled per person:

NAME    BIKEKM    WALKKM
Kieran  77        178
Bobby   79        […]
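A sketch of that melt using Koalas (the pre-Spark-3.2 databricks.koalas package); Bobby’s WALKKM is truncated in the excerpt, so the value below is a placeholder:

    import databricks.koalas as ks

    kdf = ks.DataFrame({
        "NAME": ["Kieran", "Bobby"],
        "BIKEKM": [77, 79],
        "WALKKM": [178, 0],  # 0 is a placeholder for the truncated value
    })

    # Column headings become categorical data in a long, thin frame
    long_kdf = kdf.melt(id_vars="NAME", value_vars=["BIKEKM", "WALKKM"])
    print(long_kdf)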

Read more
Spark

Testing our UDFs with unittest

Notebooks make it easy to test our code on the fly and check that the output dataframe looks correct. However, it is good practice to run some unit tests with edge cases – things you may not see very often & that may not be in your sample data. It’s also important to check that […]
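A minimal sketch of the idea – unit testing the plain Python function behind a hypothetical UDF, including an edge case:

    import unittest

    def clean_name(name):
        # Hypothetical function that would be wrapped as a Spark UDF
        return name.strip().title() if name else None

    class CleanNameTest(unittest.TestCase):
        def test_typical_value(self):
            self.assertEqual(clean_name("  kieran "), "Kieran")

        def test_edge_case_none(self):
            # An input you may never see in your sample data
            self.assertIsNone(clean_name(None))

    if __name__ == "__main__":
        unittest.main()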

Read more
Spark

Testing in PySpark (Part 1)

Testing is one of the most painful parts of data engineering, especially when you have a very large data set. But it is an absolutely necessary part of every project, as without it we can’t have complete confidence in our data. We can write unit tests using libraries like PyTest, which I will cover in […]
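As a taste of the PyTest side, one common pattern (not necessarily the article’s) is a shared local SparkSession fixture:

    import pytest
    from pyspark.sql import SparkSession

    @pytest.fixture(scope="session")
    def spark():
        # One small local session shared across the whole test run
        return SparkSession.builder.master("local[2]").appName("tests").getOrCreate()

    def test_row_count(spark):
        df = spark.createDataFrame([(1,), (2,)], ["id"])
        assert df.count() == 2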

Read more
Spark

Data Cleaning in PySpark

Most of our time as data engineers is spent cleaning up data. We have to deal with null values; duplicated rows; weird date formats; inconsistently capitalised words; accents (like the Spanish tilde in words); outliers and other nasty stuff. So, in this post we’re going to cover the following:

Handling null values
Handling duplicated data
[…]
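A hedged sketch of the first items on that list (the column names are made up for illustration):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("kieran", "LEEDS"), ("kieran", "LEEDS"), (None, None)],
        ["name", "city"],
    )

    df = df.dropna(how="all")                      # drop rows that are entirely null
    df = df.dropDuplicates()                       # remove duplicated rows
    df = df.withColumn("name", F.initcap("name"))  # fix inconsistent capitalisation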

Read more
Spark

Koalas = Pandas simplicity + Spark’s scalability

Whenever you start poking around a dataset to see what you’ve got to play with, you probably immediately write ‘import pandas as pd’ – why? Because Pandas is the gold standard in data analysis libraries; it’s so simple & yet so powerful. The problem is, Pandas just doesn’t scale, so if you’re going to be […]
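The pitch, roughly: keep the pandas-shaped API but execute on Spark. A sketch with the pre-Spark-3.2 databricks.koalas package (the file path is hypothetical):

    import databricks.koalas as ks

    # Reads like pandas, runs on Spark under the hood
    kdf = ks.read_csv("/data/rides.csv")
    print(kdf.head())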

Read more
Spark

Solving the UDF performance problem with Pandas

I mentioned in a previous article that, for performance reasons, you should avoid the use of UDFs wherever possible. And while that statement still stands, if you absolutely must use a UDF, you should consider a Pandas UDF rather than those that come out of the box with Spark. The standard UDFs in Spark operate […]
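A sketch of a Series-to-Series Pandas UDF, which works on whole batches rather than one row at a time (the conversion function is illustrative):

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    spark = SparkSession.builder.getOrCreate()

    @pandas_udf("double")
    def miles_to_km(miles: pd.Series) -> pd.Series:
        # Vectorised over a batch of rows at once
        return miles * 1.60934

    df = spark.createDataFrame([(10.0,), (26.2,)], ["miles"])
    df.withColumn("km", miles_to_km("miles")).show()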

Read more
Spark

Using Kryo Serialization to boost Spark performance by 20%

Data serialisation plays a critical role in the performance of our data analytics scripts. In this article, we’ll discuss serialisation as a whole & then dive into Kryo serialisation in PySpark. What is serialisation? Often, we store our data in arrays, tables, trees, classes and other data structures, which are inefficient to transport. So, we […]
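Turning Kryo on is a one-line configuration change; a sketch (note that in PySpark this mainly affects JVM-side data such as shuffles and cached RDDs):

    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    conf = SparkConf().set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    spark = SparkSession.builder.config(conf=conf).getOrCreate()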

Read more
Spark

Repartition vs Coalesce

When we ingest data into Spark it’s automatically partitioned, and those partitions are distributed across different nodes. So, imagine we have brought some data into Spark; it might be split across partitions like this:

Partition 1: 1, 7, 5, 9, 12
Partition 2: 15, 21, 12, 17, 19
Partition 3: 45, 32, 25, 99

When […]
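A quick sketch of the two calls (the partition counts are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1000)

    # repartition: full shuffle; can increase or decrease the partition count
    df10 = df.repartition(10)

    # coalesce: merges existing partitions without a full shuffle; only decreases
    df2 = df10.coalesce(2)

    print(df10.rdd.getNumPartitions(), df2.rdd.getNumPartitions())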

Read more
Spark

Overcoming Spark Errors

In the last article on performance tuning Apache Spark, we spoke about caching, UDFs and limiting your input data, which are all simple ways to see some potentially drastic performance improvements. In this post, we’re going to talk about the environmental factors that you can control to overcome some of the more common Spark issues. […]

Read more
Spark

Performance Tuning Apache Spark

Apache Spark provides us with a framework to crunch huge amounts of data efficiently by leveraging parallelism, which is great! Tweaking it, however, can be a bit of an art, which is not so great! In this article, I will talk about some of the ways that I have successfully tuned the performance of my […]

Read more