Spark

A crash course in UDFs in PySpark

Okay, so the first thing I should note is that you should avoid UDFs (User Defined Functions) like the plague unless you absolutely have to; I have spoken about why that is here. Now that I’ve warned you, let’s talk about what a UDF is. A UDF is a User Defined Function: it’s a function […]
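As a rough sketch of the idea (the function, data and names below are illustrative, not from the original post), this is what defining and applying a simple PySpark UDF looks like:

```python
# Minimal PySpark UDF sketch; names and data are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf_example").getOrCreate()

df = spark.createDataFrame([("kieran",), ("bobby",), (None,)], ["name"])

# Wrap an ordinary Python function as a UDF, declaring its return type.
capitalise_name = udf(lambda s: s.capitalize() if s is not None else None, StringType())

df.withColumn("name_clean", capitalise_name("name")).show()
```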

Spark

Fault Tolerant Friday: Creating robust pipelines in Python

When you’re dealing with tonnes of different data sources, issues crop up all the time: the source hasn’t been populated, the schema has changed, the schedule has changed, or the file format has changed. The key with our jobs is to build in enough checks to at the very least warn us if something fails, so we can […]
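As a hedged sketch of that kind of defensive check (the path, expected columns, and logging setup below are assumptions for illustration), a job might warn when the source is empty or the schema has drifted:

```python
# Illustrative guard clauses for a pipeline; path and schema are hypothetical.
import logging
from pyspark.sql import SparkSession

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

spark = SparkSession.builder.appName("robust_pipeline").getOrCreate()

EXPECTED_COLUMNS = {"id", "event_date", "value"}  # assumed expected schema

df = spark.read.parquet("/data/landing/events")  # hypothetical source path

# Warn loudly, rather than fail silently, on the issues described above.
if df.rdd.isEmpty():
    log.warning("Source has not been populated: /data/landing/events")

missing = EXPECTED_COLUMNS - set(df.columns)
if missing:
    log.warning("Schema has changed; missing columns: %s", sorted(missing))
```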

Spark

Starter Sundays: PySpark Basics, Selecting & Filtering Data

Welcome to this week’s Starter Sunday, where I cover some basic concepts of Spark, Python or Hive. Today, we’re looking at selecting and filtering data from our dataframes in Spark, specifically PySpark. Selecting specific columns: below, we have a dataframe called df_select, which takes just two columns from the dataframe […]
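A small sketch of the select-and-filter pattern (the source dataframe and column names are placeholders, with df_select mirroring the name mentioned in the teaser):

```python
# Selecting specific columns and filtering rows; data and names are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("select_filter").getOrCreate()

df = spark.createDataFrame(
    [("Kieran", 77, 178), ("Bobby", 79, 120)],
    ["NAME", "BIKEKM", "WALKKM"],
)

# Take just two columns from the dataframe...
df_select = df.select("NAME", "WALKKM")

# ...and keep only the rows that match a condition.
df_filtered = df_select.filter(F.col("WALKKM") > 150)
df_filtered.show()
```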

Spark

Spark Streaming: Introduction to DStreams

In my last article, I showed an example of a Spark Streaming application for ingesting application log files. Today, I am going to give an overview of DStreams in Spark Streaming. In subsequent articles, we’ll look in detail at some of the main Spark Streaming concepts. We can source data from loads of […]
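For orientation, here is a minimal DStream sketch (not the article’s code; the socket source and word count are assumptions) showing a stream processed in micro-batches:

```python
# Minimal DStream example: word counts over 5-second micro-batches from a socket.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="dstream_intro")
ssc = StreamingContext(sc, 5)  # 5-second batch interval

lines = ssc.socketTextStream("localhost", 9999)  # hypothetical source

counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```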

Spark

Deploying a log monitor with Spark Structured Streaming

In this article I am going to cover how to create a simple streaming log monitor using Spark Structured Streaming. In subsequent posts, I’ll go deeper on some of these aspects. The first thing we need to do, as always, is to create a Spark Session. There isn’t much new here, except that we import the […]
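As a sketch of that first step (the schema and directory path below are illustrative assumptions, not the article’s values), a Structured Streaming session watching a log directory might look like this:

```python
# Create a SparkSession and a streaming reader over a directory of JSON logs.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("log_monitor").getOrCreate()

log_schema = StructType([
    StructField("timestamp", TimestampType()),
    StructField("level", StringType()),
    StructField("message", StringType()),
])

# New files landing in the directory are treated as an unbounded stream.
logs = (spark.readStream
             .schema(log_schema)
             .json("/var/logs/app/"))  # hypothetical log directory

errors = logs.filter(logs.level == "ERROR")

query = errors.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```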

Spark

Handling corrupted input files during ETL

Often, we develop ETL pipelines and they just work. They do exactly what we expect them to do and produce the correct outputs. One day, though, the script stops working and the error points to a corrupted file. That’s pretty annoying – it took you away from doing something you enjoy and now […]
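Two common guards for this situation (the article may take a different approach; the schema and path below are assumptions) are Spark’s ignoreCorruptFiles setting and PERMISSIVE read mode with a corrupt-record column:

```python
# Sketch: tolerate unreadable files and capture malformed rows for inspection.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("corrupt_files").getOrCreate()

# Skip files that cannot be read at all, instead of failing the whole job.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

# For malformed rows inside otherwise-readable JSON, keep the raw record
# in a dedicated column so it can be inspected later.
df = (spark.read
           .option("mode", "PERMISSIVE")
           .option("columnNameOfCorruptRecord", "_corrupt_record")
           .schema("id INT, value STRING, _corrupt_record STRING")  # assumed schema
           .json("/data/landing/events/"))  # hypothetical path

df.cache()  # cache before querying the corrupt-record column
df.filter(df["_corrupt_record"].isNotNull()).show(truncate=False)
```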

Spark

Melting our Spark dataframes with Koalas

Melting dataframes is the process of taking a short, wide table and making it into a long, thin one, using column headings as categorical data within the resulting dataframe. In the example below, we have a dataframe which shows the total kilometres walked and cycled per person.

NAME    BIKEKM  WALKKM
Kieran  77      178
Bobby   79      […]
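A sketch of that melt using Koalas (shipped as pyspark.pandas since Spark 3.2); the sample value for the cut-off cell is made up:

```python
# Wide -> long melt with Koalas; sample data is illustrative.
import databricks.koalas as ks  # on Spark 3.2+: import pyspark.pandas as ps

kdf = ks.DataFrame({
    "NAME": ["Kieran", "Bobby"],
    "BIKEKM": [77, 79],
    "WALKKM": [178, 150],  # Bobby's WALKKM is an assumed placeholder value
})

# The column headings BIKEKM/WALKKM become categories in an ACTIVITY column.
melted = kdf.melt(id_vars="NAME",
                  value_vars=["BIKEKM", "WALKKM"],
                  var_name="ACTIVITY",
                  value_name="KM")

print(melted.sort_values(["NAME", "ACTIVITY"]))
```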

Spark

Testing our UDFs with unittest

Notebooks make it easy to test our code on the fly and check that the output dataframe looks correct. However, it is good practice to run some unit tests with edge cases – things you may not see very often and that may not be in your sample data – and it’s also important to check that […]
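A bare-bones unittest sketch for the plain Python function behind a UDF (clean_name here is a hypothetical example, not the post’s actual UDF), including a couple of edge cases:

```python
# unittest example for the Python function we would wrap in a UDF.
import unittest


def clean_name(s):
    """The logic behind a hypothetical UDF: trim and capitalise a name."""
    if s is None:
        return None
    return s.strip().capitalize()


class CleanNameTest(unittest.TestCase):
    def test_basic(self):
        self.assertEqual(clean_name(" kieran "), "Kieran")

    def test_edge_cases(self):
        # Edge cases that may not appear in the sample data.
        self.assertIsNone(clean_name(None))
        self.assertEqual(clean_name(""), "")


if __name__ == "__main__":
    unittest.main()
```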

Spark

Testing in PySpark (Part 1)

Testing is one of the most painful parts of data engineering, especially when you have a really huge dataset. But it is an absolutely necessary part of every project: without it, we can’t have complete confidence in our data. We can write unit tests using libraries like pytest, which I will cover in […]
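As a taste of what that can look like (this structure is an assumption, not taken from the series), a pytest setup typically provides a shared SparkSession fixture for dataframe tests:

```python
# pytest sketch: a module-scoped SparkSession fixture plus one dataframe test.
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="module")
def spark():
    session = (SparkSession.builder
               .master("local[1]")
               .appName("tests")
               .getOrCreate())
    yield session
    session.stop()


def test_filter_keeps_only_positive_values(spark):
    df = spark.createDataFrame([(1,), (-2,), (3,)], ["value"])
    result = df.filter(df.value > 0).collect()
    assert [row.value for row in result] == [1, 3]
```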

Spark

Data Cleaning in PySpark

Most of our time as data engineers is spent cleaning up data. We have to deal with null values, duplicated rows, weird date formats, inconsistently capitalised words, accents (like the tilde in Spanish words), outliers and other nasty stuff. So, in this post we’re going to cover the below: handling null values, handling duplicated data, […]
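A quick sketch of a few of those steps (the column names, date format and data are placeholders): dropping nulls and duplicates, normalising capitalisation, and parsing an awkward date format:

```python
# Illustrative cleaning steps on a small made-up dataframe.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cleaning").getOrCreate()

df = spark.createDataFrame(
    [("KIERAN", "01/03/2021", 77),
     ("Kieran", "01/03/2021", 77),
     (None, "02/03/2021", 79)],
    ["name", "event_date", "km"],
)

cleaned = (df
    .dropna(subset=["name"])                                          # null values
    .withColumn("name", F.initcap("name"))                            # capitalisation
    .withColumn("event_date", F.to_date("event_date", "dd/MM/yyyy"))  # date format
    .dropDuplicates())                                                # duplicated rows

cleaned.show()
```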
