Data Cleaning in PySpark

Most of our time as data engineers is spent cleaning up data. We have to deal with null values, duplicated rows, weird date formats, inconsistently capitalised words, accents (like the Spanish tilde), outliers and other nasty stuff. So, in this post we're going to cover the following:

  1. Handling null values
  2. Handling duplicated data
  3. Transforming date formats
  4. Dealing with upper / lower case issues
  5. Handling outliers

Handling Null Values

Let's look at dealing with null values first. Here is my original dataframe. As you can see, I have a null value in latitude, two null values in job title and a null value in city. Let's look at some ways to handle these.
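
For illustration, a stand-in dataframe along those lines might look like the below (the column names id, job_title, city and latitude, and the values themselves, are my own assumptions, not the original data):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("data-cleaning").getOrCreate()

    # Illustrative data only: rows 3, 5 and 7 carry the nulls described above.
    df = spark.createDataFrame(
        [
            (1, "Data Engineer", "London", 51.5074),
            (2, "Data Scientist", "Bristol", 51.4545),
            (3, None, "Manchester", 53.4808),
            (5, "Data Analyst", None, 52.4862),
            (7, None, "Leeds", None),
        ],
        ["id", "job_title", "city", "latitude"],
    )
    df.show()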

We could just use na.drop(), which drops the entire row where a null value is present. So you can see, IDs 3, 5 and 7 are no longer present in our dataset.
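
On the sample dataframe above, that's simply:

    # Drop every row that has a null in any column.
    df.na.drop().show()  # IDs 3, 5 and 7 disappear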

That's quite useful, sometimes. But we can handle the nulls in a different way too. We can, for example, target specific fields and say 'drop all rows where job title is empty'. This would drop rows 3 and 7, but row 5 would remain, as below.

You can separate lots of fields with commas, should you wish to run this logic on multiple fields within the dataframe.
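A sketch of both variants, sticking with the assumed column names from earlier:

    # Drop rows only where job_title is null (rows 3 and 7 here).
    df.na.drop(subset=["job_title"]).show()

    # Or list several fields: a row is dropped if any of them is null.
    df.na.drop(subset=["job_title", "city"]).show()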

You could even use the filter function to achieve the same thing. In the below, we're filtering the dataframe to return only the rows where job title is NOT null.
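
Something like this, assuming the same job_title column:

    from pyspark.sql import functions as F

    # Keep only the rows where job_title is NOT null.
    df.filter(F.col("job_title").isNotNull()).show()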

What about replacing null values?

We can do this in Spark nice and easily too. Think about your numeric fields for a second: wouldn't dropping the rows skew your analysis, especially if you need to drop a lot of them? It might be better to populate them with the median or mean of the column, or even zero; zero might be OK.

In the below, I fill all null values in the latitude column with zero. Again, we can create a list of fields in the subset definition & apply the same logic to many fields.
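A sketch of both a zero fill and a mean fill, on the assumed latitude column:

    from pyspark.sql import functions as F

    # Fill nulls in latitude with zero; subset can take a list of columns.
    df_zero = df.na.fill(0, subset=["latitude"])

    # Or fill with the column mean instead, which skews the average less.
    mean_lat = df.select(F.mean("latitude")).first()[0]
    df_mean = df.na.fill({"latitude": mean_lat})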

Another way I like to handle this is as below. Here, I have created a new column, where I've said: if the city is null, populate it with 'unknown'; otherwise, leave it as it is. This gives us a new field to play with and we can drop the old one out of the dataframe. It's really up to you which approach you prefer.
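
That when/otherwise approach looks roughly like this:

    from pyspark.sql import functions as F

    # New column: 'unknown' where city is null, otherwise the original value.
    # The old column can then be dropped.
    df_city = df.withColumn(
        "city_clean",
        F.when(F.col("city").isNull(), F.lit("unknown")).otherwise(F.col("city")),
    ).drop("city")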

What about duplicate data?

Duplicate data can be a pain but, again, it's quite simple with Spark. We can simply run dropDuplicates() on the dataframe to remove rows that are identical to another row in the dataset.
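
That's a one-liner:

    # Remove rows that are identical to another row across every column.
    df_deduped = df.dropDuplicates()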

We can also use the subset argument: dropDuplicates(subset=['fieldname']). The problem here is that it doesn't always keep the first occurrence and drop the rest, so if that's important to you, you need to work around it. One such method is this:

  1. Create a dataframe
  2. Order by the timestamp field
  3. Coalesce the dataframe to a single partition with .coalesce(1)
  4. Drop duplicates based on your subset column

As above, once it's been ordered in a single partition, it does seem to keep the first value and drop the rest.
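
A sketch of that workaround, assuming a hypothetical events dataframe with an id column and an event_time timestamp:

    from pyspark.sql import functions as F

    # Order by the timestamp, pull everything into one partition, then drop
    # duplicates on the subset column; in a single ordered partition the row
    # that survives tends to be the earliest one per id.
    first_per_id = (
        events.orderBy(F.col("event_time").asc())
        .coalesce(1)
        .dropDuplicates(subset=["id"])
    )

Bear in mind that coalescing to one partition is only really practical on smallish data; for larger datasets a window function with row_number is the more robust route.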

What if I have a funky date format?

In the below, we have dates in Month/Day/Year format, which in the UK is a weird date format; we use Day/Month/Year. I could use a load of string functions to split the string and re-order it, but that would be pretty painful.

In the below, I have written a UDF which will convert from the original date format into the desired one.
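
A minimal sketch of such a UDF, assuming a hypothetical date_us string column in MM/dd/yyyy format on a dataframe called df_raw:

    from datetime import datetime

    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    @F.udf(returnType=StringType())
    def us_to_uk_date(value):
        # Turn 'MM/dd/yyyy' strings into 'dd/MM/yyyy' strings, passing nulls through.
        if value is None:
            return None
        return datetime.strptime(value, "%m/%d/%Y").strftime("%d/%m/%Y")

    df_dates = df_raw.withColumn("date_uk", us_to_uk_date(F.col("date_us")))

Worth noting that the built-in to_date and date_format functions can do the same job without a UDF, which avoids the Python serialisation overhead, but the UDF route is easy to read and extend.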

Handling upper/lower case inconsistencies

Sometimes, we have issues running functions on fields because the casing differs from row to row. So, we can use the upper or lower functions to standardize.
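
For example, lower-casing the assumed job_title column before grouping or joining on it:

    from pyspark.sql import functions as F

    # Standardise casing so 'Data Engineer' and 'DATA ENGINEER' compare equal.
    df_std = df.withColumn("job_title", F.lower(F.col("job_title")))  # or F.upper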

Removing Outliers

I often run into the issue where a particular system has had a funny five minutes and I get an erroneous reading. For example, CPU might show as 5,087,265% for a few seconds, before returning to normal. Clearly that's not correct and, if I ran an average over a whole day, it would skew it massively.

The best way I have found to deal with outliers is to take only data within X standard deviations of the mean, or to use a moving average over X periods to flatten out your dataset, if the nuances aren't of importance.
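
A sketch of both ideas, assuming a hypothetical metrics dataframe with a cpu_pct reading and a reading_time timestamp:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # Keep only readings within 3 standard deviations of the mean (3 is arbitrary).
    stats = metrics.select(
        F.mean("cpu_pct").alias("mu"),
        F.stddev("cpu_pct").alias("sigma"),
    ).first()
    lower = stats["mu"] - 3 * stats["sigma"]
    upper = stats["mu"] + 3 * stats["sigma"]
    metrics_clean = metrics.filter(F.col("cpu_pct").between(lower, upper))

    # Or smooth with a moving average over the current and previous 4 readings.
    w = Window.orderBy("reading_time").rowsBetween(-4, 0)
    metrics_smooth = metrics.withColumn("cpu_ma", F.avg("cpu_pct").over(w))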

Kodey