Testing is one of the most painful parts of data engineering, especially when you are working with a really huge dataset. But it is an absolutely necessary part of every project: without it, we can’t have confidence in our data.
We can write unit tests using libraries like pytest, which I will cover in part two of this series. In this post, though, I am going to talk about some ways we can test a script’s output against another source.
So let’s imagine we have a Spark script that reads files from HDFS, and that the same data is also available in Hive. We could write our script in Spark and validate its output against the output of a Spark SQL script. The idea here is that if two scripts written in different languages (here, Spark and Spark SQL) produce matching results, your script is likely correct, at least as far as the logic you understood and implemented.
The first check is whether the row counts of the two dataframes are equal. In this example, both dataframes have the same number of rows, so things are looking good so far!
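As a minimal sketch of the row count check, assuming df1 and df2 are the two Spark dataframes being compared (the helper name row_counts_match is made up for illustration):

```python
def row_counts_match(df1, df2):
    """Return (count1, count2, equal) for two Spark dataframes.

    Hypothetical helper: df1 and df2 are assumed to be pyspark.sql.DataFrame
    objects, for example the script output and the Spark SQL output.
    """
    # count() is a Spark action, so each call triggers a job over the data.
    count1 = df1.count()
    count2 = df2.count()
    return count1, count2, count1 == count2


# Example usage, assuming df1 and df2 already exist:
# count1, count2, equal = row_counts_match(df1, df2)
# print(f"df1: {count1} rows, df2: {count2} rows, match: {equal}")
```

Returning both counts, rather than only True or False, makes it easier to see how far apart the two outputs are when the check fails.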
Next, let’s use the subtract function. It compares two dataframes and returns the rows that exist in the first dataframe but not in the second. In this example, there are rows that exist in df1 and don’t exist in df2. So while both dataframes have the same number of rows, the rows themselves aren’t the same.
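A rough sketch of the subtract check might look like the following. It assumes df1 and df2 are Spark dataframes with the same columns, and the helper name rows_only_in_each is hypothetical. Because subtract only looks in one direction, the sketch runs it both ways.

```python
def rows_only_in_each(df1, df2):
    """Return (only_in_df1, only_in_df2) as two Spark dataframes.

    Hypothetical helper: df1 and df2 are assumed to be pyspark.sql.DataFrame
    objects with the same columns in the same order.
    """
    # df1.subtract(df2) keeps the rows of df1 that do not appear in df2.
    only_in_df1 = df1.subtract(df2)
    # Swap the operands to catch rows that exist only in df2.
    only_in_df2 = df2.subtract(df1)
    return only_in_df1, only_in_df2


# Example usage, assuming df1 and df2 already exist:
# only_in_df1, only_in_df2 = rows_only_in_each(df1, df2)
# only_in_df1.show()
# only_in_df2.show()
```

One thing to keep in mind is that subtract works like EXCEPT DISTINCT in SQL, so duplicate rows are ignored. If duplicates matter for your data, exceptAll (available from Spark 2.4) may be the better fit.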
We can then check whether the column names and data types are the same. To do that, we convert each Spark dataframe to a pandas dataframe and use the pd.testing.assert_frame_equal function to determine whether the two are the same. In this example, the error shows that the two dataframes have different column names, so they are not equal.
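Here is a small sketch of the pandas comparison. To keep it runnable without a Spark session, it uses two tiny pandas dataframes as stand-ins for df1.toPandas() and df2.toPandas(); the sample data, column names and the helper name pandas_frames_equal are all made up for illustration.

```python
import pandas as pd


def pandas_frames_equal(pdf1, pdf2):
    """Return (equal, message) for two pandas dataframes (hypothetical helper)."""
    try:
        # With default settings this compares column names, dtypes and values,
        # and raises AssertionError on the first mismatch it finds.
        pd.testing.assert_frame_equal(pdf1, pdf2)
    except AssertionError as err:
        return False, str(err)
    return True, ""


# Stand-ins for df1.toPandas() and df2.toPandas(); note the differing column name.
pdf1 = pd.DataFrame({"id": [1, 2], "amount": [10.0, 20.0]})
pdf2 = pd.DataFrame({"id": [1, 2], "total": [10.0, 20.0]})

equal, message = pandas_frames_equal(pdf1, pdf2)
print(equal)
print(message)
```

Two caveats: toPandas() collects the whole dataframe onto the driver, so on a really huge dataset you may only be able to do this on a sample or an aggregate. Also, assert_frame_equal compares rows by position, and Spark does not guarantee row order, so it is probably worth sorting both dataframes by a key before converting them.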
Finally, we could call df.describe() on each of the dataframes. Remember, this gives us summary statistics such as count, mean and standard deviation. If the stats are identical, you most likely have the same dataset.
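One possible way to compare the describe() output programmatically is sketched below. It assumes df1 and df2 are Spark dataframes; the helper name summary_stats_match is hypothetical. Since the describe() result is only a handful of rows, converting it to pandas should be cheap even when the underlying data is large.

```python
def summary_stats_match(df1, df2):
    """Return True if describe() gives identical results for both dataframes.

    Hypothetical helper: df1 and df2 are assumed to be pyspark.sql.DataFrame
    objects.
    """
    # describe() returns a small Spark dataframe of summary statistics,
    # so collecting it to pandas is inexpensive.
    stats1 = df1.describe().toPandas()
    stats2 = df2.describe().toPandas()
    # DataFrame.equals is True only when shape, labels and values all match.
    return stats1.equals(stats2)


# Example usage, assuming df1 and df2 already exist:
# print(summary_stats_match(df1, df2))
```

Matching statistics are a good sign rather than proof, since two different datasets can share the same count, mean and standard deviation, so this works best alongside the other checks.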
So there we have it: part 1 of my testing approach. These checks should help you tell whether your actual output matches your expected output.
In my next post, I’ll talk about pytest, a unit testing library for Python.