Notebooks make it easy to test our code on the fly and check that the output dataframe looks correct. However, it is good practice to also run some unit tests that cover edge cases: things you may not see very often and that may not be in your sample data. It’s also important to check that invalid data formats raise appropriate errors.
Below we have my simple UDF. It converts dates from MM/DD/YYYY to DD/MM/YY format. Looking at the output dataframe, I think it’s all working perfectly. Let’s test whether that is the case.
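As a rough sketch, a first version of such a function could look like the following. Only the name dateswap and the two date formats come from this post; the string-slicing approach is an assumption, chosen because it is a common first attempt and does no validation of its input.

```python
def dateswap(date_str):
    """Convert an MM/DD/YYYY string to DD/MM/YY using plain string slicing.

    Assumed, illustrative implementation: it never checks that the input
    actually looks like a date.
    """
    month = date_str[0:2]
    day = date_str[3:5]
    short_year = date_str[8:10]  # last two digits of the four-digit year
    return f"{day}/{month}/{short_year}"


# In PySpark the plain function could then be wrapped as a UDF, for example:
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import StringType
#   dateswap_udf = udf(dateswap, StringType())

print(dateswap("12/25/2020"))  # 25/12/20
```

Keeping the logic in a plain Python function means it can be unit tested without starting a Spark session.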
Below, I have written two test cases. The first runs three date values through the function, each with an input value and the corresponding expected output value. For each line, it asks ‘if I pass this date into the function, is this the date I get back?’ If it matches, your test passes; if it doesn’t match, you’ll get a failure.
The second test is about error handling. You can see that I am testing that the function raises a ValueError and a TypeError when it is given bad input.
- A TypeError occurs when the function is applied to a value of the wrong type. For example, in the screenshot below, the function is expecting a string value; if you pass it an integer, it will give you a TypeError. Another example would be running len(35), which fails with a TypeError because an integer has no length (len works on strings, lists and other sized collections).
- A ValueError occurs when the operation receives a value of the correct type but with an inappropriate value. For example, int('12') would work with no issue but int('cat') would fail: although the data type is correct (a string), the value is not valid.
With all that in mind, you can see that I got an ‘AssertionError: ValueError not raised by dateswap’. So, indeed, I have not handled ValueErrors at all in my script. Let’s go back and fix that.
I have now gone back and fixed my function. Without this unit testing, I might never have noticed that I don’t handle errors gracefully. I saw that the function worked and may not have worried about error handling. Unit testing forced me to look into it.
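One possible way to make such a fix is sketched below. It assumes that parsing with datetime.strptime is acceptable, so that strings which are not valid MM/DD/YYYY dates raise a ValueError; the explicit type check and its message are illustrative choices, not necessarily what the notebook does.

```python
from datetime import datetime


def dateswap(date_str):
    """Convert an MM/DD/YYYY string to DD/MM/YY, rejecting bad input."""
    if not isinstance(date_str, str):
        # Wrong type, e.g. an integer: fail loudly with a clear message.
        raise TypeError(f"expected a string, got {type(date_str).__name__}")
    # strptime raises ValueError if the string is not a valid MM/DD/YYYY date.
    parsed = datetime.strptime(date_str, "%m/%d/%Y")
    return parsed.strftime("%d/%m/%y")


print(dateswap("12/25/2020"))  # 25/12/20
```

Parsing the string as a real date also catches values such as a month of 13, which simple slicing would pass straight through.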
There are plenty of methods in the unittest library, but there are three which I use more than any other:
- assertEqual: The expectation is that the two values passed to the assertion should match (e.g. expected output is equal to actual output).
- assertNotEqual: The expectation is that the two values passed to the assertion should NOT match (e.g. expected output is NOT equal to actual output).
- assertRaises: The expectation is that a given exception is raised in certain circumstances (e.g. invalid input raises a ValueError).
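As a small illustration of these three methods, here is a sketch that uses only Python built-ins. The class and test names are made up for the example.

```python
import unittest


class TestAssertMethods(unittest.TestCase):
    def test_assert_equal(self):
        # Passes because int("12") evaluates to 12.
        self.assertEqual(int("12"), 12)

    def test_assert_not_equal(self):
        # Passes because the string "12" is not equal to the integer 12.
        self.assertNotEqual("12", 12)

    def test_assert_raises(self):
        # Callable form: the exception type, then the function and its arguments.
        self.assertRaises(ValueError, int, "cat")
        # Context-manager form: the code under test goes inside the with block.
        with self.assertRaises(TypeError):
            len(35)


# Run the tests without exiting the interpreter, which suits a notebook cell.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestAssertMethods)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```

Both forms of assertRaises do the same job; the context-manager form tends to read better when the call under test spans more than one line.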
So there we have it, a little introduction to test cases in Python and PySpark – specifically looking at our UDF for converting date formats.