Okay, so the first thing I should note is that you should avoid UDFs (User Defined Functions) like the plague unless you absolutely have to; I have spoken about why that is here.

Now that I’ve warned you, let’s talk about what a UDF is. A UDF is a User Defined Function: a function coded entirely by the user, rather than one of the out-of-the-box functions available in PySpark. It gives you ultimate flexibility in what you do with the data, but it comes at the cost of performance, because every row has to be shipped between the JVM and a Python worker, and Spark’s optimizer can’t see inside your function. Sometimes, though, there just isn’t a good way to achieve your desired outcome without breaking out a good old UDF, so the performance hit is a necessary evil.

Let’s go through an example. I have defined a function called do_they_live_here. I define this just like I would any Python function, with the function inputs contained within the parentheses; in this case, the inputs are city and hometown. Within the function, I have a simple if statement that determines whether they live in Tokyo, are visiting Tokyo, or are somewhere else in the world.
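A minimal sketch of what that function might look like. The exact conditions and the strings returned are my assumptions for illustration; all that matters is that it takes city and hometown and sorts each row into one of the three cases.

```python
def do_they_live_here(city, hometown):
    """Classify a person by their current city and their hometown.

    Assumed logic (not confirmed by the original example):
    - in Tokyo and from Tokyo      -> they live there
    - in Tokyo but from elsewhere  -> they are visiting
    - anything else                -> they are somewhere else in the world
    """
    if city == "Tokyo" and hometown == "Tokyo":
        return "lives in Tokyo"
    elif city == "Tokyo":
        return "visiting Tokyo"
    else:
        return "elsewhere"
```

At this point it is plain Python, so you can call it directly, for example do_they_live_here("Tokyo", "Osaka"), to check the logic before Spark ever gets involved.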

Great! We have our function. Now let’s register our UDF. This step is absolutely required: we have to tell Spark that we’re creating a UDF called ‘myudf’. That UDF uses the do_they_live_here function and defines the return type as string.

Finally, I am going to create a new column in my dataframe, which holds the result of running myudf with the ‘city’ and ‘hometown’ fields of df3 passed in.

The flow is as below. We pass our values into the UDF, which passes them into the Python function we defined, which then returns a value for us to use.

City + Hometown –> UDF –> Python function –> return value

So that’s it. A quick guide on using UDFs in Spark. This is a super simple way to add complex data cleansing or other complex logic to your Spark scripts.