Welcome to this weeks Starter Sunday – which is where I cover off some basic concepts of Spark, Python or Hive.

Today, we’re looking at selecting and filtering data from our dataframes in Spark, specifically, Pyspark.

Select specific columns

Below, we have a dataframe called df_select which will take just two columns from the dataframe called df.

Rename columns

Now, we have an example where we rename the first_name column to be simply fn. This can be super useful if you want naming to be consistent across all your systems & reports.

Filter where string =

Here, we create a new dataframe called df_filter. This is the result of filtering the dataframe called df, where the first_name column is exactly equal to Alvera.

Filter where field LIKE

The like filter is really useful too – it’s kind of like saying ‘return all records where the name includes ‘lver’, Here, we have % on either side of the string we’re looking for. That means, anything can come before or after, just so long as the string includes ‘lver’. We could say ‘lver%’ , which would mean, the word must start with ‘lver’ but it can end with anything.

Filter where field ends with

In the below example, we filter the dataframe to show all records where the first_name field ends with ‘din’. This is just the same as using like(‘%din’) – because the percent sign signifies that the word can start with anything but must end with ‘din’. We then, below, have the same concept for startswith.

Filter where field value is between X and Y

Below we are filtering the dataframe called df to return only records where the ID is between the values of 1 and 5.

Filter where field is equal to an item in the list

Here, we say that we want to return all records where the first_name is in the list of names we provided. In this case, return all records where the first name is either Aldin or Valma.

Multiple filters

Below we have two examples of applying multiple filters. To use OR we use the pipe (|) and and is signified by &.