A crash course in Pandas

Ever since I launched my crash course in Pandas video course, I’ve been asked whether I have a quick reference guide, so people don’t have to re-watch videos to extract the line of code they need. So, here we are!

In this article, we are going to cover a few of the major tasks we need to perform:

  1. Import CSV files into Pandas
  2. Rename columns and drop columns we don’t need
  3. Create new columns to provide additional insight using functions
  4. Apply filters to the dataset with multiple conditions
  5. Save the dataset back as a new CSV

So, the dataset. Here are the fields I will be working with:

  • First_Name
  • Last_Name
  • Email
  • Gender
  • Salary
  • IP_Address

IMPORT DATA AND CHANGE COLUMN NAMES

The first thing we want to do is bring the CSV into Pandas, which is super simple: we just specify the location of our CSV. I’ve provided two options, one where the first row of the file is the column header and one where it is not.

import pandas as pd
#The first row of the file is the column header
df = pd.read_csv('/home/Datasets/pandascc.csv')
#The first row of the file is not column header
df = pd.read_csv('/home/Datasets/pandascc.csv', header=None)
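
If your file has no header row, you can also name the columns as part of the read itself, via the names parameter (a quick sketch, assuming the column order of our file):

#Supply column names up front when the file has no header row
df = pd.read_csv('/home/Datasets/pandascc.csv', header=None,
                 names=['id', 'first_name', 'last_name', 'email', 'gender', 'ip_address', 'salary'])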

Now, whether or not you have already imported headers as part of the CSV read, you may need to set or alter column names. First off, you can check the current column names, as shown below:

#First, check what the current names are
print(df.columns)
OUT: Index(['id', 'first_name', 'last_name', 'email', 'gender', 'ip_address',
       'salary'],
      dtype='object')

Next, you can assign new names:

df.columns = ['id', 'first', 'last', 'email', 'gender', 'ip', 'salary']
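
Alternatively, if you only want to rename a couple of columns rather than the whole set, df.rename takes a mapping of old names to new ones (a sketch covering part of the same change):

#Rename selected columns via a mapping of old name to new name
df = df.rename(columns={'first_name': 'first', 'ip_address': 'ip'})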

DROPPING A COLUMN

For the purposes of our analysis, we don’t need the IP field. So we can drop that using the below syntax:

#axis=1 tells drop we mean a column, not a row
df = df.drop('ip', axis=1)
df

You could also drop the column in place, without having to reassign df:

df.drop('ip', axis=1, inplace=True)

You could even simply select only the fields you actually want:

df = df[['first', 'last', 'email', 'gender', 'salary']]
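
And if you have several columns to remove, drop also accepts a list, so they can all go in one call (a sketch, as an alternative to the single-column drops above):

#Drop several columns in a single call
df = df.drop(columns=['ip', 'id'])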

CLEANING UP DATA

In our data, we have a salary column. Unfortunately, the salary column includes the currency symbol ‘$’, which means we can’t simply sum the column to get a total value. So, we need to clean it up first.

This is super simple. We create a new field in the dataframe called ‘tidy_salary’, whose contents are the current salary field with the dollar sign replaced by an empty string, converted to a float so we can actually do maths on it.

#regex=False treats '$' as a literal character rather than a regex anchor
df['tidy_salary'] = df['salary'].str.replace('$', '', regex=False).astype(float)
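
As a quick sanity check that the cleanup worked (assuming the line above ran), the column should now sum without complaint:

print(df['tidy_salary'].sum())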

Next, I might want to clean up my email addresses. I want to do some analysis to see whether a particular email provider (e.g. Gmail, Hotmail) has an impact on salary. This is again quite a simple task. I’ll show you how to tackle it in two ways.

The first is nice and neat. We make a new field called ‘email_provider’, which is the result of splitting the email field at the ‘@’ sign and taking the second chunk (index 1).

df['email_provider'] = df['email'].str.split('@').str[1]

I’m also going to show you a less efficient way of doing it, as this approach will carry you through plenty of more complex analyses.

In the below we define a function called email. Within the function, we extract the email column from the dataframe row, split the string at the ‘@’ sign, take the second chunk (index 1) and return the output. On the final line we apply it across the rows, creating a new field called ‘emailprov’ which is populated with the result of the email function.

As I say, it’s not really needed here. But if you wanted to work out something much more complex which was not possible with normal Pandas syntax, you would need an apply function, so this is an example of using one.

def email(row):
    email = row['email']
    email = email.split('@')[1]
    return email
df['emailprov'] = df.apply(email, axis=1)
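
Having built the same value two ways, we can quickly confirm the two columns agree (a small check, assuming both snippets above were run):

print((df['email_provider'] == df['emailprov']).all())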

FILTERING DATA

Now, we have our nice shiny dataframe. I want to apply some filters. Initially, I’d like to see all the rows related to customers that are Female. This is as simple as the below. We make a new dataframe, which I have called newdf; this is the result of filtering the current dataframe where the gender field is equal to ‘Female’.

newdf = df[(df.gender == "Female")]

We may also want to apply multiple filters. In the below, we are filtering where the gender is equal to Female and the emailprov field is equal to example.com. Note that each condition sits in its own set of parentheses, which Pandas needs in order to combine them.

newdf = df[(df.gender == "Female") & (df.emailprov == "example.com")]

If we want to use an OR instead of an AND, we simply swap the ‘&’ for a pipe ‘|’:

newdf = df[(df.gender == "Female") | (df.emailprov == "example.com")]
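
One more filtering trick worth knowing: a tilde ‘~’ negates a condition, so all rows that are not Female would be (a quick sketch):

newdf = df[~(df.gender == "Female")]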

AGGREGATING DATA

Now let’s look at some aggregation! I want to work out the mean and total salary per gender in the dataset. To do that, I pass in a dictionary mapping the column to the list of aggregation functions I want to use, as below. I pass the function names as strings, as recent versions of Pandas deprecate passing the NumPy functions directly.

df.groupby("gender").agg({"tidy_salary": ["mean", "sum"]})

If I want to do that for multiple fields, I can:

#Duplicate the salary column purely to demonstrate aggregating multiple fields
df['sss'] = df['tidy_salary']
df.groupby("gender").agg({"tidy_salary": ["mean", "sum"], "sss": ["mean", "sum"]})

Now, if you already know SQL, you may think it’s a bit of a pain learning a new syntax. But never fear! You can install the pandasql library, which lets you run SQL directly against your dataframes.

from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
pysqldf("SELECT * FROM df LIMIT 10;").head()

So our aggregation query above would be:

from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
pysqldf("SELECT gender, sum(salary) as sum_salary, avg(salary) as avg_salary, sum(sss) as sum_sss, avg(sss) as avg_sss FROM df LIMIT 10;").head()

WRITING TO A FILE

Finally, let’s write our output back to a file. Passing index=False stops Pandas from writing the row index out as an extra column.

df.to_csv('filepath.csv', index=False)
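
If you want to double-check the output, you can simply read the file straight back in (reusing the same path as above):

pd.read_csv('filepath.csv').head()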

I hope you’ve found this useful!
