Speeding up Pandas apply functions using Swifter

Pandas is an excellent library for data analytics. However, when you get to work with really huge datasets, it just can't hack it: the Pandas apply function runs on a single core, which constrains your computational efficiency. You'd usually end up playing around with multiprocessing and Dask to try to optimise execution time, which, as data engineers know, is not a very enjoyable task.

In this article, I want to talk about a library called Swifter. Swifter automatically evaluates your function and executes it using the approach it judges best: should it run as a plain Pandas apply? Should it use Dask? Should it be vectorized?
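To make that concrete, you can imagine the decision roughly like the sketch below. This is purely illustrative (the function name and thresholds are mine, not Swifter's actual implementation): try a vectorized form on a sample first, then time a plain apply on the sample to estimate whether parallelism would pay off.

```python
import time

import pandas as pd

def choose_strategy(df, func, vectorized=None, sample_size=1000):
    """Illustrative sketch only: roughly the kind of decision Swifter
    automates for you. Not Swifter's actual code."""
    sample = df.head(sample_size)

    # 1. If a vectorized form of the function works, it is almost always fastest.
    if vectorized is not None:
        try:
            vectorized(sample)
            return "vectorized"
        except Exception:
            pass

    # 2. Time a plain apply on the sample and extrapolate to the full frame.
    start = time.perf_counter()
    sample.apply(func, axis=1)
    estimated = (time.perf_counter() - start) / max(len(sample), 1) * len(df)

    # 3. Only reach for Dask when the estimated runtime beats its overhead.
    return "dask" if estimated > 1.0 else "pandas apply"
```

The point of the sketch is that the choice depends on timings measured against your actual data, which is exactly why it is tedious to do by hand.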

Using the tool is simple. First, write your Pandas apply function as you always would. Below is a deliberately trivial function that I have written for the purposes of this test. It takes some data from a Pandas DataFrame and does some calculations on the field values.

from time import gmtime, strftime
import time

import pandas as pd

# Stand-in data so the example runs end to end; replace with your own DataFrame.
df = pd.DataFrame({
    'file_type': ['mp4', 'mp3', 'mp4'],
    'dlbytes': [100, 200, 300],
    'ulbytes': [10, 20, 30],
})

print(strftime("%Y-%m-%d %H:%M:%S", gmtime()))

def cleanup(row):
    file_type = row['file_type']
    dlbytes = row['dlbytes']
    ulbytes = row['ulbytes']

    if file_type == 'mp4':
        dom = (dlbytes ** 2) + (ulbytes ** 2)
    else:
        dom = (dlbytes ** 2) + (ulbytes ** 8)
    time.sleep(1)  # artificial delay so each row is measurably slow
    return dom

df['cleandom'] = df.apply(cleanup, axis=1)

print(strftime("%Y-%m-%d %H:%M:%S", gmtime()))

Now, we simply change df.apply to df.swifter.apply (as below) to make use of the Swifter library.

import swifter

print(strftime("%Y-%m-%d %H:%M:%S", gmtime()))

def cleanup(row):
    file_type = row['file_type']
    dlbytes = row['dlbytes']
    ulbytes = row['ulbytes']

    if file_type == 'mp4':
        dom = (dlbytes ** 2) + (ulbytes ** 2)
    else:
        dom = (dlbytes ** 2) + (ulbytes ** 8)
    time.sleep(1)
    return dom

df['cleandom'] = df.swifter.apply(cleanup, axis=1)

print(strftime("%Y-%m-%d %H:%M:%S", gmtime()))

In this instance, we see the function executing via Dask: Swifter has decided that this is how the function will run most optimally.


I would note that this will not always be faster by default. Swifter's evaluation period, where it decides what to do with your data, is an overhead in itself, and parallel processing adds further overhead of its own. Where your data isn't really big enough to justify Dask, those overheads can lead to slower execution times. So, as with everything, it's a bit of trial and error. But with only one additional word included in your code, it's a worthwhile test to conduct.
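One cheap way to run that trial is to time the row-wise and vectorized versions yourself on a synthetic (or sampled) copy of your data before committing to an approach. The sketch below uses only Pandas and NumPy; the column names match the example above, but the data is made up:

```python
import time

import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset, matching the columns used above.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "file_type": rng.choice(["mp4", "mp3"], size=20_000),
    "dlbytes": rng.integers(1, 100, size=20_000),
    "ulbytes": rng.integers(1, 100, size=20_000),
})

def cleanup(row):
    # Same calculation as above, without the artificial sleep.
    if row["file_type"] == "mp4":
        return row["dlbytes"] ** 2 + row["ulbytes"] ** 2
    return row["dlbytes"] ** 2 + row["ulbytes"] ** 8

start = time.perf_counter()
applied = df.apply(cleanup, axis=1)
print(f"apply:      {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
vectorized = df["dlbytes"] ** 2 + df["ulbytes"] ** np.where(df["file_type"] == "mp4", 2, 8)
print(f"vectorized: {time.perf_counter() - start:.3f}s")
```

On a frame like this, the vectorized version is typically orders of magnitude faster than the row-wise apply, which is exactly the kind of gap Swifter tries to exploit on your behalf.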

In summary: Swifter will use a plain Pandas apply when it thinks that will be the fastest approach, try a vectorized form of the function where it can, and use Dask parallel processing for larger datasets. It decides, so the developer does not need to expend time finding the optimal approach.

You can find out more in the Swifter documentation.

Kodey