Pandas is an excellent library for data analytics. However, when you work with really huge datasets, it just can’t hack it: the Pandas apply function runs on a single core, which constrains your computational efficiency. You’d usually end up playing around with multiprocessing and Dask to try and optimise execution time, which is not a very enjoyable task for a data engineer.
Swifter is an open-source Python library that provides an efficient and easy way to process large amounts of data. It is designed to speed up pandas, one of the most popular libraries for data manipulation, by choosing the fastest available way to run apply-style operations.
The most notable feature of Swifter is its ability to accelerate pandas apply operations. In favourable cases, an apply run through Swifter can be up to 20 times faster than the regular pandas version, although the actual gain depends on the function and the size of the data.
Swifter has been developed with a focus on improving performance while staying close to the pandas interface. It plugs into pandas through a swifter accessor, so existing code needs only a small change to start using it.
Swifter also provides a few other features to enhance the user experience. It shows a progress bar while an apply runs, and it exposes settings on the same accessor, such as the number of partitions used for parallel work.
Speeding up Pandas apply functions
For the purposes of this article, we will look at the Pandas apply function. Swifter automatically evaluates your function and picks the right approach for it: should it run as a plain Pandas apply? Should it use Dask? Should it be vectorized?
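To make those three questions concrete, here is a rough sketch of how such a decision could be made. It is an illustration of the idea only, not Swifter's actual code: the helper name choose_strategy, the sample size and the one-second threshold are all assumptions made up for this example.

```python
import time

import numpy as np
import pandas as pd


def choose_strategy(df, func, sample_size=1000, threshold_seconds=1.0):
    """Hypothetical helper: guess which execution strategy suits func.

    Illustrative only; the name, sample size and threshold are assumptions.
    """
    # 1. Try calling the function on the whole frame at once. If that works
    #    and yields one value per row, the function is already vectorized.
    try:
        result = func(df)
        if isinstance(result, (pd.Series, np.ndarray)) and len(result) == len(df):
            return "vectorized"
    except Exception:
        pass  # not vectorizable as written, so fall through

    # 2. Time a plain row-wise apply on a small sample and extrapolate.
    sample = df.head(min(sample_size, len(df)))
    if len(sample) == 0:
        return "pandas"
    start = time.perf_counter()
    sample.apply(func, axis=1)
    elapsed = time.perf_counter() - start
    estimated_total = elapsed / len(sample) * len(df)

    # 3. Only pay the parallel overhead when the estimate is large enough.
    return "parallel" if estimated_total > threshold_seconds else "pandas"
```

A function that works on whole columns is reported as "vectorized", while a function with a row-level if statement fails the first check and is judged on its estimated run time instead.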
Using the tool is simple. First, write your Pandas apply function as you always do. Below is a pointless function that I have written for the purposes of this test. It takes some data from a Pandas DataFrame and does some calculations on the field values.
from time import gmtime, strftime
import time

print(strftime("%Y-%m-%d %H:%M:%S", gmtime()))

def cleanup(row):
    file_type = row['file_type']
    dlbytes = row['dlbytes']
    ulbytes = row['ulbytes']
    if file_type == 'mp4':
        dom = (dlbytes ** 2) + (ulbytes ** 2)
    else:
        dom = (dlbytes ** 2) + (ulbytes ** 8)
    time.sleep(1)
    return dom

df['cleandom'] = df.apply(cleanup, axis=1)
print(strftime("%Y-%m-%d %H:%M:%S", gmtime()))
Now, we simply change df.apply to df.swifter.apply (as below) to utilise the swifter library.
import swifter

print(strftime("%Y-%m-%d %H:%M:%S", gmtime()))

def cleanup(row):
    file_type = row['file_type']
    dlbytes = row['dlbytes']
    ulbytes = row['ulbytes']
    if file_type == 'mp4':
        dom = (dlbytes ** 2) + (ulbytes ** 2)
    else:
        dom = (dlbytes ** 2) + (ulbytes ** 8)
    time.sleep(1)
    return dom

df['cleandom'] = df.swifter.apply(cleanup, axis=1)
print(strftime("%Y-%m-%d %H:%M:%S", gmtime()))
In this instance, we see the function executing using Dask – this is how Swifter has decided the function will run most optimally.
I would note that this will not always be faster by default. There is an evaluation period where Swifter decides what it should do with your data, which is an overhead in addition to the overhead of parallel processing. Those overheads can lead to slower execution times where your data isn’t big enough to justify using Dask. So, as with everything, it’s a bit of trial and error. But with only one additional word in your code, it’s a worthwhile test to conduct.
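If you want to run that trial yourself, a small timing helper is enough. The sketch below is one way to do it, under some assumptions: time_call and row_total are names invented for this example, and the three-row DataFrame is made-up sample data. The swifter line is left as a comment so the snippet runs with pandas alone.

```python
import time

import pandas as pd


def time_call(label, fn):
    """Run fn once and return (label, elapsed_seconds, result)."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    return label, elapsed, result


def row_total(row):
    # Stand-in for your real per-row logic (hypothetical example function)
    return row["dlbytes"] ** 2 + row["ulbytes"] ** 2


# Made-up sample data; swap in your own DataFrame
df = pd.DataFrame({"dlbytes": [1.0, 2.0, 3.0], "ulbytes": [4.0, 5.0, 6.0]})

runs = [time_call("pandas apply", lambda: df.apply(row_total, axis=1))]

# With swifter installed, time the second candidate the same way:
# import swifter
# runs.append(time_call("swifter apply", lambda: df.swifter.apply(row_total, axis=1)))

for label, elapsed, _ in runs:
    print(f"{label}: {elapsed:.4f}s")
```

On a frame this small the plain apply will almost certainly win, which is exactly the point: run the comparison on data of a realistic size before deciding.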
In summary, Swifter will use a plain Pandas apply when it thinks that will be the fastest approach. For larger datasets, it will use Dask parallel processing. It decides, so the developer does not need to spend time finding the optimal approach.
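It is also worth remembering that the fastest option is often to avoid a row-wise apply altogether. As a sketch, the test function from earlier could be written by hand in vectorized form with numpy.where. The sample DataFrame and the name cleanup_vectorized are assumptions for this example, and the artificial time.sleep(1) delay is left out.

```python
import numpy as np
import pandas as pd

# Made-up sample data with the same columns as the test function expects
df = pd.DataFrame({
    "file_type": ["mp4", "jpg", "mp4"],
    "dlbytes": [2.0, 3.0, 4.0],
    "ulbytes": [1.0, 2.0, 3.0],
})


def cleanup_vectorized(frame):
    # Pick the exponent per row: 2 for mp4 files, 8 for everything else
    ul_power = np.where(frame["file_type"] == "mp4", 2, 8)
    # Whole-column arithmetic replaces the per-row if/else
    return frame["dlbytes"] ** 2 + frame["ulbytes"] ** ul_power


df["cleandom"] = cleanup_vectorized(df)
print(df)
```

Because the function now works on whole columns, there is no per-row Python call left to parallelise, so it should not need Dask at all.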
You can find out more here.