Koalas = Pandas simplicity + Spark’s scalability

Whenever you start poking around a dataset to see what you’ve got to play with, you probably write ‘import pandas as pd’ straight away. Why? Because Pandas is the gold standard in data analysis libraries: it’s simple, yet powerful.

The problem is that Pandas runs on a single machine and holds its data in memory, so it doesn’t scale. If you’re going to be running across big datasets, you’ll need to break out the big guns: Spark.

But what if you didn’t need to? What if, by changing one line of code, you could make your Pandas code scale as if it were a native Spark job? It’s not just a dream, it’s reality: we can use Koalas to make this happen.

How do I use Koalas?

First of all, it’s a simple pip install koalas to get the library installed and ready to use. Now, let’s look at some code. I’ve taken this directly from here as it’s a really nice simple example to demonstrate the concept.

So what does this code do? It’s a small, standard Pandas script that:

  1. Imports the Pandas library
  2. Generates a dataframe
  3. Renames the columns
  4. Does a calculation

import pandas as pd
df = pd.DataFrame({'x': [1, 2], 'y': [3, 4], 'z': [5, 6]})
# Rename columns
df.columns = ['x', 'y', 'other']
# Do some operations in place
df['calculated'] = df.x * df.y

So now, to use Koalas, we just need to make a minor change: we import the Koalas library and define the dataframe as ks.DataFrame rather than pd.DataFrame. That’s it!

import databricks.koalas as ks
df = ks.DataFrame({'x': [1, 2], 'y': [3, 4], 'z': [5, 6]})
# Rename columns
df.columns = ['x', 'y', 'other']
# Do some operations in place
df['calculated'] = df.x * df.y
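To make the point that only the library behind the dataframe changes, here is a minimal sketch that wraps the same steps in a function taking the library as an argument. The function name build_frame and the variable result are my own for illustration and are not part of Pandas or Koalas. It runs with plain Pandas; the Koalas call is left as a comment because it assumes Koalas is installed and a Spark session is available.

```python
import pandas as pd


def build_frame(lib):
    """Run the same steps with `lib` as the dataframe library (hypothetical helper)."""
    df = lib.DataFrame({'x': [1, 2], 'y': [3, 4], 'z': [5, 6]})
    # Rename columns
    df.columns = ['x', 'y', 'other']
    # Add a column computed from two existing ones
    df['calculated'] = df.x * df.y
    return df


# Runs locally with plain Pandas.
result = build_frame(pd)
print(result)

# Assumption: with Koalas installed and Spark available, the same function
# would be called as build_frame(ks) after `import databricks.koalas as ks`.
```

Nothing inside the function refers to Pandas or Koalas by name, which is the whole idea: the dataframe code stays the same and only the import differs.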

So why is this good?

Pandas is an amazing library. It has some really nice features and it’s very intuitive for the end user.

Quite often, we develop code locally to find some sort of insight from the data and then decide we need to scale it up. Normally, that would mean rewriting our code to use Spark dataframes, but that isn’t the case with Koalas. If we wrote our code in Pandas, why rewrite it? We obviously wrote it in Pandas because we are comfortable with that library.
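One way to keep a single script for both local work and the scaled-up run is to choose the import in one place. The sketch below is only an illustration under my own assumptions: the helper load_dataframe_library and its use_koalas flag are hypothetical, and the sample sales data is made up for the example. It falls back to Pandas when Koalas is not requested or not installed.

```python
import importlib


def load_dataframe_library(use_koalas):
    """Return the Koalas module if requested and installed, otherwise Pandas."""
    if use_koalas:
        try:
            return importlib.import_module('databricks.koalas')
        except ImportError:
            # Koalas is not installed here, so fall back to Pandas.
            pass
    return importlib.import_module('pandas')


# Local development: plain Pandas.
lib = load_dataframe_library(use_koalas=False)

# Made-up sample data, purely for illustration.
sales = lib.DataFrame({'units': [2, 5], 'price': [10.0, 4.0]})
sales['revenue'] = sales.units * sales.price
print(sales)
```

Flipping use_koalas to True on a cluster where Koalas is installed would be the only change to the script; the dataframe code below the import stays as it is.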

So between Python, Pandas and Koalas, we’re gradually working our way through the zoo.

