When we ingest data into Spark, it’s automatically partitioned and those partitions are distributed across different nodes. So, imagine we have brought some data into Spark; it might be split across partitions like this:
| Partition 1 | Partition 2 | Partition 3 |
| --- | --- | --- |
| 1, 7, 5, 9, 12 | 15, 21, 12, 17, 19 | 45, 32, 25, 99 |
When we work with data in Spark, we want to ensure we have the right number of partitions to achieve optimal parallelization and hence performance.
We have two key ways to tune the number of partitions ourselves (I have no idea why Spark doesn’t do this for you automatically): coalesce and repartition.
Note: a common rule of thumb is to have 2 to 3 times as many partitions as you have cores. E.g. if your cluster has 50 cores, aim for roughly 100 to 150 partitions.
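As a quick worked example of the 2 to 3 times rule of thumb, the helper below turns a core count into a target range. The function name is hypothetical and the multipliers are just the rule of thumb, so treat the result as a starting point for tuning rather than a fixed answer.

```python
from typing import Tuple


def recommended_partition_range(total_cores: int) -> Tuple[int, int]:
    """Return the (low, high) partition counts implied by the 2-3x rule of thumb."""
    if total_cores < 1:
        raise ValueError("total_cores must be at least 1")
    return 2 * total_cores, 3 * total_cores


low, high = recommended_partition_range(50)
print(f"50 cores -> between {low} and {high} partitions")
```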
The command to coalesce is simple: calling df.coalesce(2) coalesces the dataframe so that only two partitions remain.
So, we could say that the output would look like the below.
| Partition 1 | Partition 2 |
| --- | --- |
| 1, 7, 5, 9, 12, 15, 21, 12, 17, 19 | 45, 32, 25, 99 |
When we coalesce, data is moved from one partition to another – in other words, it keeps existing partitions and just adds more data to them from other partitions to reduce the number of overall partitions.
Note that we can only decrease the number of partitions with coalesce; we can’t increase it.
Unlike Coalesce, when we repartition, we can increase or decrease the number of partitions. The differences don’t stop there though – repartitioning our data does a full shuffle. That means, it drops all existing partitions and creates new, evenly distributed partitions.
This is important, because if you look at the above coalesce example we have an unbalanced set of partitions – one has way more data in it than the other & Spark works at its best when it’s got evenly distributed data to work with.
That said, we use coalesce sometimes because it does far less data shuffling and is hence much faster to complete than repartitioning. It takes a bit of trial and error to figure out whether the upfront time cost of repartitioning is worth the time saved later in the process.
The syntax to repartition is simple too: df.repartition(5) would give us 5 new partitions. Note that this is higher than the original 3 partitions, demonstrating that repartitioning can either increase or decrease the number of partitions we are working with.
A really nice feature of repartitioning is that we can repartition by column, which is loosely similar to indexing a column in a relational database: rows that share a value in that column end up in the same partition, which can make operations keyed on that column (such as joins and aggregations) much quicker.
If we need to join two datasets, it can improve our application performance significantly if both datasets are repartitioned on the same field. For example, df.repartition('accountid') partitions by the accountid field.
A more realistic example is to repartition based on multiple fields (e.g. year, month and day):
df.repartition('year', 'month', 'day')
How do we check how many partitions were created?
We can check how many partitions were created automatically, and validate that our repartitioning efforts have worked correctly, by running df.rdd.getNumPartitions() on the dataframe.
Writing the repartitioned dataframe out gives 5 files, one file per partition (see the left of the screenshot).
Partitioning enables us to maximize parallelization of our Big Data cluster, leading to big performance gains. There is a limit though: having 5,000 partitions on a cluster with only 200 CPU cores will likely lead to degraded performance. The Spark tuning guide recommends 2 to 3 tasks per CPU core in your cluster, but it always takes a bit of trial and error to find the most suitable number for your environment.
Either way, repartitioning helps us greatly in the pursuit of performance.