Data serialisation plays a critical role in the performance of our data analytics scripts. In this article, we’ll discuss serialisation as a whole and then dive into Kryo serialisation in PySpark.
What is serialisation?
Often, we store our data in arrays, tables, trees, classes and other data structures, which are inefficient to transport. So, we serialise them – that is, we collapse the data into a series of bytes which contain just enough information to reconstruct the data in its original form when it reaches its destination.
We generally serialise data when we:
- Persist our data to a file
- Store our data in databases
- Transfer data through a network (e.g. to other nodes in a cluster)
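For example, here’s a minimal sketch of that round trip (using Python’s standard-library pickle module purely for illustration – Spark uses its own serialisers, and the record here is hypothetical):

import pickle

# A small structure we'd like to persist or transmit
record = {"name": "Ada", "scores": [98, 91, 87]}

# Serialise: collapse the structure into a series of bytes
payload = pickle.dumps(record)
print(type(payload))  # <class 'bytes'>

# Deserialise: reconstruct the data in its original form
restored = pickle.loads(payload)
assert restored == record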
Back in the 1990s, there was a move away from raw byte streams of data* towards formats that are both human readable and programming-language agnostic – a real benefit when you’re communicating between two systems written in different languages. Formats we could use include XML, CSV and JSON.
These kinds of formats, although human readable, are less compact than a byte stream. In general, though, that isn’t a huge issue, as compute power is so cheap these days.
*A byte stream is simply a raw sequence of bytes – the same representation used for text files, images and videos.
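To make the human-readable side concrete, here’s the same hypothetical record serialised as JSON (a sketch using Python’s json module) – plain text that any language can parse:

import json

record = {"name": "Ada", "scores": [98, 91, 87]}

# Human readable and language agnostic, at the cost of extra verbosity
text = json.dumps(record)
print(text)  # {"name": "Ada", "scores": [98, 91, 87]}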
So, it’s going to make the transport between Spark cluster nodes faster?
Yep! That’s the plan. Encoding data into an easily transmittable format makes it 1) faster to shuffle between nodes and 2) easier to handle on arrival, as it’s language agnostic.
Alright, how do I do it?
By default in Spark, your data is already serialised using the Java serialiser (via the ObjectOutputStream framework). But it’s slow. Making a couple of tweaks to your configuration can make a big improvement.
We can use the Kryo serialiser, which claims to be up to 10x faster than the Java alternative – a claim that rings true for me, as my overall application executed 20% faster after switching to Kryo, bringing execution time down to 33 minutes.
The way I think of it is this – there is no harm in trying Kryo out. If it turns out to be faster, great! If not, remove the two lines of code and revert back to Java.
It’s pretty easy to set this up. As below, when we configure our Spark session, we simply set the config to choose the Kryo serialiser. Here, I have also set spark.kryo.registrationRequired to false, as otherwise we would have to declare the classes that we want Kryo to serialise. I want it to serialise everything, so I don’t register specific classes.
from pyspark.sql import SparkSession

def create_session(appname):
    # Build a session using Kryo in place of the default Java serialiser
    spark_session = SparkSession \
        .builder \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryo.registrationRequired", "false") \
        .appName(appname) \
        .master("yarn") \
        .enableHiveSupport() \
        .getOrCreate()
    return spark_session
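As a quick sanity check (a hypothetical usage sketch, assuming the function above), you can confirm the setting took effect once the session is created:

spark = create_session("my_app")

# Should print: org.apache.spark.serializer.KryoSerializer
print(spark.conf.get("spark.serializer"))

And if you later decide to set registrationRequired to true, Spark will expect your classes to be listed under the spark.kryo.classesToRegister config.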