In the last article on performance tuning Apache Spark, we spoke about caching, UDFs and limiting your input data, which are all simple ways to see some potentially drastic performance improvements.
In this post, we’re going to talk about the environmental factors that you can control to overcome some of the more common Spark issues.
The cache limit is never enough!
If you’re hitting your cache limit, read on…
There’s a ridiculous limitation in Spark applications. Out of the box, they only allow you to bring a small amount of result data (1GB) back to the driver, which is simply not good enough.
Luckily, during our spark-submit command, we can disregard this limit. Here’s how:
spark-submit --conf spark.driver.maxResultSize=0
The above sets the maxResultSize parameter to zero, which means unlimited. You can give it a real value if you like, but given that the data I work with varies significantly in size and YARN controls the amount of resources a particular user can consume, I set it to zero (unlimited).
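If you’d rather set this inside the application than on the command line, the same property can be passed when building the session. A minimal sketch (the app name here is just a placeholder):

```python
from pyspark.sql import SparkSession

# Equivalent in-app configuration; "my-app" is a placeholder name.
spark = (
    SparkSession.builder
    .appName("my-app")
    .config("spark.driver.maxResultSize", "0")  # 0 = unlimited
    .getOrCreate()
)
```

One thing to keep in mind: properties set programmatically on the session take precedence over flags passed to spark-submit, which in turn take precedence over spark-defaults.conf, so pick one place to set it and stick with it.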
Out of memory!
If you’re getting ‘Out Of Memory’ errors in Spark, read on…
The driver is the process where your main method runs. It takes your script, cuts it up into tasks, and farms them out to executors (the worker nodes). The workers are launched at the beginning of a Spark application and usually run for the full lifetime of the application. They execute the tasks they’re given and send their results back to the driver. So what happens (in order) is this:
- You submit your application
- The Spark Session is created
- The driver requests resources from YARN
- The driver parses your application and sends tasks to the executors
- The executors run the tasks and send their results back to the driver
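To make that flow concrete, here’s a trivial, hypothetical PySpark job: the driver runs the script and plans the work, the executors compute the partitions in parallel, and the result travels back to the driver.

```python
from pyspark.sql import SparkSession

# The driver runs this script and builds the execution plan...
spark = SparkSession.builder.appName("lifecycle-demo").getOrCreate()

# ...the data is split into 8 partitions, each computed by an executor...
rdd = spark.sparkContext.parallelize(range(1000), numSlices=8)
total = rdd.map(lambda x: x * 2).sum()

# ...and the summed result is sent back to the driver process.
print(total)
spark.stop()
```

Nothing in the lambda runs on the driver; only the final sum crosses the network back, which is exactly why oversized results trip the maxResultSize limit discussed above.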
Executor memory
The executor memory parameter tells us how much memory each executor (worker node) will be able to utilize. By default it’s 512MB, which may not sound like a lot, but when you run on a multi-node cluster with a tonne of executors, it really does add up. That said, we can tune our executor memory; in the example below, I have set it to 70GB.
spark-submit --executor-memory 70g
We can do the same for the driver memory (which, of course, is the amount of memory the driver can consume). This defaults to 1GB, but we can override it (in this example, to 40GB) like below:
spark-submit --driver-memory 40g
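Putting the memory settings together, a full submit command might look like the sketch below. The script name is a placeholder and the sizes are illustrative, not recommendations; tune them to your cluster and whatever limits YARN enforces.

```shell
# Illustrative submit command; my_job.py is a placeholder script name.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 40g \
  --executor-memory 70g \
  --conf spark.driver.maxResultSize=0 \
  my_job.py
```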
No Module Named PySpark?
When you’ve been running PySpark jobs successfully for ages, you know PySpark is installed on your cluster, and then you try to run something a bit different (in my case, it was adding a UDF to my script for the first time), you may get this error.
This error means your environment isn’t configured correctly: one or more machines in the cluster may not have the required libraries. You can test this by running your script in client mode; if it runs there, you’ve got an inconsistent environment.
spark-submit --master yarn --deploy-mode client
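If client mode works but cluster mode doesn’t, one common fix is to ship the missing Python dependencies along with the job rather than chasing down every node. The --py-files flag is the standard mechanism; in this sketch, deps.zip is a placeholder archive of your modules and my_job.py a placeholder script.

```shell
# Ship local Python modules to every executor; deps.zip and my_job.py
# are placeholders for your own dependency archive and script.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --py-files deps.zip \
  my_job.py
```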
Take a timeout!
And, if you’re getting network timeout errors…
The network timeout parameter is the maximum amount of time permitted for network interactions. This defaults to 120 seconds, but in some cases your script may fail with a timeout error. This can be rectified simply by adding the below (the value is in seconds) to your spark-submit command.
spark-submit --conf spark.network.timeout=800
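If timeouts persist, spark.executor.heartbeatInterval is worth checking alongside it: Spark requires the heartbeat interval to stay well below the network timeout, so raising one without minding the other can still bite you. A sketch, with my_job.py as a placeholder:

```shell
# spark.network.timeout is in seconds when given bare; the heartbeat
# interval must remain well below it. my_job.py is a placeholder.
spark-submit \
  --conf spark.network.timeout=800 \
  --conf spark.executor.heartbeatInterval=60s \
  my_job.py
```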
I’ll keep this updated!
I’ll keep this list updated with more common errors as I run into them. I hope it was a useful read.