Data Skew is a real problem in Spark. It seems like the sort of thing which should be solved automatically by the job manager but unfortunately, it isn’t. Skew is where we have a given partition which contains a huge amount more data than others – leaving one executor to process a lot of data, […]
Read more