Spark

Improving performance when calculating percentiles in Spark

Performance is a major concern when you’re working in a distributed environment with a massive amount of data. I’ve discussed Spark performance in quite a lot of detail before here and here. Today I am going to talk specifically about percentiles, because calculating percentiles over large distributed datasets is a mammoth task. You will likely […]
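The article itself is truncated above, so as a rough, hedged illustration of the kind of approach usually taken (not necessarily the article's own), here is a minimal sketch using DataFrame.approxQuantile, which trades a little accuracy for speed; the path and column names are purely illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("percentiles").getOrCreate()

# Illustrative input; any DataFrame with a numeric column will do.
df = spark.read.parquet("/data/transactions")

# approxQuantile avoids a full sort of the dataset: the third argument is the
# allowed relative error, so 0.01 means each result is within 1% of the true value.
p50, p95 = df.approxQuantile("amount", [0.5, 0.95], 0.01)
print(p50, p95)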

Spark

Overcoming Futures Timeout & Read Timeout errors in PySpark

This article covers a less-than-orthodox method for handling resource constraints in your PySpark code. Consider the scenario below. Here, we are loading data from four sources, doing some joins & aggregations, and producing an output. The problem is, we keep getting timeout errors because the data is just so large. After tuning the […]
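To make the scenario concrete (the article's own code is not shown in this excerpt), here is a hedged sketch of the kind of job being described; the source paths, join keys and aggregation are all illustrative, and this shows the setup rather than the article's fix.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("four-source-join").getOrCreate()

# Four illustrative sources, not the article's actual data.
orders    = spark.read.parquet("/data/orders")
customers = spark.read.parquet("/data/customers")
products  = spark.read.parquet("/data/products")
stores    = spark.read.parquet("/data/stores")

joined = (orders
          .join(customers, "customer_id")
          .join(products, "product_id")
          .join(stores, "store_id"))

# With very large inputs, this is where broadcast timeouts
# (spark.sql.broadcastTimeout) and executor heartbeat timeouts tend to surface.
output = joined.groupBy("store_id").agg(F.sum("amount").alias("total_amount"))
output.write.mode("overwrite").parquet("/data/output")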

Spark

Handling data skew / imbalanced partitions in PySpark

Data skew is a real problem in Spark. It seems like the sort of thing that should be solved automatically by the job manager, but unfortunately it isn’t. Skew is where a given partition contains far more data than the others, leaving one executor to process a lot of data, […]
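The excerpt cuts off before any fix, so as a hedged sketch of one common mitigation (not necessarily the one the article describes), here is key salting on an illustrative join: the hot key is spread across several artificial sub-keys so no single executor takes the whole partition.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("skew-salting").getOrCreate()

# Illustrative DataFrames; "customer_id" stands in for the skewed join key.
facts = spark.read.parquet("/data/facts")
dims  = spark.read.parquet("/data/dims")

SALT_BUCKETS = 16

# Add a random salt to the large, skewed side so the hot key is spread
# across SALT_BUCKETS partitions instead of landing on one executor.
facts_salted = facts.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# Explode the small side so every salt value has a matching row.
dims_salted = dims.withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(SALT_BUCKETS)]))
)

joined = facts_salted.join(dims_salted, ["customer_id", "salt"]).drop("salt")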

Spark

A crash course in UDFs in PySpark

Okay, so the first thing I should note is that you should avoid UDFs (User Defined Functions) like the plague unless you absolutely have to; I have spoken about why that is here. Now that I’ve warned you, let’s talk about what a UDF is. A UDF is a User Defined Function: it’s a function […]
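For readers who just want the shape of the thing (the article's own example is truncated above), here is a minimal, illustrative UDF; the data and the function itself are made up for the sketch.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-demo").getOrCreate()

# Illustrative data.
df = spark.createDataFrame([("alice",), ("BOB",)], ["name"])

# A UDF wraps ordinary Python logic so it can be applied to a column,
# at the cost of serialising rows out of the JVM (hence the performance warning).
@F.udf(returnType=StringType())
def title_case(value):
    return value.title() if value is not None else None

df.withColumn("name_clean", title_case("name")).show()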

Spark

Fault Tolerant Friday: Creating robust pipelines in Python

When you’re dealing with tonnes of different data sources, issues crop up all the time: the data source hasn’t been populated; the schema has changed; the schedule changes; or the file format changes. The key with our jobs is to build in enough checks to, at the very least, warn us if something fails, so we can […]
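As a hedged sketch of the kind of defensive check being described, here is one way to wrap a source load; the paths, expected columns and error handling are illustrative, not the article's actual pipeline.

from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

spark = SparkSession.builder.appName("robust-pipeline").getOrCreate()

EXPECTED_COLUMNS = {"event_id", "event_ts", "payload"}  # illustrative schema

def load_source(path):
    """Fail loudly, or at least warn, instead of silently producing bad output."""
    try:
        df = spark.read.parquet(path)
    except AnalysisException as exc:
        raise RuntimeError(f"Source {path} is missing or unreadable") from exc

    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise RuntimeError(f"Source {path} schema changed; missing columns: {missing}")

    if df.rdd.isEmpty():
        print(f"WARNING: source {path} is empty")  # warn, but let the job continue

    return df

events = load_source("/data/events")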

Spark

Starter Sundays: PySpark Basics, Selecting & Filtering Data

Welcome to this week’s Starter Sunday, which is where I cover some basic concepts of Spark, Python or Hive. Today, we’re looking at selecting and filtering data from our dataframes in Spark, specifically PySpark. Select specific columns: below, we have a dataframe called df_select which will take just two columns from the dataframe […]
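Since the excerpt stops mid-example, here is a hedged reconstruction of what selecting two columns and then filtering typically looks like; only the df_select name comes from the excerpt, and the source data and column names are illustrative.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("select-filter").getOrCreate()

# Illustrative source data.
df = spark.createDataFrame(
    [("alice", 34, "UK"), ("bob", 29, "US")],
    ["name", "age", "country"],
)

# Take just two columns from the dataframe, as the excerpt describes.
df_select = df.select("name", "age")

# Filter rows; col() plus a comparison builds the predicate.
df_filtered = df_select.filter(F.col("age") > 30)
df_filtered.show()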

Spark

Spark Streaming: Introduction to DStreams

In my last article, I showed an example of a Spark Streaming application for ingesting application log files. Today, I am going to give an overview of DStreams in Spark Streaming. In subsequent articles, we’ll look in detail at some of the main Spark Streaming concepts. We can source data from loads of […]
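For orientation, here is a minimal, illustrative DStream example (the classic socket word count, not the article's log-ingestion code): a DStream is essentially a sequence of RDDs, one per batch interval.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="dstream-demo")
ssc = StreamingContext(sc, 2)  # one micro-batch every 2 seconds

# A socket is just the simplest of the many possible sources.
lines = ssc.socketTextStream("localhost", 9999)

# Word count over each micro-batch.
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()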

Spark

Deploying a log monitor with Spark Structured Streaming

In this article I am going to cover how to create a simple streaming log monitor using Spark Structured Streaming. In subsequent posts, I’ll go deeper on some of these aspects. The first thing we need to do, as always, is to create a Spark Session. There isn’t much new here, except that we import the […]
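The excerpt cuts off at the imports, so here is a hedged sketch of a minimal Structured Streaming log monitor; the log directory, the ERROR filter and the console sink are illustrative choices, not necessarily the article's.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("log-monitor").getOrCreate()

# Watch a directory of log files as an unbounded streaming source.
logs = spark.readStream.text("/var/logs/app")

# Keep only lines containing ERROR and print each micro-batch to the console.
errors = logs.filter(F.col("value").contains("ERROR"))

query = (errors.writeStream
               .outputMode("append")
               .format("console")
               .start())
query.awaitTermination()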
