Golang

Golang: Variable Assignment & Data Types

Variable assignment and data types are the kind of topics that are absolutely necessary to cover. If we think about Python for a second, we can happily do this: j = 'Bob Smith' and then j = 13. Here we have assigned a variable called j to be a string value of Bob Smith and then have immediately assigned the […]

Golang

Golang: Getting to grips with Hello World

Okay, okay, I know: Hello World is the most boring application it’s possible to make in any programming language, BUT there is a reason that it’s always used: it is quite a good use case. Below, we have the Golang implementation of Hello World and, actually, we can derive quite a lot of information […]

Spark

Using Kryo Serialization to boost Spark performance by 20%

Data serialisation plays a critical role in the performance of our data analytics scripts. In this article, we’ll discuss serialisation as a whole and then dive into Kryo serialisation in PySpark. What is serialisation? Often, we store our data in arrays, tables, trees, classes and other data structures, which are inefficient to transport. So, we […]

Spark

Repartition vs Coalesce

When we ingest data into Spark, it’s automatically partitioned and those partitions are distributed across different nodes. So, imagine we have brought some data into Spark; it might be split across partitions like this: Partition 1: 1, 7, 5, 9, 12; Partition 2: 15, 21, 12, 17, 19; Partition 3: 45, 32, 25, 99. When […]

Spark

Overcoming Spark Errors

In the last article on performance tuning Apache Spark, we spoke about caching, UDFs and limiting your input data, which are all simple ways to see some potentially drastic performance improvements. In this post, we’re going to talk about the environmental factors that you can control to overcome some of the more common Spark issues. […]
