Data, hive

Keeping your Hive queries clean with CTEs

This is a short article about keeping your queries as readable and performant as possible by using CTEs. When we’re working with a number of different datasets, it is very tempting to use subqueries. However, when your queries start to get very large, this can become difficult to manage with a […]
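To make the subquery versus CTE contrast concrete, here is a minimal sketch. It runs the SQL through Python's built-in sqlite3 module rather than Hive, purely so it can be executed without a cluster; the `orders` table, its columns and the threshold are hypothetical and not taken from the article. The `WITH` clause itself is also available in Hive.

```python
import sqlite3

# Hypothetical table and data, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, amount INTEGER);
    INSERT INTO orders VALUES (1, 100), (1, 50), (2, 30);
""")

# Subquery version: the aggregation is buried inside the FROM clause.
subquery_sql = """
    SELECT customer_id, total
    FROM (
        SELECT customer_id, SUM(amount) AS total
        FROM orders
        GROUP BY customer_id
    ) t
    WHERE total > 100
"""

# CTE version: the same aggregation is named up front, then queried.
cte_sql = """
    WITH customer_totals AS (
        SELECT customer_id, SUM(amount) AS total
        FROM orders
        GROUP BY customer_id
    )
    SELECT customer_id, total
    FROM customer_totals
    WHERE total > 100
"""

subquery_rows = conn.execute(subquery_sql).fetchall()
cte_rows = conn.execute(cte_sql).fetchall()
print(cte_rows)
```

Both queries return the same rows; the CTE form simply reads top to bottom, which tends to matter more as the number of intermediate steps grows.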

Data, hive, ML

Working with dates in Apache Hive

Working with dates is one of those tedious things we frequently come across as data engineers. The frustration is that there are simply tonnes of date formats. Let’s list a few:

Format                     Example
MM/dd/yy                   11/01/21
dd/MM/yy                   01/11/21
yy/MM/dd                   21/11/01
d/MM/yy                    1/11/21 (no leading zeros)
MMddyy                     110121
ddMMyy                     011121
yyyyMMdd                   20211101
yyyy-MM-dd HH:mm:ss.SSS    2021-11-01 10:45:12.084
yyyy-MM-dd […]
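As a rough illustration of why these formats are tedious, the sketch below parses several of the examples listed above into the same calendar date. It uses Python's `datetime.strptime` rather than Hive's own date functions, and the mapping from the Java-style patterns to Python directives is an assumption of this sketch, not something from the article.

```python
from datetime import datetime

# (pattern as listed above, equivalent Python strptime directive, example value).
# The pattern-to-directive mapping is this sketch's assumption.
samples = [
    ("MM/dd/yy", "%m/%d/%y", "11/01/21"),
    ("dd/MM/yy", "%d/%m/%y", "01/11/21"),
    ("yy/MM/dd", "%y/%m/%d", "21/11/01"),
    ("MMddyy", "%m%d%y", "110121"),
    ("ddMMyy", "%d%m%y", "011121"),
    ("yyyyMMdd", "%Y%m%d", "20211101"),
]

# Parse every example and keep only the date part.
parsed = {
    pattern: datetime.strptime(text, directive).date()
    for pattern, directive, text in samples
}

for pattern, value in parsed.items():
    print(f"{pattern:10} -> {value.isoformat()}")
```

Every example resolves to 1 November 2021, even though strings such as 11/01/21 and 01/11/21 are ambiguous unless you know which format produced them.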

Data

An introduction to structured data modelling

This is an introductory chapter of my upcoming book ‘Data Badass’ (pictured below): Data modelling is all about designing the way that your data is going to be organized. Think about it like building a house; you wouldn’t start laying bricks without a plan. How would you know that the end result would meet your […]

Data

An introduction to data structures for aspiring data engineers

This is an introductory chapter of my upcoming book ‘Data Badass’ (pictured below): Data Types. Data types are ways to classify data, for example integer, string and boolean. These classifications determine what kind of values we can store in our data structures (outlined below). A string is simply a text value. An example […]

Data, hive

A guide to windowing functions in Hive for data analysis

Windowing functions in Hive are super useful. They make analysis that would otherwise be challenging much easier. Let’s look at examples of the key windowing functions in Hive. We will use the dataset below, the ‘payments’ table, for our analysis.

customerid  department  payment_amount
1           1           10000
2           1           9000
3           2           12000
4 […]
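A minimal sketch of what a windowing function does, using only the three ‘payments’ rows visible in the excerpt above. The SQL runs through Python's built-in sqlite3 module (SQLite 3.25 or newer is assumed for window function support) rather than Hive. `RANK()` and `SUM() OVER (PARTITION BY ...)` exist in Hive too, but the choice of these two functions and the column aliases are assumptions of this sketch, not necessarily what the article covers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE payments (
        customerid INTEGER,
        department INTEGER,
        payment_amount INTEGER
    );
    -- Only the rows visible in the excerpt; the rest of the table is elided.
    INSERT INTO payments VALUES (1, 1, 10000), (2, 1, 9000), (3, 2, 12000);
""")

# Each row keeps its own identity while also seeing its department's
# ranking and total, which a plain GROUP BY would collapse away.
ranked = conn.execute("""
    SELECT customerid,
           department,
           payment_amount,
           RANK() OVER (
               PARTITION BY department
               ORDER BY payment_amount DESC
           ) AS dept_rank,
           SUM(payment_amount) OVER (PARTITION BY department) AS dept_total
    FROM payments
    ORDER BY customerid
""").fetchall()

for row in ranked:
    print(row)
```

The point of the design is that the window is computed per department without reducing the number of rows returned.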

Data, hive

The Hive SQL Crash Course For Data Analysts

SQL is one of the most in-demand data skills. The language has been adopted by many database platforms, including Apache Hive. This article will serve as a crash course in the key functionality of Hive QL. Throughout this article, we will use two sample tables as the basis for our code. THE CUSTOMERS TABLE […]

Data, ML

The data scientist learning plan for 2021

When you look online at what it takes to become a data scientist, it’s enough to make your brain melt. You have people telling you that you need to be an expert statistician / mathematician; you need to be a top-level coder in 15 different languages; be proficient with every type of SQL/NoSQL database on […]

Data

An introduction to Awk

Project 1 (Print file): This is a simple use case: we are going to print the contents of a text file. To do that, we use: awk '{print}' filename.csv Here, the single quotes enclose the program and the curly braces enclose an action within that program. In […]

Data

What’s all this about Flume + Kafka (Flafka)?

There is a concept called ‘Flafka’ through which we exploit the strengths of both Kafka and Flume to create a more robust ingestion process. By deploying Flafka, we take advantage of the out-of-the-box connectors available in Flume and use them to connect to all of our data sources. The data is then […]

Data

An introduction to Kafka

Kafka is a low-latency messaging system. It takes data from one location, for example application log files, and makes it available to other systems with very little delay (latency). That may be useful in the example below, where Kafka makes our CRM and Google Analytics data available to Spark Streaming, which will carry out some […]

Data

An introduction to Flume

Flume is similar to Kafka in many ways. It takes data from its source and distributes it to its destination. The difference is that Flume is designed for high-throughput log streaming into Hadoop (HDFS or HBase); it is not designed to ship out to a large number of consumers. Flume, like Kafka, is […]
