hive

Parameters to make your Hive queries perform better

Hive, in my experience, is a platform which can have extremely variable performance, which can make it difficult to predict how long your jobs are going to take to run. Below are a few key optimisation techniques we can use to help make our lives a bit better! Choose the right file type First of […]

Read more
Data, hive

Keeping your Hive queries clean with CTEs

This is a super short & quick article about keeping your queries as readable and performant as possible by using CTEs. When we’re working with a number of different datasets it is really very temptying to use subqueries. However, when your queries start to get very large, this can become difficult to manage with a […]

Read more
Data, hive, ML

Working with dates in Apache Hive

Working with dates is one of those tedious things we frequently come across as data engineers. The frustration is that there are simply tonnes of date formats. Let’s list a few: Format Example MM/dd/yy 11/01/21 dd/MM/yy 01/11/21 yy/MM/dd 21/11/01 d/MM/yy 1/11/21 (no leading zeros) MMddyy 110121 ddMMyy 011121 yyyyMMdd 20211101 yyyy-MM-dd HH:mm:ss.SSS 01-11-2021 10:45:12.084 yyyy-MM-dd […]

Read more
Data, hive

A guide to windowing functions in Hive for data analysis

Windowing functions in Hive are super useful. They make analysis that would otherwise be challenging, much easier. Let’s look at examples of the key windowing functions in Hive. We will use the below dataset for our analysis, this is the ‘payments’ table. customerid department payment_amount 1 1 10000 2 1 9000 3 2 12000 4 […]

Read more
Data, hive

The Hive SQL Crash Course For Data Analysts

SQL is one of the most in-demand data skills. The language has been adopted by many database platforms, including Apache Hive. This article will serve as a crash couse into the key functionality of Hive QL. Throughout this article, we will use the two sample tables as the basis for our code. THE CUSTOMERS TABLE […]

Read more