Keeping your Hive queries clean with CTEs

This is a super short & quick article about keeping your queries as readable and performant as possible by using CTEs.

When we’re working with a number of different datasets it is really very temptying to use subqueries. However, when your queries start to get very large, this can become difficult to manage with a large number of sub queries.

One way to get around this is to use CTE’s (Common Table Expressions). These are temporary tables which we can create in our dataset.

In the below example, I have created q1 and q2. These are temporary tables that live for the life of the query. We can then reference these temporary tables later on (as many times as we like), which leads to less duplication in our code, makes it easier to read and more performant if the CTE is referenced more than once.

WITH 
q1 AS (SELECT field1, field2, field3, case when field4 = 4 then 'yes' else 'no' end as flag from database.table),
q2 AS (SELECT field5, sum(field6) as n from database.table2 group by field5)

SELECT * from q1
left join q2 on q1.field1 = q2.field5

Recently, I have had a piece of code which needed to run the same heavy piece of code twice by using sub queries – so it could join in two places. By utilising CTE’s we ensure that this snippet does not need to be copied in the query more than once and it only runs once too. Leading to readability and performance gains.

Let’s look at a simpler example. In the below, I have two queries, one which sums the order values and right joins to find only customers that are > 21 years of age and the other which sums the refund values for the same customers. You can see that the right join query appears twice, leading to some redundancy.

select * from (

select cid, sum(gbp) as order_value from orders
right join (select distinct userid from customers.details where age > 21)o on o.userid = orders.cid
group by cid)x

left join (
select cid, sum(gbp) as refund_value from refunds
right join (select distinct userid from customers.details where age > 21)o on o.userid = orders.cid
group by cid)q

on q.cid = x.cid

We could use a CTE as below. Here, we define the query for the distinct users once, but use it twice. Not only is there less code redundancy and better performance, but there is also the benefit that this is more readable as we define the three tables upfront and it’s clear what is happening, which sometimes in a string of nested queries, it is not.

with 

users as (select distinct userid from customers.details where age > 21),

refunds as (select cid, sum(gbp) as refund_value from refunds right join users on users.userid = refunds.cid group by cid),

orders as (select cid as cid2, sum(gbp) as order_value from orders right join users on users.userid = orders.cid group by cid),

select cid, refund_value, order_value from refunds 
left join orders on orders.cid2 = refunds.cid

Overall, if you are planning to industrialise your query; to be read by other people, it’s important to try and avoid using sub queries & rather use CTE’s where possible.

Kodey