Keeping your Hive queries clean with CTEs

This is a super short & quick article about keeping your queries as readable and performant as possible by using CTEs.

What are CTEs?

CTEs, or Common Table Expressions, are a powerful tool within the SQL language. CTEs are defined and referenced by the WITH clause in an SQL statement, which allows you to construct a named sub-query block to be used within a larger query. They provide the same functionality as a view, but can be referenced multiple times within the same statement, while views can only be used once.

CTEs allow users to define multiple datasets within a single query, simplifying data management, readability and analysis. They can also be used as an alternative to joins and temporary tables, providing flexibility and efficient execution of complex queries. CTEs are a valuable feature that make data querying simpler, faster and more efficient.

When we’re working with a number of different datasets it is really very tempting to use subqueries. However, when your queries start to get very large, this can become difficult to manage with a large number of sub queries.

One way to get around this is to use CTE’s (Common Table Expressions). These are temporary tables which we can create in our dataset.

How do we use CTEs?

In the below example, I have created q1 and q2. These are temporary tables that live for the life of the query. We can then reference these temporary tables later on (as many times as we like), which leads to less duplication in our code, makes it easier to read and more performant if the CTE is referenced more than once.

q1 AS (SELECT field1, field2, field3, case when field4 = 4 then 'yes' else 'no' end as flag from database.table),
q2 AS (SELECT field5, sum(field6) as n from database.table2 group by field5)
SELECT * from q1
left join q2 on q1.field1 = q2.field5

Recently, I have had a piece of code which needed to run the same heavy piece of code twice by using sub queries – so it could join in two places. By utilizing CTE’s we ensure that this snippet does not need to be copied in the query more than once and it only runs once too. Leading to readability and performance gains.

Let’s look at a simpler example. In the below, I have two queries, one which sums the order values and right joins to find only customers that are > 21 years of age and the other which sums the refund values for the same customers. You can see that the right join query appears twice, leading to some redundancy.

select * from (
select cid, sum(gbp) as order_value from orders
right join (select distinct userid from customers.details where age > 21)o on o.userid = orders.cid
group by cid)x
left join (
select cid, sum(gbp) as refund_value from refunds
right join (select distinct userid from customers.details where age > 21)o on o.userid = orders.cid
group by cid)q
on q.cid = x.cid

We could use a CTE as below. Here, we define the query for the distinct users once, but use it twice. Not only is there less code redundancy and better performance, but there is also the benefit that this is more readable as we define the three tables upfront and it’s clear what is happening, which sometimes in a string of nested queries, it is not.

users as (select distinct userid from customers.details where age > 21),
refunds as (select cid, sum(gbp) as refund_value from refunds right join users on users.userid = refunds.cid group by cid),
orders as (select cid as cid2, sum(gbp) as order_value from orders right join users on users.userid = orders.cid group by cid),
select cid, refund_value, order_value from refunds 
left join orders on orders.cid2 = refunds.cid

Overall, if you are planning to industrialize your query; to be read by other people, it’s important to try and avoid using sub queries & rather use CTE’s where possible.

Share the Post:

Related Posts