The Hive SQL Crash Course For Data Analysts

SQL is one of the most in-demand data skills. The language has been adopted by many database platforms, including Apache Hive. This article will serve as a crash couse into the key functionality of Hive QL.

Throughout this article, we will use the two sample tables as the basis for our code.

THE CUSTOMERS TABLE

FIELD NAMEDATA TYPEDESCRIPTION
CUSTOMER_IDINTEGERTHE ID OF A CUSTOMER
CUSTOMER_NAMESTRINGTHE FULL NAME OF THE CUSTOMER (FIRSTNAME SURNAME)
FIRST_LINE_ADDRESSSTRINGHOUSE NUMBER & STREET NAME
TOWNSTRINGTOWN NAME
POSTCODESTRINGPOST CODE WITH SPACE (XX22 XXX)
AGEINTEGERTHE AGE IN WHOLE YEARS
GENDERSTRINGM or F
SALARYFLOATANNUAL SALARY OF THE CUSTOMER


ORDERS TABLE

FIELD NAMEDATA TYPEDESCRIPTION
ORDER_IDINTEGERID OF THE ORDER
CUSTOMER_IDINTEGERID OF THE CUSTOMER
ORDER_VALUEFLOATVALUE INCLUDING PENCE
ORDER_TAXSTRINGTAX CALCULATED ON ORDER
ORDER_DATESTRINGYYYY-MM-DD ORDER DATE

Selecting & Filtering Data

The first thing we want to to is simply select everything from the customers table, to see what kind of data we have. To do that, we simply use the below code, where * means ‘everything’.

select * 
from customers

Now, we want to find all the records, specifically for customers who are female. We therefore add a where statement. Now, the below reads as ‘select everything from the customers table, where the gender field is equal to F’.

select * 
from customers 
where gender = 'F'

I can refine this more, I can select, all the records from the customers table, where gender is female and their salary is over 50,000.

select * 
from customers 
where gender = 'F' and salary > 50000

Sometimes it can be a bit more complex, what if, I want to select everything from the customers table, where the customer is female and their salary is above 50,000 and their town is either London or Manchester. Notice the use of brackets here. This ensures that the statement is executed correctly.

select * 
from customers 
where gender = 'F' and salary > 50000 
and (town = 'London' or town = 'Manchester')

Now, we want to select a subset of the fields in the table, we just want town, so we can understand where these customers live. The below selects only that field.

select town 
from customers 
where gender = 'F' and salary > 50000 
and (town = 'London' or town = 'Manchester')

It turns out, we have lots of customers residing in the same town, so the towns are duplicated. Our result looks a bit like: London, London, Manchester, London, Birmingham…. So, we want to select only the unique towns. Here, we can use the distinct operator. This removes all the duplicated town names from the result.

select distinct town 
from customers 
where gender = 'F' and salary > 50000 
and (town = 'London' or town = 'Manchester')

In our where statements, we can use:

  • > : greater than
  • < : less than
  • >= : greater or equal to
  • <= : less than or equal to
  • in : where the value is in a list (e.g. name in (‘Sam’, ‘Bob’, ‘Sarah’)
  • between : e.g. where salary between 10000 and 20000

Aggregating our data

By this point, you should be comfortable with selecting data from tables and applying multiple filters to that data. Imagine if we wanted to find out a few things:

  • How many customers do we have that are Female and earn more than 50,000?
  • What is the minimum, maximum and average salary for all of our customers?
  • How many customers do we have in each town?

How many customers do we have that are Female and earn more than 50,000?

Let’s start with the first question. The below query would answer this. It reads as: Count every record in the database table where the customer is female and earns more than 50,000.

You will notice the ‘as customer_count’ part of the statement. This simply means, the result of the count will be stored in a column called ‘customer_count’.

select count(*) as customer_count 
from customers 
where gender = 'F' and salary > 50000

The above works, if we assume that the customers in the database only appear once. But, what if, a customer has duplicated records in the table for some reason? In this instance, we can use count(distinct(customer_id)), as below. This will count the number of unique customer ID’s that appear in the table and disregard any duplicate values.

select count(distinct(customer_id)) as customer_count 
from customers 
where gender = 'F' and salary > 50000


What is the mimimum, maximum and average salary for all our customers?

We’ve now been asked to calculate the minimum, maximum and average salary for all of our customer base. We can do that, using the following out-of-the-box functions, available in SQL.

select min(salary) as minimum_salary, 
max(salary) as maximum_salary, avg(salary) as average_salary 
from customers


How many customers do we have in each town?

You know how to find how many customers we have in total, right? You’d just use the below statement, to count all unique customer ID’s in the customers table.

select count(distinct(customer_id)) as customer_count 
from customers 

If we wanted to group that by town and say ‘for each town, how many customers do we have?’, we would adjust the query as below. We can read this as ‘for each town, count the number of unique customer_id values’. Notice, at the end, we have to specifically declare that we want to group by town (i.e. know the answer, per town).

select town, count(distinct(customer_id)) as customer_count 
from customers 
group by town

Ordering and sorting our data

We’ve now been asked to idenfiy the 10 customers with the highest salary. For this, we must use order by. You may be aware of a sort by function too – the difference here is, the order by function will guarantee that the result of your query is in order. Whereas, the sort by statement, will only ensure the ordering on a given reducer (a concept in parallelized computation) – for almost all use-cases, order by is what you need.

In the below, I have selected everything from the customers table, ordered by salary in a descending order and have limited the output to 10 rows (10 customers). This will provide you with the top 10 customers, from a salary perspective.

select * 
from customers 
order by salary desc 
limit 10

Casting data to different data types

Often, the data in your table is not defined correctly. For example, your date field may be defined as a string, which means you can’t use date functions on it (e.g. today minus 1), so you may need to cast that field to a datetime format.

In our orders table, we have a date field, defined as a string, with the format yyyy-mm-dd. To cast that as a datetime format, we can simply use the below function:

select cast(order_date as datetime) 
from orders

Similarly, we have the order_tax field, defined as a string. During our query, we want to be able to make it clear that this value is a float. We can convert it from string, to float, using the below:

select cast(order_tax as float) 
from orders

Joining Tables

We have a table containing all of the orders placed but we don’t store the customer details. What if I wanted to join them together?

The first query is selecting all of the fields from the orders table and the town field from the customers table. The reason I can select a field from a different table is, I have joined the two tables together.

I have used a left join. That means, show me absolutely all of the orders, irrespective of whether they have a matching customer ID in the customers table.

select order_id, customer_id, order_value, 
order_tax, order_date, town 
from orders
left join customers on customers.customer_id = orders.customer_id

If I were to update the statement to a right join, it would show me all of the orders ONLY where there is a matching customer_id in the customers table.

select order_id, customer_id, order_value, 
order_tax, order_date, town 
from orders
right join customers on customers.customer_id = orders.customer_id

We can also use an inner join, where it will show me only the orders of customer_id’s that exist in both the orders and customers table.

select order_id, customer_id, order_value, 
order_tax, order_date, town 
from orders
inner join customers on customers.customer_id = orders.customer_id

Case Statements

Case statements are super powerful tools. It’s gives us the ability do derive new values, based on a set of conditions.

Consider a basic example. If our order value is greater than 200 then we will assign it the flag of ‘High Value’, if it’s below 20, we will assign it the value of ‘Low Value’ and everything else will be assigned as ‘Medium Value’.

In the below, I have written a query that does exactly that. It says ‘select all of the fields from the orders table and add a new field called order_flag, which is calculated using this logic’.

select *,
case 
when order_value >200 then 'High Value'
when order_value <20 then 'Low Value'
else 'Medium Value'
end as order_flag
from orders

Nested Queries

Now, we need to start bringing in joins and also a new concept called nested queries. You’ll remember the below query. We joined the orders and customers tables, so we knew the town the order was being shipped to.

If we wanted to calculate the shipping cost on our orders, we may have this logic: If the town is equal to London, the shipping cost should be 10% of the order value. If the town is equal to Glasgow, the shipping cost should be 5% of the order value and if the town is equal to Manchester then shipping should be free.

select order_id, customer_id, order_value, 
order_tax, order_date, town 
from orders
inner join customers on customers.customer_id = orders.customer_id

We can think of this table as our foundation. It’s the base upon which we can build our query. In the below, I have an example of a nested query. I am simply selecting all of the fields from the query I built above. Notice that at the very end, I have ‘ordertowns’.

This means, everything within the brackets, becomes an in-memory table, which I have called ‘ordertowns’.

select * from (
select order_id, customer_id, order_value, 
order_tax, order_date, town 
from orders
inner join customers on customers.customer_id = orders.customer_id)ordertowns

Now then, I could create my logic in the outer query. You can see that I have created our new field, in the outer query, utilizing the ‘in-memory’ table, called ordertowns.

select *, 
case
when town = 'London' then order_value * 0.1
when town = 'Glasgow' then order_value * 0.05
when town = 'Manchester' then 0
end as shipping_cost
from (
select order_id, customer_id, order_value, 
order_tax, order_date, town 
from orders
inner join customers on customers.customer_id = orders.customer_id)ordertowns


Sub Queries in Where Condition

Sub_Queries really come into their own when we want to answer questions like ‘show me all of the orders associated to customers who earn over 50,000’.

Here, we simply use the sub query in the where statement. It says ‘show me customer_id’s which exist in the result of the query within the brackets.

select * 
from orders 
where customer_id in (select customer_id from customers where salary > 50000)

Unions

If you had two tables, 2020_customers and 2021_customers and you wanted to stack the tables together (one on top of the other) and select the unique customers from both, you could use a union.

You can either, run a union all like below, to get all values, including duplicates.

select * from 2020_customers
union all
select * from 2021_customers

Or you can use a union distinct, which handles the dupicates automatically.

select * from 2020_customers
union distinct
select * from 2021_customers

Rank

Ranking data is always useful. Looking back at or question about the 10 customers with the highest salary – what if we wanted to work out, the 10 customers with the highest salary, in each town.

For that, we can use the rank function. Here, we have assigned a ranking partitioned by town. So that means, for each town, order salary in descending order and rank them.

If we have 2 towns, London and Glasgow, it means, we will have two of every rank. There will be a rank 1 for Glasgow and a rank 1 for London.

select *,
rank() over (partition by town order by salary desc) rank
from customers
where rank <=10

String Functions

The LIKE function in SQL is one of the most useful features we have. Let’s say, we are interested in a particular postcode area. The postcode must begin with XA but we don’t care what comes after it.

We can use the like function, with a wildcard. The % sign means ‘anything’. So this says ‘begin with XA and end in anything’. Specifically, the % sign means 0 or more characters.

select * 
from customers 
where postcode like 'XA%'

If we were interested in postcodes that ended in PP, we could use the below. Notice that the % sign has now moved to the beginning. It means, ‘start with anything, but must end in PP’.

select * 
from customers 
where postcode like '%PP'

If we were interested in postcodes that included the sequence ‘GNT’ anywhere within the postcode, we could use the below. Note that we now have % sign’s on both sides of the string. This means, it can start with anything, it can end with anything but it must include GNT somewhere.

select * 
from customers 
where postcode like '%GNT%'

Other wildcards which we can use in place of the % sign are sometimes useful. For example: _ means a single character, so we could look for postcodes like ‘XX_ 9XX’, which would look for any pattern (e.g. XX1 9XX; XX2 9XX etc..).

RLIKE is another useful operator. It’s a single statement to say ‘if the customer name is like Sam (e.g. Samuel, Samantha, Sam) or Ron (e.g. Ronald, Ron, Ronny), then pull the data. We simply say:

select * 
from customers 
where name rlike 'Sam|Ron'

We can select portions of strings by using the substring function. For example, if I wanted certain characters from the postcode field, I could use the below. 1 is the starting character and 4 is the number of characters that I want to select.

select substr(postcode, 1, 4) as sub_postcode
from customers

We can concatinate strings. For example, if you had a firstname and a lastname column, you could create a fullname column, by concatinating the firstname and lastname together, with a space in the middle.

select concat(firstname, ' ', lastname) as fullname
from customers

We cab also cast strings to lower / upper case, which is really useful when we have user-inputted data, which is not always consistent. We can use the below to do so:

select lower(postcode) as lowercase, upper(postcode) as uppercase
from customers

Sometimes, strings may have whitespace at the end, which means name = ‘josh’ may not work. As it may be stored as ‘josh ‘. We can remove the whitespace using trim:

select trim(name) as name2 
from customers

We also have the option of rtrim and ltrim, which will trim only one side of the string.

select rtrim(name) as right_trim, ltrim(name) as left_trim 
from customers

We can split strings using the split function, which is very useful. The below will split the postcode of XX88 XXX at the space and return the first chunk (XX88).

select split(postcode, ' ')[0]
from customers

.

Kodey

One thought on “The Hive SQL Crash Course For Data Analysts

Comments are closed.