Windowing functions in Hive are super useful. They make analysis that would otherwise be challenging, much easier. Let’s look at examples of the key windowing functions in Hive. We will use the below dataset for our analysis, this is the ‘payments’ table.
customerid | department | payment_amount |
1 | 1 | 10000 |
2 | 1 | 9000 |
3 | 2 | 12000 |
4 | 2 | 15000 |
5 | 2 | 6000 |
6 | 3 | 4000 |
7 | 3 | 18000 |
8 | 3 | 11000 |
9 | 3 | 2000 |
Windowed Aggregations
If you wanted to calculate the percentage of total payments per department that each individual transaction comprised, it may be fiddly without windowing functions.
In the below, we select the sum of payment_amount over the partition of departments. We can read that as ‘FOR EACH DEPARTMENT, SUM(PAYMENT_AMOUNT)’. This gives us a new column, with the departmental total.
select customerid,
department,
sum(payment_amount) over (partition by department order by department asc) as total_payments
from payments
customerid | department | payment_amount | total_payments |
1 | 1 | 10000 | 19000 |
2 | 1 | 9000 | 19000 |
3 | 2 | 12000 | 33000 |
4 | 2 | 15000 | 33000 |
5 | 2 | 6000 | 33000 |
6 | 3 | 4000 | 35000 |
7 | 3 | 18000 | 35000 |
8 | 3 | 11000 | 35000 |
9 | 3 | 2000 | 35000 |
We can do the same with any of our aggregate functions, including min, max and count. An example of count is below, where we count the number of customers in a department.
select customerid,
department,
count(payment_amount) over (partition by department order by department asc) as customer_count
from payments
customerid | department | payment_amount | customer_count |
1 | 1 | 10000 | 2 |
2 | 1 | 9000 | 2 |
3 | 2 | 12000 | 3 |
4 | 2 | 15000 | 3 |
5 | 2 | 6000 | 3 |
6 | 3 | 4000 | 4 |
7 | 3 | 18000 | 4 |
8 | 3 | 11000 | 4 |
9 | 3 | 2000 | 4 |
Lead and Lag
Often, in timeseries data, you want to compare the current record with the preceding or subsequent record. Let’s consider the below dataset to understand the lead and lag functions. In this example, we have the sales, per store for a given company.
Month | store | income |
1 | 1 | 10000 |
2 | 1 | 10100 |
1 | 2 | 22000 |
2 | 2 | 11900 |
3 | 1 | 14666 |
3 | 2 | 15000 |
If we wanted to calculate the change from one month to the next, we could use the lag function. In the below we lag income by 1. Which means, we take the immediately preceding value. We also set the default value to zero, which means, if there is no preceding row, the default value will be zero, rather than null.
If we wanted the next value (i.e. to bring the value from month 2 into the row for month 1), we could use the lead function.
select month, store, income,
lag(income, 1,0) over (partition by store order by month asc) as last_month
from sales
The result of the function is shown below. I’ve retained only store 1 to make it easier to see what’s happening. You can see that this brings data from another row, making it easier for us to carry out our analysis.
Month | store | income | last_month |
1 | 1 | 10000 | 0 |
2 | 1 | 10100 | 10000 |
3 | 1 | 14666 | 10100 |
Rank
The rank function does exactly what it says on the tin. It lets us rank the rows in our dataset by a given value. In the below, we rank the months, based on sales.
select month, store, income,
rank() over (partition by store order by income desc ) as rank
from sales
Month | store | income | rank |
3 | 1 | 14666 | 1 |
2 | 1 | 10100 | 2 |
1 | 1 | 10000 | 3 |