A guide to windowing functions in Hive

Windowing functions in Hive are super useful. They make analysis that would otherwise be challenging, much easier. Let’s look at examples of the key windowing functions in Hive. We will use the below dataset for our analysis, this is the ‘payments’ table.

customeriddepartmentpayment_amount
1110000
219000
3212000
4215000
526000
634000
7318000
8311000
932000

Windowed Aggregations

If you wanted to calculate the percentage of total payments per department that each individual transaction comprised, it may be fiddly without windowing functions.

In the below, we select the sum of payment_amount over the partition of departments. We can read that as ‘FOR EACH DEPARTMENT, SUM(PAYMENT_AMOUNT)’. This gives us a new column, with the departmental total.

select customerid, 
department, 
sum(payment_amount) over (partition by department order by department asc) as total_payments 
from payments
customeriddepartmentpayment_amounttotal_payments
111000019000
21900019000
321200033000
421500033000
52600033000
63400035000
731800035000
831100035000
93200035000

We can do the same with any of our aggregate functions, including min, max and count. An example of count is below, where we count the number of customers in a department.

select customerid, 
department, 
count(payment_amount) over (partition by department order by department asc) as customer_count 
from payments
customeriddepartmentpayment_amountcustomer_count
11100002
2190002
32120003
42150003
5260003
6340004
73180004
83110004
9320004

Lead and Lag

Often, in timeseries data, you want to compare the current record with the preceding or subsequent record. Let’s consider the below dataset to understand the lead and lag functions. In this example, we have the sales, per store for a given company.

Monthstoreincome
1110000
2110100
1222000
2211900
3114666
3215000

If we wanted to calculate the change from one month to the next, we could use the lag function. In the below we lag income by 1. Which means, we take the immediately preceding value. We also set the default value to zero, which means, if there is no preceding row, the default value will be zero, rather than null.

If we wanted the next value (i.e. to bring the value from month 2 into the row for month 1), we could use the lead function.

select month, store, income, 
lag(income, 1,0) over (partition by store order by month asc) as last_month
from sales

The result of the function is shown below. I’ve retained only store 1 to make it easier to see what’s happening. You can see that this brings data from another row, making it easier for us to carry out our analysis.

Monthstoreincomelast_month
11100000
211010010000
311466610100

Rank

The rank function does exactly what it says on the tin. It lets us rank the rows in our dataset by a given value. In the below, we rank the months, based on sales.

select month, store, income,
rank() over (partition by store order by income desc ) as rank
from sales
Monthstoreincomerank
31146661
21101002
11100003

Kodey