Working with dates in Apache Hive

Working with dates is one of those tedious things we frequently come across as data engineers. The frustration is that there are simply tonnes of date formats. Let’s list a few:

Format
Example
MM/dd/yy11/01/21
dd/MM/yy01/11/21
yy/MM/dd21/11/01
d/MM/yy1/11/21 (no leading zeros)
MMddyy110121
ddMMyy011121
yyyyMMdd20211101
yyyy-MM-dd HH:mm:ss.SSS01-11-2021 10:45:12.084
yyyy-MM-dd HH:mm:ss01-11-2021 10:45:12
yyyy-MM-dd HH:mm01-11-2021 10:45

The key part of working with dates in Hive is the unix timestamp. A unix timestamp is a representation of a point in time in seconds. Once we have converted our date to be in unix timestamp format, we can do almost anything we like with it.

For example, in the below, I have the date: 01-11-2021 10:45:12.084; which when converted to unix timestamp is 1610361912. If we look at 10 minutes after to this, we get 01-11-2021 10:55:12.084; which converts to 1610362512. The difference between these two times is 600 seconds, which is 10 minutes.

In Hive, we can simply subtract one unix_timestamp from another to calculate the difference.

select 
unix_timestamp('01-11-2021 10:45:12.084','MM-dd-yyyy HH:mm:ss.SSS') - unix_timestamp('01-11-2021 10:55:12.084','MM-dd-yyyy HH:mm:ss.SSS')

Calculate the week number

With this in mind, let’s look at some examples. What if I wanted to take the week from a year? Well, I’d still convert the date to a unix timestamp, and specify it’s current format. I would then use the weekofyear function to extract the week number. Below are two examples, with different input date formats.

select weekofyear(from_unixtime(unix_timestamp('20210419' ,'yyyyMMdd'),'yyyy-MM-dd'))

select weekofyear(from_unixtime(unix_timestamp('2021-04-19' ,'yyyy-MM-dd'),'yyyy-MM-dd'))

Calculate number of days between two dates

Often, we want to find out how long ago something happened. That could be, how long ago it was compared with today or it could be the difference between (for example) an order date & the delivery date. In the first statement below, I take the current date as simply being unix_timestamp() and work out the difference between this and 20210419 (the order date in yyyyMMdd format). In the second line, I carry out the same operation, but using a slighly different input date format yyyy-MM-dd. In the final line, I work out the difference between two dates (where I am not using today as one of the arguments).

As shown in the penultimate line of the below, we can do this with full date time stamps too. The example will return the result 0, because the difference is only 10 minutes and hence there are no ‘days’ difference.

The final line would return an answer of 3, as there are three days between the two dates, it will not tell you the difference in hours/minutes.

select datediff(to_date(from_unixtime(unix_timestamp('20210419' ,'yyyyMMdd'), 'yyyy-MM-dd')), from_unixtime(unix_timestamp())) as date_diff

select datediff(to_date(from_unixtime(unix_timestamp('2021-04-19' ,'yyyy-MM-dd'), 'yyyy-MM-dd')), from_unixtime(unix_timestamp())) as date_diff

select datediff(to_date(from_unixtime(unix_timestamp('2021-04-19' ,'yyyy-MM-dd'), 'yyyy-MM-dd')), to_date(from_unixtime(unix_timestamp('2021-06-19' ,'yyyy-MM-dd'), 'yyyy-MM-dd'))) as date_diff

select datediff(to_date(from_unixtime(unix_timestamp('01-11-2021 10:45:12.084','MM-dd-yyyy HH:mm:ss.SSS'), 'yyyy-MM-dd')), to_date(from_unixtime(unix_timestamp('01-11-2021 10:55:12.084','MM-dd-yyyy HH:mm:ss.SSS'), 'yyyy-MM-dd'))) as date_diff

select datediff(to_date(from_unixtime(unix_timestamp('07-14-2021 10:45:12.084','MM-dd-yyyy HH:mm:ss.SSS'), 'yyyy-MM-dd')), to_date(from_unixtime(unix_timestamp('07-11-2021 10:55:12.084','MM-dd-yyyy HH:mm:ss.SSS'), 'yyyy-MM-dd'))) as date_diff

Cast any date format to be a date type

In the below, we have two dates, which changes the date format from one, to another format.

select to_date(from_unixtime(unix_timestamp('20210419' ,'yyyyMMdd'), 'yyyy-MM-dd'))

select to_date(from_unixtime(unix_timestamp('2021-04-19' ,'yyyy-MM-dd'), 'yyyy/MM/dd'))

Calculate date minus N days / hours/ minutes

In line one of the below, we take the current date and subtract 1 days from it. This will return the date format yyyy-MM-dd.

In line 2, we replace the ‘-‘ in the date and replace it with nothing, creating a yyyyMMdd format.

In line 3, we look at converting the output to any format we like, which we declare in the very last bit of the statement as yyyy/MM/dd.

In line 4, we’re looking at how to work out N days before any date (not just today), by passing our date 20210419 into the function.

Finally, we can look at subtracting 10 minutes from the timestamp. We convert the datetimestamp to unix_timestamp and subtract 10*60 seconds (10 minutes). The output of this query will be the same date, at 10:35:12.

select date_sub(current_date, 1)

select regexp_replace(date_sub(current_date, 1), '-', '')

select from_unixtime(unix_timestamp(cast(date_sub(current_date, 1) as string) ,'yyyy-MM-dd'), 'yyyy/MM/dd')

select date_sub(to_date(from_unixtime(unix_timestamp('20210419' ,'yyyyMMdd'), 'yyyy-MM-dd')), 1)

select from_unixtime(unix_timestamp('07-14-2021 10:45:12.084','MM-dd-yyyy HH:mm:ss.SSS') - 10 * 60) AS result

Add N days / hours / minutes to a date

In the top line below, we can take our date of any format, casting it to another format, while adding 1 day.

In the second line, we convert the datetimestamp to unix_timestamp and add 10*60 seconds (10 minutes). The output of this query will be the same date, at 10:55:12.

select date_add(to_date(from_unixtime(unix_timestamp('20210419' ,'yyyyMMdd'), 'yyyy-MM-dd')), 1)

SELECT from_unixtime(unix_timestamp('07-14-2021 10:45:12.084','MM-dd-yyyy HH:mm:ss.SSS') + 10 * 60) AS result

Find out what day of the week it is

We can find out what day of the week it is, by using the below structures. The first line looks at using the current date while the second line will take any input date and tell you which day of the week it is. Remember, the week starts on Sunday (Day 1).

select dayofweek(current_date())

select dayofweek(to_date(from_unixtime(unix_timestamp('20210419' ,'yyyyMMdd'), 'yyyy-MM-dd')))

Hopefully this has been a useful introduction to dates in SQL.

Kodey