Hive, in my experience, is a platform with extremely variable performance, which can make it difficult to predict how long your jobs will take to run. Below are a few key optimisation techniques we can use to help make our lives a bit better!
Choose the right file type
First of all, let’s talk about file types. It’s important to choose the right file type for your use-case.
| Feature | CSV / Text | AVRO | Parquet | ORC |
| --- | --- | --- | --- | --- |
| Description | Text and CSV files are really commonplace. They are OK for smaller data sizes, but CSV files do not allow for block compression, which adds a significant performance overhead when reading the data. | Avro files are row-based and suitable for write-heavy workloads. They store the schema in the file header, so they are self-describing. | Parquet files are column-oriented and good for read-heavy workloads. They can achieve up to 75% compression with Snappy, and only the required columns are read, reducing disk I/O. | ORC files are column-oriented and hence good for read-heavy workloads. They have really good compression. ORC stands for Optimised Row Columnar. |
| Use-case | Write-heavy workloads | Write-heavy workloads | Read-heavy workloads | Read-heavy workloads |
| Splittable across multiple disks | No | Yes | Yes | Yes |
To summarise the above:
- For small datasets and quick, messy analysis, there is no problem using CSV
- For heavy ETL processes, which require massive write-throughput, we should use AVRO
- For heavy analytical workloads in Spark, we should use Parquet
- For heavy analytical workloads in Hive, we should use ORC
To create an ORC table, we simply do the below.

```sql
-- Create an ORC table with Snappy compression
CREATE TABLE my_new_table (field1 INT, field2 STRING, field3 STRING)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");

-- Convert from text to ORC by loading an existing table into it
INSERT OVERWRITE TABLE my_new_table
SELECT * FROM tablename;

-- Equivalent multi-insert syntax
FROM table1
INSERT INTO TABLE my_new_table
SELECT *;
```
Use a map join
Out of the box joins in Hive can be a bit slow. If you are going to be joining one small table with a very large table, we can use a map join. This puts the smaller table into memory, making the join process much faster. When we set the hive.auto.convert.join parameter to be true, Hive will check if the smaller table is less than the hive.mapjoin.smalltable.filesize limit set on the platform. If it’s smaller, it will load the table into memory; if it’s bigger, it will use a standard join. We can use a map join with the below config.
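Expressed as configuration, that looks like the below. The 25 MB threshold is a commonly used default for `hive.mapjoin.smalltable.filesize`, so treat the exact value as an assumption to tune for your platform:

```sql
-- Let Hive automatically convert eligible joins into map joins
SET hive.auto.convert.join = true;
-- Tables smaller than this size (in bytes) are loaded into memory;
-- ~25 MB is a commonly used default
SET hive.mapjoin.smalltable.filesize = 25000000;
```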
You can also use skew joins when your dataset is skewed!
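As a sketch, skew joins are enabled with the settings below. `hive.skewjoin.key` is the row count above which a join key is treated as skewed; the value shown is the usual default, so tune it for your data:

```sql
-- Handle heavily skewed join keys in a separate map join
SET hive.optimize.skewjoin = true;
-- A join key with more rows than this is considered skewed
SET hive.skewjoin.key = 100000;
```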
Enable parallel execution
Hive converts our queries into a number of stages, which we can see in our DAGs. Some of these stages are independent (they don’t need to run sequentially). We can enable parallel execution for these kinds of jobs:
```sql
SET hive.exec.parallel = true;
```
Use vectorised execution
When we do table scans, aggregations, filtering, joins etc. on our data, Hive processes it (by default) one row at a time. We can instead make Hive process 1024 rows at a time, which leads to faster execution.
```sql
SET hive.vectorized.execution.enabled = true;
SET hive.vectorized.execution.reduce.enabled = true;
```
Use Tez
Tez is more performant than MapReduce; you can enable Tez with the below command.
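Assuming Tez is installed on your cluster:

```sql
-- Switch the execution engine from MapReduce to Tez
SET hive.execution.engine = tez;
```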
Do you need global sorting?
When you run an ORDER BY command, it can be really slow. This is because all the results have to pass through a single reducer to produce what you asked for (a total ordering of the data). We can get around this by using SORT BY, which sorts the data within each reducer. If per-reducer ordering is enough for your use-case, it’s worth considering!
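For illustration (`my_table` and `field1` are hypothetical names):

```sql
-- Global ordering: every row is funnelled through a single reducer (slow)
SELECT * FROM my_table ORDER BY field1 DESC;

-- Per-reducer ordering: each reducer sorts only its own output (faster)
SELECT * FROM my_table SORT BY field1 DESC;
```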
Manually adjust system resources
You can adjust container sizes etc. using some simple parameters, shown below. The full list of Tez parameters can be found in the Tez documentation.
```sql
SET hive.execution.engine = tez;
SET hive.tez.container.size = 6656;
-- Should be ~80% of the Tez container size
SET hive.tez.java.opts = -Xmx4g;
SET tez.am.resource.memory.mb = 4096;
```
There are of course plenty more ways to optimise your Hive job. However, the above, in conjunction with well written queries, should help you along your journey to more performant jobs!