Parameters to make your Hive queries perform better

Hive, in my experience, is a platform with extremely variable performance, which can make it difficult to predict how long your jobs will take to run. Below are a few key optimisation techniques we can use to make our lives a bit better!

Choose the right file type

First of all, let’s talk about file types. It’s important to choose the right file type for your use-case.

A brief description of each:

  • CSV / Text: text and CSV files are really commonplace. They are OK for smaller data sizes, but CSV files do not allow for block compression, which adds a significant performance overhead when reading the data.
  • Avro: row-based and suitable for write-heavy workloads. The schema is stored in the header of the file, so Avro files are self-describing.
  • Parquet: column-oriented and good for read-heavy workloads. Parquet can achieve up to 75% compression with Snappy, and only the required columns are read – reducing disk I/O.
  • ORC: column-oriented, with really good compression, and hence good for read-heavy workloads. ORC stands for Optimised Row Columnar.

| Feature                          | CSV / Text  | Avro        | Parquet    | ORC        |
|----------------------------------|-------------|-------------|------------|------------|
| Compression                      | N/A         |             | Good       | Excellent  |
| Best for                         |             |             | Spark      | Hive       |
| Use-case                         | Write-heavy | Write-heavy | Read-heavy | Read-heavy |
| Splittable across multiple disks | No          | Yes         | Yes        | Yes        |

To summarise the above:

  • For small datasets and quick, messy analysis, there is no problem using CSV
  • For heavy ETL processes, which require massive write throughput, we should use Avro
  • For heavy analytical workloads in Spark, we should use Parquet
  • For heavy analytical workloads in Hive, we should use ORC

To create an ORC table, we simply do the below.

CREATE TABLE my_new_table (field1 int, field2 string, field3 string) STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");

INSERT OVERWRITE TABLE my_new_table SELECT * FROM tablename;

-- convert from text to ORC: Hive's multi-insert form reads the source table once

FROM table1
INSERT INTO TABLE my_new_table SELECT *;

Use a map join

Out-of-the-box, joins in Hive can be a bit slow. If you are going to be joining one small table with a very large table, we can use a map join. This loads the smaller table into memory, making the join process much faster. When we set the hive.auto.convert.join parameter to true, Hive will check whether the smaller table is under the hive.mapjoin.smalltable.filesize limit set on the platform. If it's smaller, it will load the table into memory; if it's bigger, it will fall back to a standard join. We can enable map joins with the below config.

set hive.auto.convert.join = true;
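In practice you would often set the size threshold alongside the flag, and on older Hive versions you can also request a map join explicitly with a hint. A sketch – the 25 MB threshold and the sales/dim_store tables are purely illustrative, and newer Hive versions ignore the hint unless hive.ignore.mapjoin.hint is set to false:

set hive.auto.convert.join = true;
set hive.mapjoin.smalltable.filesize = 25000000; -- bytes; tables under this size are loaded into memory

SELECT /*+ MAPJOIN(d) */ s.*, d.region
FROM sales s JOIN dim_store d ON s.store_id = d.store_id;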

You can also use skew joins when your dataset is skewed!
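Skew joins have their own switches; a minimal sketch, assuming defaults (100000 is, to my knowledge, Hive's default key threshold, so tune it to your data):

set hive.optimize.skewjoin = true;
set hive.skewjoin.key = 100000; -- keys appearing more than this many times are treated as skewed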

Parallelisation

Hive converts our queries into many different stages, which we can see in our DAGs. Some of these stages can be independent (they don’t need to happen in a linear fashion). We can enable parallel execution for these kinds of jobs:

set hive.exec.parallel = true;
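Relatedly, you can cap how many independent stages run at once; hive.exec.parallel.thread.number controls this (8 is, to my knowledge, the default):

set hive.exec.parallel.thread.number = 8;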

Vectorize!

When we do table scans, aggregation, filtering, joins etc. on our data, Hive processes it (by default) one row at a time. We can tell Hive to process 1024 rows at a time instead, which leads to faster execution.

set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;

Use Tez

Tez is more performant than MapReduce; you can enable the use of Tez with the below command.

set hive.execution.engine=tez;

Do you need global sorting?

When you run an ORDER BY command, it can be really slow. This is because all the results have to be funnelled through a single reducer to produce the result you want (a fully sorted list of all the data). We can get around this by using SORT BY, which sorts the data within each reducer. If per-reducer ordering is enough, then it’s worth considering!
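As a sketch, assuming a hypothetical sales table – SORT BY orders rows within each reducer, and pairing it with DISTRIBUTE BY keeps all rows for a given key on the same reducer:

-- ORDER BY funnels everything through a single reducer for a total ordering
SELECT customer_id, amount FROM sales ORDER BY amount DESC;

-- SORT BY only orders rows within each reducer
SELECT customer_id, amount FROM sales
DISTRIBUTE BY customer_id
SORT BY amount DESC;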

Manually adjust system resources

You can adjust container sizes and the like using the simple parameters below. The full list of Tez parameters can be found in the Tez configuration documentation.

set hive.execution.engine=tez;
set hive.tez.container.size=6656;
set hive.tez.java.opts=-Xmx4g; -- should be around 80% of the Tez container size
set tez.am.resource.memory.mb=4096;

There are of course plenty more ways to optimise your Hive job. However, the above, in conjunction with well written queries, should help you along your journey to more performant jobs!

Kodey