Nausheen Aafreen — Published On February 20, 2022 and Last Modified On March 22nd, 2022

This article was published as a part of the Data Science Blogathon.

Introduction

Apache Hive is a data warehouse system built on top of Hadoop that gives users the flexibility to express complex MapReduce programs as SQL-like queries. Performance tuning is an essential part of running Hive queries, as it helps the programmer reduce execution time and increase the efficiency of the code. By applying the correct approaches and best optimization practices, we can achieve significant savings in both the run time and storage of Hive queries, as well as improve the performance of our cluster, and it is here that performance tuning plays an important role.

In this article, I want to share some of the practices of Performance Tuning configuration and queries, which I have found effective while using Hive.

First of all, let’s understand the steps for tuning Hive queries:

  1. Identify: In this step, we list all the Hive codes which consume more time.
  2. Tune: We find out the best solutions without interfering with the functionality of the code and make sure that the code is improved.
  3. Test and Deploy: We test the changed code and then deploy it in production. We should monitor the running time of the changed code for a few days.
  4. Measure: In this step, we measure whether the changed code actually improves run time, comparing its new run timings with the past production timings.
  5. Continue improvements: We keep looking for possible solutions to improve further.

Types of Performance Tuning Techniques

There are several Hive optimization techniques, both configuration-wise and query-wise, that we can apply when running Hive queries to improve their performance:

1 Avoid locking of tables

It is extremely important to make sure that the tables used as sources in a Hive query are not being used by another process at the same time. Otherwise, the tables can get locked and our query can be stuck for an unknown amount of time.

We can use the parameters below for making sure that the tables are not being locked:

set hive.support.concurrency=false;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager;
set hive.txn.strict.locking.mode=false;

2 Use the Hive execution engine as TEZ

While executing Hive queries, the TEZ execution engine is the preferred choice because it eliminates unnecessary disk access: it takes data from disk once, performs the calculations, and produces the output, thus saving us multiple disk traversals. We can consider TEZ a much more flexible and powerful successor to the MapReduce framework.

We can set the parameter below to use the TEZ engine:

set hive.execution.engine=tez;

3 Use Hive Cost Based Optimizer (CBO)

Apache Hive provides a cost-based optimizer to improve performance. It examines the query cost to generate efficient execution plans: how to order joins, which type of join to perform, the degree of parallelism, and so on. The statistics these decisions rely on are gathered by ANALYZE statements or from the metastore itself, ultimately cutting down query execution time and reducing resource utilization.

We can set the parameter using:

set hive.cbo.enable=true;
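Since the optimizer's decisions are only as good as the statistics available to it, it helps to gather statistics before relying on CBO. A minimal sketch, assuming a hypothetical table `table1`:

```sql
-- Gather table-level statistics (row counts, sizes) for the optimizer:
ANALYZE TABLE table1 COMPUTE STATISTICS;
-- Gather column-level statistics (distinct values, min/max, nulls),
-- which CBO uses for join ordering and join-type decisions:
ANALYZE TABLE table1 COMPUTE STATISTICS FOR COLUMNS;
```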

4 Parallel execution at a Mapper & Reducer level

We can improve the performance of aggregations, filters, and joins in our Hive queries by using vectorized query execution, which scans data in batches of 1,024 rows at once instead of a single row at a time.

We should explore the parameters below, which bring in more parallelism and can significantly improve query execution time:

set hive.vectorized.execution.enabled=true; 
set hive.exec.parallel=true;

For example:

Select a.*, b.* from
(select * from table1 ) a
Join
(select * from table2 ) b
On a.id=b.id
;

As we can see, the two subqueries are independent, so parallel execution lets Hive run them at the same time, which might increase efficiency.

One important thing to note is that parallel execution will increase cluster utilization. If the cluster utilization is already very high, parallel execution will not help much.

5 Use STREAMTABLE option

When we are joining multiple tables, we can use STREAMTABLE option. By default, the right-most table gets streamed.

For example: if we join ‘huge_table’ with ‘small_table’ (in that order), by default ‘small_table’ gets streamed, as it is the rightmost table. In this case, ‘huge_table’, being the bigger table, will be buffered into memory and might cause Java heap space issues, or the job might run longer. What we can do here is add /*+ STREAMTABLE(huge_table) */, which makes ‘huge_table’ the streamed table rather than the one pulled into memory.

Hence, in this way, we are freed from having to remember the order of the joined tables.
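A sketch of the hint in a full query, reusing the ‘huge_table’ and ‘small_table’ example from above (the column names are assumptions for illustration):

```sql
-- huge_table is streamed; small_table is buffered into memory:
Select /*+ STREAMTABLE(huge_table) */ h.col1, s.col2
from huge_table h
join small_table s
on h.id = s.id;
```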

6 Use Map Side JOIN Option

If one of the tables in the join is small enough to be loaded into memory, we can force a map-side join (MAPJOIN) as shown below:

Select /*+ MAPJOIN(small_table) */ large_table.col1, large_table.col2 from large_table join small_table on large_table.col1 = small_table.col1;

A map-side join is performed entirely within the mappers, without a reduce step.

Also, we can let the execution engine take care of this automatically by setting hive.auto.convert.join to true:

set hive.auto.convert.join=true;

7 Avoid Calculated Fields in JOIN and WHERE clause

We should avoid using calculated fields in JOIN and WHERE clauses, as they make the Hive query take a long time to run. We can use a CTE (Common Table Expression) to pre-compute those values and optimize our queries.

For example:

Original query:

select a.col1, b.col2 from table1 as a join table2 as b on (a.col1 + 50 = b.col2);

Optimized query:

with CTE as
(select a.col1 + 50 as C1 FROM table1 )
select CTE.C1, b.col2 from CTE join table2 b on (CTE.C1 = b.col2);

8 Use SORT BY instead of ORDER BY

Hive supports both ORDER BY and SORT BY clauses. ORDER BY produces a total ordering on a single reducer, which causes a performance bottleneck. SORT BY, in contrast, orders the data only within each reducer, performing a local ordering in which each reducer’s output is sorted, ensuring better performance.
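For example, on a hypothetical ‘sales’ table:

```sql
-- Total ordering: all rows funnel through a single reducer.
select id, amount from sales order by amount desc;
-- Local ordering: each reducer sorts only its own output, in parallel.
select id, amount from sales sort by amount desc;
```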

9 Select only the columns that are needed

While using Hive, if we need only a few columns from a table, we should avoid SELECT * FROM, as it adds unnecessary time to the execution.
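For example, if only two columns are needed (the table and column names are illustrative):

```sql
-- Avoid: select * from table1;
select col1, col2 from table1;
```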

10 Suitable Input format Selection

Using an appropriate file format based on the data can significantly increase query performance. Hive comes with columnar input formats like RCFile, ORC, etc. Compared with the Text, Sequence, and RC file formats, ORC shows better performance because Hive has a vectorized ORC reader, which reduces read operations in analytical queries by allowing each column to be accessed individually.
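For example, a table can be stored in ORC at creation time; this is a sketch with assumed table and column names (the SNAPPY compression setting is optional):

```sql
create table sales_orc (id int, amount double)
stored as orc
tblproperties ("orc.compress"="SNAPPY");
```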

11 Limit (Filter) the data as early as possible 

This is the fundamental principle of any tuning: filter or drop records as early in the process as possible so that we avoid long-running queries and unnecessary intermediate results.

For example :

a join b where a.col is not null

can be written as

(select * from a where a.col is not null) join b

12 Use Multi Query Inserts

Here we have a common dataset (a superset) and use it to insert into multiple tables based on specific conditions in the WHERE clauses. This helps load multiple tables in parallel when they share a common superset of data.
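Hive's multi-insert syntax scans the common source once and fans out into several target tables. A sketch with hypothetical table and column names:

```sql
-- One scan of the superset feeds both target tables:
from staging_sales s
insert overwrite table high_value_sales select s.* where s.amount > 1000
insert overwrite table low_value_sales  select s.* where s.amount <= 1000;
```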

13 Modularize the code into logical pieces

This helps with the maintenance of the code and with restarting jobs in failure scenarios. It can also be used to achieve parallelism by running mutually independent modules side by side, thus saving time.

Takeaway Notes

Use hive parameters judiciously

We should avoid using a lot of Hive parameters when they are not needed. To use them efficiently, we should first understand the data well; otherwise, we might end up impacting overall cluster utilization and, with it, other processes.

Identify the right targets

Most of these techniques will work on all jobs, but spend the effort where it counts: start with the jobs that will have the biggest impact on performance, that is, the ones that take many hours!

Even small gains add up quickly

For example, a 10% gain on a 5-hour job saves 30 minutes of processing time!

Conclusion

In this article, we covered Hive optimization techniques for Hive queries, both configuration-wise and query-wise, and how performance tuning can be implemented in Hive. I hope this article helped you understand how to efficiently tune the performance of Hive queries so they execute faster on your clusters. Please don’t forget to drop in your comments if you have any doubts.

About the Author

Nausheen Aafreen is a Big Data Engineer and Data Research Analyst working in a multinational artificial intelligence company. She has spent over 5.6 years working in Data Engineering and Data Science projects. She worked in various technologies including big data, automation, and data research.
