
Getting Started with Apache Hive – A Must Know Tool For all Big Data and Data Engineering Professionals

Overview

  • Understand the Apache Hive architecture and how it works.
  • Learn to perform some basic operations in Apache Hive.

 

Introduction

Most data scientists use SQL queries to explore data and get valuable insights from it. Now, as the volume of data is growing at such a high pace, we need dedicated tools to deal with big volumes of data.

Initially, Hadoop came up and became one of the most popular tools to process and store big data. But developers were required to write complex map-reduce code to work with Hadoop. This is where Apache Hive, developed at Facebook, came to the rescue. It is another tool designed to work with Hadoop. We can write SQL-like queries in Hive, and in the backend it converts them into map-reduce jobs.

In this article, we will see the architecture of Hive and how it works. We will also learn how to perform simple operations like creating a database and a table, loading data into a table, and modifying a table.

 

Table of Contents

  1. What is Apache Hive?
  2. Apache Hive Architecture
  3. Working of Apache Hive
  4. Data Types in Apache Hive
  5. Create and Drop Database
  6. Create and Drop Table
  7. Load Data into Table
  8. Alter Table
  9. Advantages/Disadvantages of Hive

 

What is Apache Hive?

Apache Hive is a data warehouse system developed by Facebook to process huge amounts of structured data in Hadoop. We know that to process data using Hadoop, we need to write complex map-reduce functions, which is not an easy task for most developers. Hive makes this work very easy for us.

It uses a query language called HiveQL, which is very similar to SQL. So now, we just have to write SQL-like commands, and in the backend Hive will automatically convert them into map-reduce jobs.

 

Apache Hive Architecture

Let’s have a look at the components that make up the Hive architecture.

  • Hive Clients: Hive allows us to write applications using different types of clients, such as Thrift clients and the JDBC driver for Java, and it also supports applications that use the ODBC protocol.
  • Hive Services: As a developer, if we wish to process any data, we need to use the hive services such as hive CLI (Command Line Interface). In addition to that hive also provides a web-based interface to run the hive applications.
  • Hive Driver: It is capable of receiving queries from multiple sources like Thrift, JDBC, and ODBC via the Hive server, and directly from the Hive CLI and the web-based UI. After receiving the queries, it transfers them to the compiler.
  • HiveQL Process Engine: It receives the execution plan from the compiler and converts the SQL-like query into map-reduce jobs.
  • Meta Store: Here Hive stores meta-information about the databases, like the schema of each table, the data types of the columns, the location in HDFS, etc.
  • HDFS: It is simply the Hadoop distributed file system used to store the data. I would highly recommend you to go through this article to learn more about the HDFS: Introduction to the Hadoop Ecosystem

 

Working of Apache Hive

Now, let’s have a look at how Hive works over the Hadoop framework.

  1. In the first step, we write down the query using the web interface or the command-line interface of Hive. The interface sends the query to the driver for execution.
  2. In the next step, the driver sends the received query to the compiler where the compiler verifies the syntax.
  3. And once the syntax verification is done, it requests metadata from the meta store.
  4. Now, the meta store provides metadata such as the database, tables, and data types of the columns back to the compiler in response.
  5. The compiler again checks all the requirements received from the meta store and sends the execution plan to the driver.
  6. Now, the driver sends the execution plan to the HiveQL process engine where the engine converts the query into the map-reduce job.
  7. After the query is converted into a map-reduce job, it sends the task information to Hadoop, where the processing of the query begins. At the same time, it updates the metadata about the map-reduce job in the meta store.
  8. Once the processing is done, the execution engine receives the results of the query.
  9. The execution engine transfers the results back to the driver, which finally sends them to the Hive user interface, from where we can see the results.

 

Data Types in Apache Hive

Hive data types are divided into the following 5 different categories:

  1. Numeric Types: TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, DECIMAL
  2. Date/Time Types: TIMESTAMP, DATE, INTERVAL
  3. String Types: STRING, VARCHAR, CHAR
  4. Complex Types: STRUCT, MAP, UNIONTYPE, ARRAY
  5. Misc Types: BOOLEAN, BINARY

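To see how several of these types fit together, here is a sketch of a table definition that mixes them. The table and column names are purely hypothetical examples:

```sql
-- Hypothetical table mixing several Hive data types
CREATE TABLE employee_profile (
    emp_id        INT,                             -- numeric type
    name          STRING,                          -- string type
    salary        DECIMAL(10,2),                   -- numeric type
    joined_on     DATE,                            -- date/time type
    is_active     BOOLEAN,                         -- misc type
    skills        ARRAY<STRING>,                   -- complex type
    phone_numbers MAP<STRING, STRING>,             -- complex type
    address       STRUCT<city:STRING, zip:STRING>  -- complex type
);
```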

Create and Drop Database

Creating and dropping a database is very simple and similar to SQL. We need to assign a unique name to each database in Hive. If the database already exists, Hive will throw an error, and to suppress this error you can add the keywords IF NOT EXISTS after the DATABASE keyword.

CREATE DATABASE <<database_name>> ;
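For instance, with a hypothetical database name sales_db, the IF NOT EXISTS variant looks like this:

```sql
-- Create the database only if it does not already exist
CREATE DATABASE IF NOT EXISTS sales_db;

-- List all databases and switch to the new one
SHOW DATABASES;
USE sales_db;
```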

Dropping a database is also very simple: you just need to write DROP DATABASE followed by the name of the database to be dropped. If you try to drop a database that doesn’t exist, it will give you a SemanticException error.

DROP DATABASE <<database_name>> ;
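Similarly, IF EXISTS suppresses the SemanticException, and adding CASCADE drops the database even if it still contains tables (sales_db is again a hypothetical name):

```sql
-- No error even if the database does not exist
DROP DATABASE IF EXISTS sales_db;

-- Drop a database together with all of its tables
DROP DATABASE IF EXISTS sales_db CASCADE;
```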

 

Create Table

We use the create table statement to create a table and the complete syntax is as follows.

CREATE TABLE IF NOT EXISTS <<database_name.>><<table_name>> 
                           (column_name_1 data_type_1, 
                            column_name_2 data_type_2,
                            .
                            .
                            column_name_n data_type_n)
                            ROW FORMAT DELIMITED FIELDS 
                            TERMINATED BY '\t'
                            LINES TERMINATED BY '\n'
                            STORED AS TEXTFILE;

If you are already using the database, you are not required to write database_name.table_name; in that case, you can write only the table name. In the case of big data, most of the time we import the data from external files, so here we can pre-define the delimiter used in the file and the line terminator, and we can also define how we want to store the table.

There are two different types of Hive tables, internal and external tables. Please go through this article to know more about the concept: Types of Tables in Apache Hive: A Quick Overview
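Putting the syntax above together, a concrete example of a tab-delimited table might look like the following (the database, table, and column names are hypothetical):

```sql
-- Tab-delimited text table in the hypothetical sales_db database
CREATE TABLE IF NOT EXISTS sales_db.employees (
    emp_id INT,
    name   STRING,
    city   STRING,
    salary FLOAT)
ROW FORMAT DELIMITED FIELDS
TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
```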

 

Load Data into Table

Now that the table has been created, it’s time to load data into it. We can load the data from any local file on our system using the following syntax.

LOAD DATA LOCAL INPATH '<<path of file on your local system>>' 
                       INTO TABLE
                       <<database_name.>><<table_name>> ;

When we work with a huge amount of data, there is a possibility of having unmatched data types in some of the rows. In that case, Hive will not throw any error; rather, it will fill those fields with NULL values. This is a very useful feature, as loading big data files into Hive is an expensive process and we do not want to reload the entire dataset just because of a few rows.
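As an illustration, assuming a tab-delimited file and the hypothetical table from earlier (the file paths and names here are made up):

```sql
-- Load from the local file system (the file is copied into HDFS)
LOAD DATA LOCAL INPATH '/home/user/employees.txt'
INTO TABLE sales_db.employees;

-- Drop the LOCAL keyword to load a file that is already in HDFS,
-- and add OVERWRITE to replace the existing contents of the table
LOAD DATA INPATH '/data/employees.txt'
OVERWRITE INTO TABLE sales_db.employees;
```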

 

Alter Table

In Hive, we can make multiple modifications to existing tables, like renaming a table or adding more columns to it. The commands to alter a table are very similar to the SQL commands.

Here is the syntax to rename the table:

ALTER TABLE <<table_name>> RENAME TO <<new_name>> ;

Syntax to add more columns to the table:

-- to add more columns
ALTER TABLE <<table_name>> ADD COLUMNS 
                           (new_column_name_1 data_type_1,
                            new_column_name_2 data_type_2,
                            . 
                            .
                            new_column_name_n data_type_n) ;
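Using the hypothetical employees table again, the two alterations above could look like this; CHANGE, which renames or retypes an existing column, is also shown:

```sql
-- Rename the table
ALTER TABLE employees RENAME TO staff;

-- Add new columns to the table
ALTER TABLE staff ADD COLUMNS (department STRING, joined_on DATE);

-- Rename an existing column and change its data type
ALTER TABLE staff CHANGE salary monthly_salary DECIMAL(10,2);
```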

Advantages/Disadvantages of Apache Hive

  • Uses an SQL-like query language that is already familiar to most developers, which makes it easy to use.
  • It is highly scalable; you can use it to process data of any size.
  • Supports multiple databases, like MySQL, Derby, Postgres, and Oracle, for its metastore.
  • Supports multiple data formats, and also allows indexing, partitioning, and bucketing for query optimization.
  • Can only deal with cold (batch) data and is not suited to processing real-time data.
  • It is comparatively slower than some of its competitors; Hive is a good fit only if your use case is mostly about batch processing.

 

End Notes

In this article, we have seen the architecture of Apache Hive and how it works, along with some of the basic operations to get started with. In the next article of this series, we will see some of the more complex and important concepts of partitioning and bucketing in Hive.

If you have any questions related to this article do let me know in the comments section below.
