A Comprehensive Guide to Apache Hive
This article was published as a part of the Data Science Blogathon.
Introduction on Apache Hive
Advanced big data tools must handle the massive amounts of structured and unstructured data generated daily. Data is not increasing only in terms of volume, but the variety and veracity of data are also growing. Big Data uses Hive on top of Hadoop to maintain its value. Data scientists and analysts mainly use Hive as a dedicated tool to turn the raw information into actionable content.
What is Hive?
Apache Hive is an open-source ETL and data warehousing infrastructure that processes structured data in Hadoop. It facilitates the reading, writing, summarizing, querying, and analyzing of massive datasets stored in distributed storage systems using Structured Query Language. Hive is helpful in performing frequent data warehousing jobs like Adhoc-querying, Data Encapsulation, and Analysis of massive datasets stored in distributed file systems like HDFS (Hadoop Distributed File System), which integrates Hadoop.
Apache Hive enables analytics at a vast scale and enhances fault tolerance, performance, scalability, and loose coupling with its input formats. What makes Hive unique is the ability to abstract the complexity of MapReduce jobs. Instead of writing the complex MapReduce jobs, we can write simple SQL-like queries, which reduces the overhead of remembering complex Java codes.
Facebook developed Hive to process their large volume of data(around 20TB per day), but the Apache Software Foundation later took it up. In addition, MNCs like Amazon and Netflix use it to query and analyze their data.
Features of Apache Hive
Below are the main features of Apache Hive that fabricate it into one of the most valuable data processing and analyzing tools for the current as well as the future industries:
Query massive datasets: Hive facilitates access to the files stored either directly in HDFS or in other data storage systems such as HBase and manages the vast datasets.
File Formats: Various types of file formats like textfile, ORC, Parquet, LZO Compression, SEQUENCE FILE, RCFILE (Record Columnar File), etc., are supported by Apache Hive.
Hive-Query Language: This language is similar to SQL. Only the basic knowledge of SQL is enough to work with Hive, such as tables, rows, columns, schema, etc. It makes learning more accessible by utilizing familiar concepts found in relational databases, such as columns, tables, rows, schema, etc. The most significant difference between HiveQL and SQL is that Hive performs queries on Hadoop’s infrastructure, whereas SQL performs queries on a traditional database.
Fast: Hive is a Fast, scalable, extensible tool that enhances the querying on Hadoop.
Partition Support: To improve the query performance, Hive uses directory structures to “partition” data. The partitions and buckets lead to fast data retrieval.
UDF Support: Programmers can define Hive user-defined functions (UDFs) for jobs like data cleansing and filtering as per their requirements. Built-in UDFs are used to manipulate strings, dates, and other data-mining and warehousing tools.
Storage Support: Apache Hive supports various storage types such as HDFS, Apache Hbase, plain text, CSV, XML, etc.
ETL Support: Apache Hive supports the ETL Functionalities, i.e., extract, transform, and load data into tables coupled with joins, partitions, etc.
Table Structure: Hive manages and processes only structured data, similar to RDBMS. So, firstly tables and databases get created; then data gets loaded into the respective tables. At the time of query execution, the Metadata storage in an RDBMS reduces the time to function semantic checks.
Ad-hoc Queries: Ad-hoc queries are variable-dependent queries whose value depends on some other variable, and Apache Hive also supports these queries.
Open-Source: There is no need to pay while using Apache Hive as it is an open-source tool.
Hive Architecture and Components
Apache Hive architecture consists mainly of three components:
Hive Storage and Computer
Let’s explore the core components of Apache Hive!
The Hive client is the interface through which we can submit the hive queries. It supports the applications written in any programming language like python, java, C++, Ruby, etc. With the help of JDBC, ODBC, and thrift drivers, it performs any queries on the Hive with the preferred language.
Apache Hive clients are divided into three categories:
Thrift Clients: As the Apache Hive is Thrift-based, it can handle the request from a thrift client.
ODBC Client: ODBC(Open database connectivity) client is the Apache Hive’s driver that allows client applications based on ODBC protocol to connect to Hive.
JDBC Client: Java database connectivity(JDBC) is used by Hive to connect with java applications that support JDBC protocol. Thrift is used by JDBC drivers to communicate with the Hive Server.
To perform all queries and Hive client integration, Hive offers multiple services like the Beeline, Hive server, Hive compiler, etc. If a customer wants to perform any Hive-related operations, then they can contact Hive Services.
Following are the services offered by Hive:
Apache Hive Server
Hive server1 is built on Apache Thrift protocol and is also referred to as Thrift Server. Thrift Server handles the cross-platform communication with Hive and allows various client apps to submit requests to Hive and retrieve the final results. But the problem with HiveServer1 is its inability to handle concurrent requests from multiple clients. To overcome this issue, we use HiveServer2 as the successor of HiveServer1. HiveServer2 provides the best support for open API clients like JDBC and ODBC and handles concurrent requests from multiple clients.
Apache Hive Driver
The Apache Hive driver receives the HiveQL statements from different sources like web UI, CLI, Thrift, and JDBC/ODBC. Then, it creates the session handles and transfers the query to the compiler.
HiveServer2 supports a command shell named the Beeline, which users use to submit their queries and commands to the system. SQL LINE CLI-based Beeline is a JDBC client.
The job of the Hive compiler is to parse the query, perform semantic analysis, and type-checking on the various query expressions and query blocks with the help of metadata stored in the meta store. It generates the execution plan as a DAG(Directed Acyclic Graph) and converts HiveQL queries into MapReduce jobs.
To improve the scalability and efficiency of an execution plan, the Optimizer splits the task and performs the transformation operations.
The execution engine executes the logical plan generated by the compiler and optimizer according to their dependencies with the help of the Hadoop cluster.
Metastore is a relational database and a central repository that is used to store the metadata information about the table structure and partitions. It also stores information about a column and its type, serializer, and deserializer, needed for the reading and writing data operations, and related HDFS files where data is stored. For meta store configuration, we can choose any of the two modes:
Embedded: In this mode, the client can use JDBC to interact with the meta store directly.
Remote: In this mode, the meta store is helpful for non-Java apps and acts as a Thrift service.
HCatalog is built on the top of Hive meta store and used as Hadoop’s storage management layer. It exposes the tabular data of Hive’s meta store and enables users with various data processing tools like YARN, Pig, etc., to quickly get and put data on the grid/table.
WebHCat is an HTTP interface that performs Hive metadata operations and acts as REST API for HCatalog. It is a service provider to the user for running Hadoop MapReduce, Hive, and Pig tasks.
Hive Storage and Computing
In Hive computing, services like Meta Store, file system, and work clients communicate with Hive storage and perform the following actions for the Hive repositories.
Hive chooses the Meta storage database to store the Metadata information of tables, schemas, columns in a table, their respective data types, and HDFS mapping.
Hive is built on top of Hadoop, so it uses the HDFS of the Hadoop cluster to store the query results and data loaded in the tables.
Working of Apache Hive
To get a holistic view of the data flow, below are the steps involved in the working of Apache Hive:
Step 1: Execute Query
A query is executed by data analysts on the User Interface (UI), such as the Web user interface or Command Line Interface. This Hive interface sends the queries to the driver and performs the task of query execution. In this, UI calls the execute interface to any database driver such as JDBC, ODBC, etc.) to execute.
Step 2: Get a Plan
The Driver then interacts with the compiler to parse the query, track the requirements, and perform syntax analysis. To make the execution plan, the driver creates a session handle for the query and transfers the query to the compiler.
Step 3: Get Metadata
Now, the compiler sends the metadata request to any database like the Meta store to retrieve the necessary metadata from the meta store.
Step 4: Send Metadata
Metastore sends metadata as an acknowledgment to the compiler, which is used for semantic analysis of the expressions in the query tree.
Step 5: Send Plan
After checking all the requirements, the compiler transfers the generated execution plan to the driver for query execution.
Step 6: Execute a Plan
After getting the execution plan from a compiler, the driver forwards the execution plan to the execution engine.
Step 7: Submit jobs to MapReduce
Now the execution engine transfers the job to the JobTracker, which is present in the Name node, and then assigns this job to the DataTracker, which is present in the Data node. In this step, the query executes the MapReduce job, and the Execution Engine performs metadata operations with the meta store.
Step 8: Fetch Results
The execution engine fetches the results from Data nodes to the User Interface.
Step 9: Send Results
The results are sent to the driver by the execution engine and loaded on the front end (UI).
What is HiveQL?
HiveQL stands for Hive Query Language, a high-level SQL-like programming language used by Hive to process and analyze structured or semi-structured data in a Metastore. Hive offers a command-line interface (CLI) to write queries using HiveQL. HiveQL is the best-known tool for traditional data warehousing tasks that focuses on users who are familiar with SQL. In addition, it detaches users from the complexity of MapReduce programming and reuses the concepts of RDBMS like tables, schema, rows, and columns, to accelerate learning. Hive uses MySQL for multiple user metadata storage and derby for single user metadata storage.
Difference between SQL and HiveQL
Purpose: Apache Hive can be preferred for both purposes, i.e., Batch query processing (extensive dataset handling) and Interactive query processing(real-time data handling), while SQL is more suitable for managing data stored in the form of a relational database management system (RDBMS).
Data Analysis: SQL is more suitable for querying more detailed data, while HiveQL is used to process and handle complex data efficiently.
Architecture: SQL is an RDBMS-based programming language, while HiveQL is a data warehousing project used for data analysis.
Data Types: Integral, Floating-Point, Binary Strings and Text, Fixed-Point, and Temporal are the only 5 data types supported by SQL. But, HiveQL supports 9 data types, including Boolean, Fixed-Point, Floating-Point, Integral, Temporal, Text and Binary Strings, Array, Map, and Struct.
Multitable inserts: SQL does not support the multi-table insertions, while Hive does.
MapReduce: No concept of MapReduce is present at SQL, while HIVE supports MapReduce.
OLTP: OLTP stands for Online Transactional Processing which is supported by SQL, not by HiveQL.
Schema support: The schema supported by SQL is used for data storage, while the HiveQL schema is supported for data insertion.
Views: The views in HiveQL are in read-only format, while in SQL, views can be updated.
Data Size: HiveQL can handle petabytes of data, while SQL can only handle terabytes of data.
Introduction to the Apache Hive Shell Commands
To communicate with Hive, a very powerful tool is used called HiveQL Shell. Firstly, we have to install the Hive successfully over the Hadoop Ecosystem; then, only we can communicate with the Hive environment with the help of Java API and the HiveQL Shell. We can open any remote client access software like putty to start the Cloudera and type Hive to enter in the Hive Shell. Hie Shell allows you to write HivdeQL statements and queries the structured data.
Following are the basic HiveQL commands:
Create a Database
To create a database in the Apache Hive, we use the statement “Create Database.” The database in Hive is nothing but a namespace or a collection of two or more tables.
The statement “show databases” is used to display all the databases available in your Hive prompt.
The statement “use” is used to enter a specific database in Hive. Whenever you want to check your current database or the database name where you are working, the function “current_database()” is used.
Create a Table
To create a table in Hive, HiveQL uses the “create table” statement.
( ,.. )COMMENT ‘Add if you want(optional)’LOCATION ‘Location On HDFS(optional)’ROW FORMAT DELIMITEDFIELDS TERMINATED BY ‘,’ ;
Comment:- An optional comment can be added to the table as well as to any specific column.
Location:- An optional location can override the default database location.
ROW FORMAT DELIMITED:- This is a compulsory option representing that every new line means a new record entry.
FIELDS TERMINATED BY ‘,’:- It represents that all the column values are separated using a comma.
Describe statement is used to see the table’s metadata, which means it will show all the columns of a table with their data types.
Like we use the insert command in SQL to add data into the tables, we use the Load data statement in Hive. We can insert our pre-created entries from the local system or HDFS in the Hive table.
Load data [local/HDFS] inpath '' INTO table
Select statement of HiveQl is similar to the SQL and retrieves the entire table data.
SELECT [ALL | DISTINCT] expression1, expression2, ... FROM table-name;
HiveQl uses alter table statement for mainly two purposes:
To rename the table
HiveQL statement “RENAME TO” will rename the table.
Alter table RENAME TO
To Add Columns
HiveQL statement “ADD COLUMNS” is used to add new columns to the existing table.
ADD COLUMNS(col1 data type, col2 data type, …);
Note:- Alter command is also used to drop columns from an existing table, change the column name or data type of an existing column, and replace an existing column with a new column.
Dropping a table from the Hive meta store deletes the entire table with all rows and columns.
Conclusion on Apache Hive
The hive itself is a very vast concept; in this guide, we learned about some basic concepts of Apache Hive and HiveQL.
We learned about Hive, its features, architecture, and its components.
This guide also discussed how to interact with the Hive shell and perform various Linux-based HiveQL commands.
We also made a comparison of HiveQL with the structured query language.
We don’t have to install Hive explicitly. Instead, we can open the Cloudera to run these Linux-based Hive commands.
Cloudera and Apache natively support Apache Hive.
I hope this guide on Hive has helped you to gain a better understanding of how Hive works.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.