Beginners Guide to Data Warehouse Using Hive Query Language
This article was published as a part of the Data Science Blogathon.
Have you ever wondered how big IT giants store and process huge amounts of data? Different organizations make use of different databases like an oracle database storing transactional data, MySQL for storing product data, and many others for different tasks. Storing the data in the database is not the only requirement but they need to analyze and extract meaningful information out of it so it can help in driving certain business decisions. As time and technology evolve the amount of data creation and ingestion also increases In 2006 Hadoop was launched which serves for storing and processing any type of data and the different organizations started using Hadoop. In 2006 Facebook switched from Oracle database which uses SQL queries to handle the database to Hadoop which understands only MapReduce. This became a redundant task for them to handle the database and they needed an interface that can convert SQL queries to MapReduce programs, run it in a cluster, and show you the results. This led to the creation of Hive and hive query language which we are going to explore in this article.
What is Hive?
Hive is a data warehousing package built on the top of Hadoop. A Data warehouse is a place where you store a massive amount of data. This data is always ready to be accessed, and ready to be reported so I have a BI tool like Power BI which can directly be installed on the data warehousing platform and produce intellectual reports.
- Hive is used for data analysis means to extract meaningful information from big data. It is created for the users who are comfortable with SQL because the default tool to process the data over Hadoop is Map-reduce which works with Java programming language and users not comfortable with java can use Hive to process the data over Hadoop.
- The query language of the hive has a separate name known as HiveQL or HQL (Hive query language). It is used for managing and querying structured data. while working with Hive there is no need for java.
In 2006-2007 RDBMS was used to store and manage the data. At that time Cron jobs were used for data collection (Job scheduling). And for ETL python was used. In this way to generate and collect data Oracle was used and could not support a large amount of data and needed more coding in java. SQL was easy to operate and in 2007 the Hive was developed by Facebook. And today more than 500 Tb of data is managed by Facebook.
Features of Hive
We know what is Hive and why it was developed by Facebook. It is also important to understand other than querying and analyzing the data what features Hive serves so it proves to be the best Hadoop warehousing tool for data analyzing and creating BI reports.
- It provides features of Partitioning and bucketing
- Easy to plug in custom map-reduce code. – If we want to process map-reduce unstructured data so we can embed it and process it using Hive
- Data is stored on HDFS – The processing of data takes place in Hive but originally the data always resides in Hive.
- It uses Hadoop for fault tolerance
- Hive is used for data mining, document indexing, predictive modelling (predicting user behaviour), custom facing UI.
Limitations of using Hive
Hive is simple, easy, and effective yet there are some of the limitations of any tool to work with and the same goes with Hive. When you start to work with a hive certain points need to understand and take care of and as well some points need to be avoided because Hive as a data warehousing tool is created to work with a massive amount of structured data stored in HDFS.
- Not recommended for Row-level updates – If you plan to insert, update, or delete in hive then it is not recommended on row-level updates because data is too large and who will like to play with a single row.
- Latency for queries is high – If you want the instant result from queries then hive is not recommended because hive works with a large amount of data and needs some time to execute the query.
- Not Designed for OLTP – Not recommended for online transaction processing
- When we load data in SQL and if there is any delimiter problem then it will throw an error in data loading while Hive will load the data successfully and when you run the query then it will throw the error in data type mismatch.
- RDBMS can be used for OLTP processing while Hive is not recommended use for OLTP data processing.
Setup Hive on your System
It is the perfect time to install Hive and get ready to practically learn Hive Query Language(HQL)
Step-1) Installing Java JDK
Hadoop completely runs on JAVA so the first thing is to install and set up the Java JDK path. You can install the latest JDK version through this link. If you already have java installed then no need to install it again.
Step-2) Set Path Variables
To make java available to all users you need to move it to the “/usr/local”. open root and type the following commands.
$ su password: # mv jdk1.7.0_71 /usr/local/ # exit
For setting up Java home path variables you can add the following commands to the ~/.bashrc file.
export JAVA_HOME=/usr/local/jdk1.7.0_71 export PATH=$PATH:$JAVA_HOME/bin
Step-3) Installing Hadoop
Hive is built on top of Hadoop so Hadoop must be installed on your system. If you have already installed the Hadoop tar file then on the command prompt check the version using the following command.
$ hadoop version
If you have not downloaded the Hadoop then download and extract Hadoop through these links. After then you can set Hadoop environment variables by appending the following commands to the ~/.bashrc file.
Step-4) Set Hadoop configurations
The core-site.xml file contains necessary information regarding port address, memory allocated to file system, etc. open the XML file and add the following properties between configuration tags.
Step-5) Download and Install Hive
Download Hive using this link. You can use the following command to verify the download and extract hive archive.
If you face any problem while setting the Hive environment then you can read all steps in detail and find all necessary configurations required on this blog.
Basic Hive Query Commands
The HQL query language is very similar to SQL. Only there are some additional functions and syntax to deal with CSV files and a large amount of data. If you have installed the hive then you can simply open the command prompt and type the hive which will launch the hive command line interface and make sure that the hive can talk to the appropriate Hadoop cluster. Type the below command which will show you the namenode address and IP address
🎯 To validate we are connected with the namenode we can list the files present in the root directory. So it makes sure that the hive is connected to the namenode.
dfs -ls /user/root
🎯 Now we are ready to run queries. To see all the databases present in the namenode use the below command.
🎯 switch to any database
🎯 To create a new database using the create command
create database database-name
🎯 To list the tables in a particular database we can use the below command.
Now if you are familiar with SQL then you must be familiar with the SQL clauses like FROM, where, group by, order by, having, etc. To understand the structure of the table uses the describe command.
🎯 Display all records from a particular table.
select * from table-name
And if you want to limit the output to a particular number of rows then you can use limit constraint.
Analyzing File Tabular Data using HQL
Now we will load one data file or create a table and insert some data into it and try to retrieve the data based on certain conditions. We have an orders table which has 4 columns. you can find the table here.
1) Using all Hive Query language get several orders by order status for a given date 2013-12-14.
select order_status, count(1), FROM orders where order_date = '2013-12-14 00:00:00.0' GROUP BY order_status ORDER BY order_status;
When we are not using join there is only one table in from clause and after that comes where clause in which we can give an expression which can be a derived field, constant, or valid expression that gives a value out of it.
2) write a Hive QL to get the number of complete orders for each date before ‘2013-12-14’.
select order_date, count(1) from orders where order_date <= '2013-12-14 00:00:00.0' AND order_status = 'Complete' GROUP BY order_date ORDER BY order_date;
3) Get several pending, reviewed, and on-hold orders for each date for the month of 2013 December.
select order_date, count(1) from orders where order_date LIKE '2013-12%' AND order_status in ('PENDING', 'PENDING_PAYMENT', 'PAYMENT_REVIEW', 'ONHOLD') GROUP BY order_date ORDER BY order_date;
We can also use between operator to find the records for a particular month which states that the record greater than the 1st date and smallest than the last date of the month.
select order_date, count(1) from orders where order_date BETWEEN '2013-12-01 00:00:00.0' AND '2013-12-31 00:00:00.0' AND order_status in ('PENDING', 'PENDING_PAYMENT', 'PAYMENT_REVIEW', 'ONHOLD') GROUP BY order_date ORDER BY order_date;
Create a Managed Table in Hive
Managed tables are also known as internal tables where the entire lifecycle of table data is managed and controlled by Hive. All the read and write operation with managed table is performed using Hive SQL commands. There is no precision in the hive in numeric data type and you can define precision on the string data type.
create Table orders_demo ( order_id INT, order_date DATE, order_cust_id INT, order_status VARCHAR(30) ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
Explanation ~ In MySQL it is date time and in Hive, we can give Date. Hive also supports string data type so instead of defining Varchar you can also assign it a string. After that, we need to define the row format which defines that the data that we are copying in the table is separated by which delimiter like in CSV files all the rows the fields are separated by a comma, sometimes by a pipe so this need to be defined. We can also load the files in which lines are terminated by a delimiter and mostly it is by a new line character. At last, you need to define the file format in which the file is stored like text file, sequence file, AVRO, etc. The default file format is a text file and you can also define a custom format.
To view the data structure using describe command and if you want to observe it in tabular structure then we can use the format keyword along with describing command.
describe formatted orders_demo
Creating External Table in Hive
An external table is a table where Hive has loose coupling with table data and while creating defining the external storage location is compulsory. First to load the data let us create one sample table and then we will create an external table.
CREATE TABLE deck_of_cards ( COLOR string, SUIT string, PIP string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE;
Load the data from the local file system
Most of the time you are supposed to download the data from some external means and then load it in Hive and perform all the required operations. Hive has very simple syntax to load the data from the local file system where you only need to define the type like the local file system and path where the data file is stored in your system.
LOAD DATA LOCAL INPATH '/root/demo/data/cards/deckofcards.txt' INTO TABLE deck_of_cards;
Load data from HDFS
HDFS is the main storage directory of Hadoop and being into the data analyzing team most of the time organization data is directly ingested into HDFS and you are supposed to load the data from HDFS to Hive and perform required operations and execute queries.
LOAD DATA INPATH '/user/root/cards/deckofcards.txt' INTO TABLE deck_of_cards;
Overwriting the Existing Data File
Sometimes wrong data might load into the system so without deleting the table structure you want to only change the table content or overwrite the content in the table so we use overwrite keyword with the Load command in Hive to overwrite the data over the existing file.
LOAD DATA LOCAL INPATH '/root/demo/data/cards/deckofcards.txt' OVERWRITE INTO TABLE deck_of_cards;
🎯 To create an external table we use the External keyword with create command. Download the text file that contains the data about cards through this GitHub link. here and then create a new folder in the root directory as cards and places a text file in the cards folder and then create an external table from the Hive terminal. And you can check that file is correctly placed inside the cards folder using the below command and then create a database as cards, switch to the cards database, and create an external table using the below commands.
hadoop fs -ls /user/root/cards
create database cards use cards
CREATE EXTERNAL TABLE deck_of_cards_external ( color string, suit string, pip string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/user/root/cards';
If you drop the external table then you will only lose the structure of the table and the data is not deleted. Indeed if you drop managed table then you will lose data as well as the table structure.
Create Hive Partitioned Table
Partitioning in hive means dividing the table into sub-parts based on the values of a particular column. The advantage of partitioning is the response time of query increases and becomes much faster to retrieve data. It is similar is remove functional dependency from tables in DBMS. As per the data modeling perspective, you should be aware of which column values you have to partition the data if you have a date necessary in data analysis so you might have to partition the data yearly, monthly, weekly, etc.
let us create a partition table called orders which we previously created and the only difference is the partition clause.
create TABLE orders ( order_id INT, order_date DATE, order_cust_id INT, order_status string ) PARTITIONED BY (order_month string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE;
If you list the files then you will not see anything but if you want to add a partition then use add partition command as shown below. If we want to retrieve the first 10 dates and 10 characters because the date is 10 characters from the order month then we can use the Unix time function
select from_unixtime(cast(substr(order_date, 1, 10) AS bigint)) from orders limit 10;
🎯 Let us print all the details including id, status, date, and month.
select order_id, from_unixtime(cast(substr(order_date, 1, 10) AS bigint)) order_date, order_cust_id, order_status, substr(from_unixtime(cast(substr(order_date, 1, 10) AS bigint)), 1, 7) order_month from orders limit 10;
Create Hive Bucketed Table (Bucketing)
Bucketing is similar to hash partitioning. The syntactical difference between partitioning and bucketing is that in the partition you not only specify the column names but also the data type. The concept of partitioning helps segregate the data in multiple directories while this feature might not be helpful in all scenarios. Suppose you want to partition tables based on the geographic condition or population of the country then every country will have a different number of partitions. Hence to solve this problem Bucketing is introduced in Hive.
In short bucketing is a hashing technique where the number of required buckets and data in each bucket depends on a certain hash function. And the bucketed table will create almost equally distributed data buckets. To divide the table into buckets we use Clustered By Clause.
Advantages of Bucketing
- Bucketed tables offer better sampling compared to non-bucketed tables.
- Bucketing can also be performed without partitioning.
- Similar to partitioning bucketed tables also offer a fast query response compared to non-bucketed tables.
- Bucketing offers flexibility to data storage in each bucket like you can sort the data based on one or more columns.
create table orders bucket ( order_id INT, order_date DATE, order_cust_id INT, order_status string ) CLUSTERED BY (order_id) INTO 16 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE;
Transactions in Hive (Insert, Update, Delete)
We have learned to partition, bucketing, query, and retrieve the data from Hive. It is very similar to SQL and the only benefit is it is scalable and can work with different file formats with efficient data analyzing and processing quality. It’s time to learn how we can insert, update or delete the data from Hive tables which is also known as DML in the standard database. First, create a sample table to learn how to perform transactions in the hive.
CREATE Table hive_transactions ( i INT, j string );
Just like SQL the syntax for inserting the data directly into the table is the same as using the Insert command.
INSERT INTO table hive_transactions- values (1, 'Analytics Vidhya');
You can run the SELECT command to extract the data. Now If you want to delete the transactions so you will use the delete command but the query will fail because unlike standard databases the hive has certain limitations on transactions. In DBMS you have the option to control the transaction through commit and rollback but in the hive, it is always auto-commit which means if the transaction is successful then it will be visible else not visible. Hive only supports transactions on ORC format files and the table needs to be bucketed.
CREATE Table hive_transactions (i INT, j string) CLUSTERED BY (i) INTO 4 buckets STORED AS ORC tblproperties('transactional' = 'true');
After running the above create command again try to insert the data into a table and it will execute fine. And now if we run the Update command then it will work.
UPDATE hive_transactions set j = 'vidhya' where i = 1;
You can preview the data then there is another directory where the file is updated. We can also perform a delete operation to delete a particular row.
DELETE from hive_transactions WHERE i = 1;
This is all about transactions performed in Hive and you can try advance inserts and inserting multiple rows at a single time in Hive.
Hive is a data warehousing tool that makes analyzing and processing of big data very simple, and efficient over Hadoop where with help of simple SQL queries along with some Delimiter and file formats you can control the massive amount of data. Hive proves to be the best Hadoop tool for data analysis purposes. In this article, we have started Hive with the theoretical part and then set up Hive on our system and played with different Hive queries. Let us summarize the important key points that we learned and need to be remembered in regards to Hive.
- Hive is a data warehousing tool that is built on top of Hadoop used for analyzing data and helping to generate intellectual reports with help of BI tools.
- Two main concepts to arrange and play with data in Hive are Partitioning and Bucketing.
- Partitioning will only work when there is a limited number of partitions or when we have an equal number of size partitions. Otherwise, we need to use Bucketing.
- Bucketing also creates the evenly stored data in buckets with help of the Hash and modulus function.
- While creating a table using select clauses you cannot change or customize the name and type of columns which we can do in some RDBMS software.
👉 I hope that it was easy to cope-up with each step and easily understandable. If you have any queries then feel free to post them in the comment section below or can connect with me. I hope you liked my article on hive query language.
👉 Connect with me on Linkedin.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.