Harshitn — January 6, 2022
Beginner Data Cleaning Data Engineering Data Warehouse Database
Hive

INTRODUCTION

Hive is one of the most popular data warehouse systems in the industry for data storage, and to store this data Hive uses tables. Tables in the hive are analogous to tables in a relational database management system. Each table belongs to a directory in HDFS. By default, it is /user/hive/warehouse directory. For instance, a table named students will be located at /user/hive/warehouse/students.

In this article we shall discuss the two types of tables present in Hive:

1. INTERNAL TABLE (Managed Table)

2. EXTERNAL TABLE

Internal Table

When a user creates a table in Hive it is by default an internal table created in the /user/hive/warehouse directory in HDFS which is its default storage location. The data present in the internal table will be stored in this directory and is fully managed by Hive and thus an internal table is also referred to as a managed table.

Creating a Table

 

For creating an internal table, we use the following command:

CREATE TABLE IF NOT EXISTS student_internal
( 
  name string,
  class ARRAY,
  gender_age STRUCT,
  subj_score MAP
                 )
COMMENT ' Internal student table'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':'
STORED AS TEXTFILE;

As seen above, the internal table has 4 columns-

Column Type
 name String
 class String array
 gender_age Struct (to hold different data types in one structure)
subj_score MAP(to hold subjects and their scores)

 

 

 

 

 

By default, the table created in the hive is an internal table, so we need not specify an internal keyword while table creation.

In the above query, we use the “comment”  keyword to refer to the table as “Internal student table” for our own reference, but you can comment it anything you want.

By using the keywords rows format delimited and fields terminated by (’|’) we inform Hive about the format of the data. The columns can be differentiated using the pipe symbol (|). This helps hive to understand how to read and write the data.

Map keys in the ‘subj_score’ column can be differentiated by the (:) symbol and the collection items in the ‘gender_age’ struct is differentiated using the (,) symbol.

Now that we have learned to create an internal table let’s see how we can load the data and populate it.

Loading the Data

To load the data, we use the following command:

LOAD DATA LOCAL INPATH '/home/Hadoop/student.txt'
OVERWRITE INTO TABLE student_internal

Here we are importing the data from a file present in the local HDFS path-‘/home/Hadoop/student.txt’ and overwriting or we can say loading it into the student_internal table we created above.

Storage

An internal table is stored on HDFS in the /user/hive/warehouse directory which is its default storage location. This location can be changed by updating the path in the configuration file present in the config file – hive.metastore.warehouse.dir.

We can also alter the location of the table by providing a new path present in HDFS using the SET LOCATION clause.

 ALTER TABLE student_internal
SET LOCATION
 'hdfs://localhost:8020/user/tables/student';

Security

Hive is exclusively responsible for the security and management of the data present in the internal table.

If we delete the table using the drop command, then the metadata, as well as the data present in the table, will get deleted from the master node and HDFS respectively. So, it is strongly advised to never use the drop command on an internal table because it may be serving a production line of a business.

As a solution to this problem, we can use ENABLE NO_DROP clause to alter the table which will prevent the table from getting dropped.

> ALTER TABLE student_internal ENABLE NO_DROP;

If we want to prevent the data from being queried, then we can use the ENABLE ONLINE clause.

> ALTER TABLE student_internal ENABLE OFFLINE;

Usage

We may use an internal table if:

1. Data is temporary and doesn’t affect businesses in real-time.

2. If we want the hive to manage the data and the tables.

 

External Table

When a user creates a table in Hive specifying the external keyword, then an external table is created. The data present in the external table will be fully managed by HDFS contrary to an internal table.

Creating a Table

For creating an external table, we use the flowing command:

>CREATE EXTERNAL TABLE  student_external
(
 name string,
class ARRAY,
gender_age STRUCT,
subj_score MAP
            )
                    COMMENT ' External student table'
       ROW FORMAT DELIMITED
           FIELDS TERMINATED BY '|'
                    COLLECTION ITEMS TERMINATED BY ','
             MAP KEYS TERMINATED BY ':'
      STORED AS TEXTFILE;
                     LOCATION '/user/tables/students';

                As seen above, the external table has 4 columns-

Column Type
name String
class String array
gender_age Struct (to hold different data types in one structure)
subj_score MAP(to hold subjects and their scores)

 

 

 

 

 

External table creation query has the same intuition and syntax as the internal table.

Note that while creating an external table we specify the keyword EXTERNAL to inform Hive to create an external table for us.

Now that we have learned to create an external table let’s see how we can load the data and populate it.

Loading the Data

To load the data, we use the following command:

>LOAD DATA LOCAL INPATH '/home/Hadoop/student.txt'
 OVERWRITE INTO TABLE student_external

Similar to the internal table, we are importing the data from a file present in the local HDFS path-‘/home/Hadoop/student.txt’ and overwriting or we can say loading it into the student_external table we created above.

Storage

An external table is stored on HDFS or any storage compatible with HDFS, because we want to use the data outside of Hive. Thus, Hive is not responsible for managing the storage of the external table. Tables can be stored on an external location for instance on a cloud platform like google cloud or AWS.

Security

External tables and the source files are linked but not entirely. There is some level of abstraction between the external table and the source files.

Let me explain through an example, suppose there is an external table named ‘ext_student’ in Hive which is accessing the data through the source file named ‘studentdb.txt’, then if I delete the table ‘ext_student’ then the file linked to it ‘studentdb.txt’ will not get deleted from HDFS.

So, by deleting the table from Hive, only the metadata from the master node gets deleted and not the actual source file.

Now, this presents an interesting situation as we are no longer able to query the data as the schema gets deleted when we drop the table, so to get back the ability to query the data we can simply create the same table again using the table creation commands and point it to the same location as it was previously and we can query the data seamlessly again.

Security of the external tables is managed at the HDFS level as anyone having access to the HDFS file structure can access an external table.

Usage 

We may use an external table if:

1. We want to use data outside HIVE for performing a different operations such as loading and merging.

2. The data is of production quality.

 

ENDNOTES

In this article, we have learned to differentiate between internal and external tables in Hive on various dimensions namely storage, security and usage.

If you have any questions related to this article do let me know in the comments section below.

Read more articles on Hive on our blog. Click here.

About the Author

Harshitn

Our Top Authors

Download Analytics Vidhya App for the Latest blog/Article

Leave a Reply Your email address will not be published. Required fields are marked *