HIVE: INTERNAL AND EXTERNAL TABLES

harshitn Last Updated : 06 Jan, 2022

5 min read

INTRODUCTION

Hive is one of the most popular data warehouse systems in the industry for data storage, and to store this data Hive uses tables. Tables in the hive are analogous to tables in a relational database management system. Each table belongs to a directory in HDFS. By default, it is /user/hive/warehouse directory. For instance, a table named students will be located at /user/hive/warehouse/students.

In this article we shall discuss the two types of tables present in Hive:

1. INTERNAL TABLE (Managed Table)

2. EXTERNAL TABLE

Internal Table

When a user creates a table in Hive it is by default an internal table created in the /user/hive/warehouse directory in HDFS which is its default storage location. The data present in the internal table will be stored in this directory and is fully managed by Hive and thus an internal table is also referred to as a managed table.

_{Creating a Table}

For creating an internal table, we use the following command:

CREATE TABLE IF NOT EXISTS student_internal
( 
  name string,
  class ARRAY,
  gender_age STRUCT,
  subj_score MAP
                 )
COMMENT ' Internal student table'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':'
STORED AS TEXTFILE;

As seen above, the internal table has 4 columns-

Column	Type
name	String
class	String array
gender_age	Struct (to hold different data types in one structure)
subj_score	MAP(to hold subjects and their scores)

By default, the table created in the hive is an internal table, so we need not specify an internal keyword while table creation.

In the above query, we use the “comment” keyword to refer to the table as “Internal student table” for our own reference, but you can comment it anything you want.

By using the keywords rows format delimited and fields terminated by (’|’) we inform Hive about the format of the data. The columns can be differentiated using the pipe symbol (|). This helps hive to understand how to read and write the data.

Map keys in the ‘subj_score’ column can be differentiated by the (:) symbol and the collection items in the ‘gender_age’ struct is differentiated using the (,) symbol.

Now that we have learned to create an internal table let’s see how we can load the data and populate it.

_{Loading the Data}

To load the data, we use the following command:

LOAD DATA LOCAL INPATH '/home/Hadoop/student.txt'
OVERWRITE INTO TABLE student_internal

Here we are importing the data from a file present in the local HDFS path-‘/home/Hadoop/student.txt’ and overwriting or we can say loading it into the student_internal table we created above.

_Storage

An internal table is stored on HDFS in the /user/hive/warehouse directory which is its default storage location. This location can be changed by updating the path in the configuration file present in the config file – hive.metastore.warehouse.dir.

We can also alter the location of the table by providing a new path present in HDFS using the SET LOCATION clause.

 ALTER TABLE student_internal
SET LOCATION
 'hdfs://localhost:8020/user/tables/student';

_Security

Hive is exclusively responsible for the security and management of the data present in the internal table.

If we delete the table using the drop command, then the metadata, as well as the data present in the table, will get deleted from the master node and HDFS respectively. So, it is strongly advised to never use the drop command on an internal table because it may be serving a production line of a business.

As a solution to this problem, we can use ENABLE NO_DROP clause to alter the table which will prevent the table from getting dropped.

> ALTER TABLE student_internal ENABLE NO_DROP;

If we want to prevent the data from being queried, then we can use the ENABLE ONLINE clause.

> ALTER TABLE student_internal ENABLE OFFLINE;

_Usage

We may use an internal table if:

1. Data is temporary and doesn’t affect businesses in real-time.

2. If we want the hive to manage the data and the tables.

External Table

When a user creates a table in Hive specifying the external keyword, then an external table is created. The data present in the external table will be fully managed by HDFS contrary to an internal table.

_{Creating a Table}

For creating an external table, we use the flowing command:

>CREATE EXTERNAL TABLE  student_external
(
 name string,
class ARRAY,
gender_age STRUCT,
subj_score MAP
            )
                    COMMENT ' External student table'
       ROW FORMAT DELIMITED
           FIELDS TERMINATED BY '|'
                    COLLECTION ITEMS TERMINATED BY ','
             MAP KEYS TERMINATED BY ':'
      STORED AS TEXTFILE;
                     LOCATION '/user/tables/students';

As seen above, the external table has 4 columns-

Column	Type
name	String
class	String array
gender_age	Struct (to hold different data types in one structure)
subj_score	MAP(to hold subjects and their scores)

External table creation query has the same intuition and syntax as the internal table.

Note that while creating an external table we specify the keyword EXTERNAL to inform Hive to create an external table for us.

Now that we have learned to create an external table let’s see how we can load the data and populate it.

_{Loading the Data}

To load the data, we use the following command:

>LOAD DATA LOCAL INPATH '/home/Hadoop/student.txt'
 OVERWRITE INTO TABLE student_external

Similar to the internal table, we are importing the data from a file present in the local HDFS path-‘/home/Hadoop/student.txt’ and overwriting or we can say loading it into the student_external table we created above.

_Storage

An external table is stored on HDFS or any storage compatible with HDFS, because we want to use the data outside of Hive. Thus, Hive is not responsible for managing the storage of the external table. Tables can be stored on an external location for instance on a cloud platform like google cloud or AWS.

_Security

External tables and the source files are linked but not entirely. There is some level of abstraction between the external table and the source files.

Let me explain through an example, suppose there is an external table named ‘ext_student’ in Hive which is accessing the data through the source file named ‘studentdb.txt’, then if I delete the table ‘ext_student’ then the file linked to it ‘studentdb.txt’ will not get deleted from HDFS.

So, by deleting the table from Hive, only the metadata from the master node gets deleted and not the actual source file.

Now, this presents an interesting situation as we are no longer able to query the data as the schema gets deleted when we drop the table, so to get back the ability to query the data we can simply create the same table again using the table creation commands and point it to the same location as it was previously and we can query the data seamlessly again.

Security of the external tables is managed at the HDFS level as anyone having access to the HDFS file structure can access an external table.

Usage

We may use an external table if:

1. We want to use data outside HIVE for performing a different operations such as loading and merging.

2. The data is of production quality.

ENDNOTES

In this article, we have learned to differentiate between internal and external tables in Hive on various dimensions namely storage, security and usage.

If you have any questions related to this article do let me know in the comments section below.

Read more articles on Hive on our blog. Click here.

harshitn

Beginner Data Cleaning Data Engineering Data Warehouse Database

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

Reading list

Intoduction to Python

Variables and data types

OOPs Concepts

Conditional statement

Looping Constructs

Data Structures

String Manipulation

Functions

Modules, Packages and Standard Libraries

Python Libraries for Data Science

Reading Data Files in Python

Preprocessing, Subsetting and Modifying Pandas Dataframes

Sorting and Aggregating Data in Pandas

Visualizing Patterns and Trends in Data

Programming

HIVE: INTERNAL AND EXTERNAL TABLES

INTRODUCTION

Internal Table

Creating a Table

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Congratulations, You Did It!

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B

Google (7)

_gid

_ga_#

_gat_#

collect

AEC

G_ENABLED_IDPS

test_cookie

Webengage (2)

_we_us

WebKlipperAuth

LinkedIn (16)

ln_or

JSESSIONID

li_rm

AnalyticsSyncHistory

lms_analytics

liap

visit

li_at

s_plt

lang

s_tp

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

s_pltp

s_tslv

li_theme

li_theme_set

Google (11)

_gcl_au

SID

SAPISID

__Secure-#

APISID

SSID

_{Creating a Table}