Getting Started with HBase, a NoSQL Database

Devashree Madhugiri 19 May, 2022 • 5 min read

This article was published as a part of the Data Science Blogathon.

HBase is an open-source, non-relational, scalable, distributed database written in Java. It is developed as part of the Hadoop ecosystem and runs on top of HDFS, providing random, real-time read and write access to data. Instead of SQL, data is accessed through its APIs. HBase is modelled after Google's Bigtable.

Introduction to HBase

HBase is a database for storing and retrieving data with random access. In other words, we can write data as per our requirements and read it back again as needed. HBase works well with both structured and semi-structured data, which means it is possible to store structured data like relational tables alongside semi-structured data like tweets or log files. If the data is not too large, HBase can also handle unstructured data. It supports various data types and has a dynamic, flexible data model that does not restrict the kind of data to be stored. Data is stored as key-value pairs in a column-oriented layout, where column families group together related columns that are frequently accessed as a unit. HBase is also horizontally scalable.

When to use it?

Apache Hadoop on its own is not suitable for real-time analytics and hence might not always be the right framework choice for big data. HBase fills this gap: it is the right option when real-time querying of big data is required. For applications requiring random read-write operations on big data, HBase is an ideal solution, and it offers a set of powerful built-in APIs for pulling or pushing data. HBase also integrates nicely with Apache Hadoop's MapReduce for tasks requiring bulk operations (indexing, analytics, etc.). A common pattern is to use Hadoop (HDFS) as the static data repository while HBase stores the real-time data that needs further processing.

Features

Linear Scalability – It is a distributed database that runs on a cluster of commodity machines, and capacity grows roughly linearly as more region servers are added to the cluster.

High Throughput – Writes are buffered in memory and appended sequentially to a write-ahead log, which gives HBase high write throughput.

Automatic Sharding – Tables are automatically split into regions, and regions are redistributed across region servers by the system once they grow beyond a configured size threshold. Auto-sharding is this combination of splitting and serving regions.

Atomic Read and Write – Atomicity means an operation either happens completely or not at all. HBase guarantees atomic reads and writes at the row level, so a concurrent reader never observes a partially applied update to a row.

Real-time and Random big data access – It serves random reads and writes in real time and stores data internally using a log-structured merge (LSM) tree. This storage layout periodically merges smaller files into larger ones, which keeps the number of files and the overall disk usage under control.

Built-in MapReduce support – Stored data can be processed quickly and in parallel through built-in integration with the Hadoop MapReduce framework.

API Support – It provides strong Java API support (client/server) for easy development and programming; a minimal client sketch follows this feature list.

Shell Support – It provides a command-line tool to interact with HBase and perform simple operations like creating a table, adding data, etc.

Sparse & multidimensional database – It is a sparse, multidimensional, sorted map-based database that supports multiple versions of the same record.

Snapshot support – It allows you to take snapshots of a table's metadata so that the table can later be restored or cloned to that prior state.
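
As a reference for the API support mentioned above, here is a minimal sketch of a random write and read using the HBase Java client API. It assumes an HBase 2.x client on the classpath, a reachable cluster configured via hbase-site.xml, and an existing table named users with a column family info (both names are hypothetical).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath for cluster/ZooKeeper settings.
        Configuration conf = HBaseConfiguration.create();

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {  // hypothetical table

            // Write: put a single cell into row "row1", column family "info".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Read: random access to the same row by key.
            Get get = new Get(Bytes.toBytes("row1"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(value));
        }
    }
}
```

Closing the Connection (here via try-with-resources) releases the underlying ZooKeeper and region-server connections.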

Architecture

HBase Architecture consists mainly of four components:

  1. HMaster
  2. HRegions
  3. HRegionServer
  4. ZooKeeper

HMaster

In HBase, HMaster is the implementation of the Master server. It acts as a monitoring agent for all Region Server instances in the cluster and as the interface for all metadata changes. In a distributed cluster, the Master typically runs on the NameNode and manages a number of background threads. HMaster is responsible for the following functions in HBase.

  • HMaster handles administrative operations and distributes services across region servers.
  • HMaster assigns regions to region servers.
  • It plays a critical role in cluster performance and node maintenance.
  • HMaster controls load balancing and failover to distribute the load between the cluster’s nodes.
  • HMaster carries out any schema or metadata changes that a client requests.

The HMaster interface exposes methods that are largely metadata-focused, such as the following (a client-side sketch follows the list):

  • Table – createTable, removeTable, enable, disable
  • ColumnFamily – addColumn, modifyColumn
  • Region – move, assign
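
Clients do not call the HMaster interface directly; these metadata operations are exposed through the Java Admin API, which forwards them to HMaster. The sketch below shows roughly how they map, assuming an HBase 2.x client; the table name events and the column family names d and meta are hypothetical.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class AdminSketch {
    public static void main(String[] args) throws Exception {
        try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = connection.getAdmin()) {

            TableName tableName = TableName.valueOf("events");  // hypothetical table

            // Table-level metadata operation: create a table with one column family.
            admin.createTable(TableDescriptorBuilder.newBuilder(tableName)
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("d"))
                    .build());

            // ColumnFamily-level operation: add a second column family.
            admin.addColumnFamily(tableName, ColumnFamilyDescriptorBuilder.of("meta"));

            // Table-level operations: disable and enable the table.
            admin.disableTable(tableName);
            admin.enableTable(tableName);
        }
    }
}
```

Disabling a table is required before dropping it or making certain structural changes, which is why both the shell and the Admin API expose enable/disable operations.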

The client communicates in both directions with HMaster and ZooKeeper, but it talks directly to the HRegion servers for read and write activities. HMaster assigns regions to region servers and also checks the health of the region servers, of which there can be many across the cluster. Each region server keeps an HLog (write-ahead log) that stores all its log records.

HBase Region Servers

When an HBase Region Server gets a write or read request from a client, it routes the request to the region that holds the relevant row and column family. The client can communicate directly with HRegion servers; no HMaster involvement is required for reads and writes. The client only needs HMaster for metadata and schema changes.

HRegionServer


HRegionServer is the Region Server implementation. It is in charge of serving and managing regions, i.e. the data, in a distributed cluster. The region servers are hosted on the Data Nodes of the Hadoop cluster. An HMaster coordinates many HRegion servers, each of which performs the following tasks (a read-path sketch follows the list).

  • Region hosting and management
  • Automatic region splitting
  • Processing read and write requests
  • Direct communication with the client
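
To illustrate the read path handled by region servers, the sketch below scans a row-key range within a single column family; the client library sends the request only to the region servers hosting regions that overlap that range. It assumes an HBase 2.x client and the same hypothetical events table and d column family as above.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanSketch {
    public static void main(String[] args) throws Exception {
        try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = connection.getTable(TableName.valueOf("events"))) {  // hypothetical table

            // Restrict the scan to one column family and a row-key range;
            // only regions overlapping this range are contacted.
            Scan scan = new Scan()
                    .addFamily(Bytes.toBytes("d"))
                    .withStartRow(Bytes.toBytes("2022-05-01"))
                    .withStopRow(Bytes.toBytes("2022-05-02"));

            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            }
        }
    }
}
```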

HBase Regions

HRegions are the basic building blocks of an HBase cluster: each table is split into regions, and each region holds a contiguous range of rows. A region contains one store per column family, and each store is primarily made up of two components:

  • MemStore
  • HFile

ZooKeeper

HBase uses ZooKeeper as a centralized service that keeps configuration data and offers distributed synchronization. Distributed synchronization here means coordinating the distributed applications running throughout the cluster and providing coordination services across its nodes.

Clients locate regions via ZooKeeper. ZooKeeper is an open-source Apache project that provides several essential services.

ZooKeeper Services

  • Keeps configuration information
  • Establishes client communication with region servers through distributed synchronization
  • Provides ephemeral nodes that represent live region servers
  • Lets master servers discover available region servers in the cluster via these ephemeral nodes
  • Keeps track of server failures and network partitions

The master and slave HBase nodes (region servers) register themselves with ZooKeeper. A client needs the ZooKeeper quorum (zkQuorum) settings to connect to the master and the region servers. When a node fails, ZooKeeper detects the lost session and notifies the HMaster, which then reassigns the failed server’s regions.
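
As a concrete illustration of the quorum settings, the sketch below shows how a client can point its configuration at the ZooKeeper ensemble explicitly instead of relying on hbase-site.xml. The host names and port are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class QuorumSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // ZooKeeper quorum: comma-separated hosts of the ZooKeeper ensemble
        // (placeholder host names) and the client port they listen on.
        conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com,zk3.example.com");
        conf.set("hbase.zookeeper.property.clientPort", "2181");

        // The client contacts ZooKeeper first to locate the meta table,
        // then talks to region servers directly for reads and writes.
        try (Connection connection = ConnectionFactory.createConnection(conf)) {
            System.out.println("Connected: " + !connection.isClosed());
        }
    }
}
```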

Applications

HBase finds its use in domains where –

  • a massive amount of non-relational data (petabytes) with variable schema is a regular feature.
  • the data needs to be accessed with low latency.
  • HDFS support is available along with a large number of nodes.

Now, let us look at some important applications of HBase.

Sports: It is used in sports to store match history for improved analytics and prediction.

Medical: It is used in the medical industry for storing genomic sequences to record the sickness history of people and areas.

E-commerce: It is used in e-commerce to capture and store consumer logs, including search history, for analytics and subsequent targeted advertising.

Web: It is used to maintain user history and preferences in order to improve consumer targeting.

Apart from the above, there could be multiple applications of HBase in industries like Oil and Petroleum, Marketing and Advertising, Banking, Stock Market, and more.

Conclusion

We have briefly covered what HBase is, situations where it is used, and HBase features and architecture.

In conclusion, here are some of the key takeaways from the article –

  • HBase has proven to be a powerful addition to existing Hadoop environments.
  • HBase is a popular NoSQL database with high throughput and low latency.
  • Since its release, HBase has garnered developer support from other companies and has been adopted for production deployments.
  • As of now, HBase boasts strong developer and user communities.
  • It is a top-level Apache project that has become a core infrastructure component, run at production scale worldwide in several large organizations such as Facebook, Twitter, Salesforce, Trend Micro, and Adobe.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
