Prashant Sharma — Published On August 1, 2022 and Last Modified On August 8th, 2022
Beginner Data Engineering Interview Questions

This article was published as a part of the Data Science Blogathon.

Introduction

HBase is a column-oriented, non-relational database management system that runs on top of the Hadoop Distributed File System (HDFS). HBase provides a fault-tolerant way of storing sparse data sets, which are common in many big data use cases. It is well suited to real-time data processing and random read/write access to large volumes of data. Unlike relational databases, HBase does not offer a structured query language such as SQL.


HBase is modeled after Google's Bigtable and makes it easy to access large amounts of structured data quickly. It comprises a set of tables that store data in a key-value format. Programmers can use HBase's APIs from the programming language of their choice. As part of the Hadoop ecosystem, HBase provides real-time read and write access to data in the Hadoop File System.

Data may be stored in HDFS either directly or through HBase. The data consumer uses HBase to read and access data in HDFS at random. HBase sits on top of the Hadoop File System and provides read and write access to it.

Features

  • Horizontally scalable: any number of columns can be added at any moment.

  • A distributed, multidimensional sorted map, indexed by row key, column key, and timestamp.

  • In the case of a system failure, automatic failover allows data handling to be transitioned to a standby system automatically.

  • Built on top of the Hadoop Distributed File System, and integrates with Hadoop MapReduce for bulk processing jobs.

  • Frequently described as a key-value store or column-family-oriented database, and sometimes as a store of versioned maps of maps.

  • It is basically a system for storing and retrieving data with random access.

  • It does not impose relationships between data elements.

  • It is intended to run on a cluster of commodity hardware-based computers.
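The "multidimensional sorted map" in the list above can be illustrated with a short sketch. The following toy Python class (all names are illustrative; this is not the HBase API) models a table as a map keyed by row key, column key, and timestamp, where a read returns the newest version by default:

```python
from collections import defaultdict

class TinyHBaseTable:
    """Toy model of HBase's data model: a map keyed by
    (row key, column key, timestamp) -> value. Not real HBase."""

    def __init__(self):
        # row -> column -> {timestamp: value}
        self.cells = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, timestamp):
        self.cells[row][column][timestamp] = value

    def get(self, row, column):
        """Return the most recent version of a cell, as HBase does by default."""
        versions = self.cells[row][column]
        if not versions:
            return None
        latest = max(versions)  # the highest timestamp wins
        return versions[latest]

table = TinyHBaseTable()
table.put("row1", "cf:name", "Alice", timestamp=1)
table.put("row1", "cf:name", "Bob", timestamp=2)
print(table.get("row1", "cf:name"))  # "Bob", the newer version
```

Real HBase adds column families, sorted on-disk storage, and distribution across servers on top of this basic idea.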

Interview Questions

1. What is Apache HBase’s purpose?

Apache HBase is used when random, real-time read/write access to Big Data is required. The objective of this project is to host tables with billions of rows and millions of columns on clusters of commodity hardware. Apache HBase is a distributed, versioned, non-relational, open-source database inspired by Google’s Bigtable: A Distributed Storage System for Structured Data by Chang et al. Apache HBase delivers Bigtable-like functionality on top of Hadoop and HDFS, much as Bigtable utilizes the distributed data storage provided by the Google File System.

2. What are the major elements of HBase?

Major elements of HBase are:

  • Zookeeper: It performs coordination work between the client and HBase Master.

  • HBase Master: HBase Master keeps an eye on the Region Server.

  • RegionServer: RegionServer is responsible for serving and managing regions.

  • Region: It contains both the in-memory data store (MemStore) and the on-disk store files (HFiles).

  • Catalog Tables: The catalog tables are -ROOT- and .META.

3. Examine the purpose of filters in HBase.

Filters were added in Apache HBase 0.92 to make it easier for users to access HBase through the Shell or Thrift. They handle your server-side filtering requirements, so filtering happens where the data lives rather than on the client. There are also decorating filters, which give you more control over the data produced by other filters. Here are some HBase filter examples:

  • Bloom Filter: A space-efficient way of determining whether an HFile contains a given row or cell; it is typically used for real-time queries.

  • Page Filter: It accepts the page size as a parameter and limits the result to that number of rows, which can optimize scans of individual HRegions.
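To see why a Bloom filter is space-efficient, here is a minimal Python sketch of the underlying idea (not HBase's implementation; the class and parameters are made up for illustration). It answers "definitely absent" or "possibly present" using only a small bit array, and never returns a false negative for a key that was added:

```python
import hashlib

class TinyBloomFilter:
    """Minimal Bloom filter sketch: a fixed-size bit array plus
    a few hash functions, instead of storing the keys themselves."""

    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, key):
        # Derive several bit positions per key from salted hashes.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos] for pos in self._positions(key))

bf = TinyBloomFilter()
bf.add("row-42")
print(bf.might_contain("row-42"))   # True: added keys are always found
print(bf.might_contain("row-999"))  # usually False; rare false positives occur
```

In HBase, a check like this lets a read skip an entire HFile without touching the disk when the row is definitely not in it.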


4. How does HBase handle a failed write?

In large distributed systems, failures are common, and HBase is no exception.

If the server hosting a MemStore that has not yet been flushed crashes, the data that was in memory but not yet persisted is lost. HBase prevents this by writing changes to a write-ahead log (WAL) before the write operation completes.

Every server in the HBase cluster maintains a WAL to record changes as they occur. The WAL is a file on the underlying file system. A write is not considered successful until the new WAL entry has been written successfully. This guarantee makes HBase as durable as the file system backing it, which is usually the Hadoop Distributed File System (HDFS). If a server fails, the data that has not yet been flushed from the MemStore to an HFile can be recovered by replaying the WAL.
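The write-ahead-log idea can be sketched in a few lines of Python. This toy example (not HBase code; all names are illustrative) records each change durably in a log before applying it in memory, then rebuilds the in-memory store after a "crash" by replaying the log:

```python
import json
import os
import tempfile

class TinyWAL:
    """Sketch of write-ahead logging: append the change to a log file
    (and flush it to disk) before applying it to the in-memory store."""

    def __init__(self, path):
        self.path = path
        self.memstore = {}

    def put(self, key, value):
        # 1. Durably record the intent first.
        with open(self.path, "a") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # 2. Only then apply the change in memory.
        self.memstore[key] = value

    @classmethod
    def recover(cls, path):
        """Rebuild the in-memory store after a crash by replaying the log."""
        store = cls.__new__(cls)
        store.path = path
        store.memstore = {}
        with open(path) as log:
            for line in log:
                entry = json.loads(line)
                store.memstore[entry["key"]] = entry["value"]
        return store

path = os.path.join(tempfile.mkdtemp(), "wal.log")
w = TinyWAL(path)
w.put("row1", "hello")
w.put("row2", "world")
# Simulate a crash: the object is gone, but the log file survives.
recovered = TinyWAL.recover(path)
print(recovered.memstore)  # {'row1': 'hello', 'row2': 'world'}
```

HBase follows the same ordering: WAL first, MemStore second, so anything the MemStore held at crash time can be replayed from the log.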

 

5. Describe deletion in HBase. What are the three types of tombstone markers supported by HBase?

When a cell is deleted in HBase, the data is not actually removed; instead, a tombstone marker is written, making the deleted cell invisible to reads. The deleted data, together with the tombstone markers, is physically removed during major compactions.

There are three types of tombstone markers:

  • Version delete marker: It identifies a single version of a column for deletion.

  • Column delete marker: It flags for deletion of every version of a column.

  • Family delete marker: It flags every column in a column family for deletion.
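The tombstone-then-compaction behavior can be sketched as follows. This toy Python class (illustrative only, not HBase internals) masks deleted versions with a marker on read and physically drops them during a simulated compaction:

```python
TOMBSTONE = object()  # sentinel standing in for a delete marker

class TinyVersionedColumn:
    """Sketch of delete-as-tombstone: a delete writes a marker instead
    of removing data; a 'compaction' pass drops the masked versions."""

    def __init__(self):
        self.versions = {}  # timestamp -> value (or TOMBSTONE)

    def put(self, value, ts):
        self.versions[ts] = value

    def delete(self, ts):
        # Column-style delete: mask every version at or before ts.
        self.versions[ts] = TOMBSTONE

    def get(self):
        if not self.versions:
            return None
        latest = max(self.versions)
        value = self.versions[latest]
        return None if value is TOMBSTONE else value

    def compact(self):
        """Physically remove versions masked by a tombstone, then the marker."""
        tombstones = [ts for ts, v in self.versions.items() if v is TOMBSTONE]
        for t in tombstones:
            for ts in [x for x in self.versions if x <= t]:
                del self.versions[ts]

col = TinyVersionedColumn()
col.put("v1", ts=1)
col.put("v2", ts=2)
col.delete(ts=3)
print(col.get())          # None: the tombstone hides both versions
col.compact()
print(len(col.versions))  # 0: compaction physically removed everything
```

This is why deleted data in HBase still occupies disk space until the next major compaction runs.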

6. How does HBase compare to Cassandra?

Cassandra and HBase are both NoSQL databases, a term that has several definitions. Typically, it indicates that the database cannot be manipulated with SQL. Nonetheless, Cassandra has implemented CQL (Cassandra Query Language), whose syntax is clearly modeled on SQL.

Both are designed to manage enormous data sets. According to the HBase documentation, an HBase database should contain hundreds of millions or, preferably, billions of rows; otherwise, you should stick with a relational database management system.

Both are distributed databases, not only in how data is stored but also in how it can be accessed: clients can connect to any node in the cluster and access any data.

HBase lacks native support for secondary indexes but provides a range of methodologies that enable secondary index functionality. These are outlined in the online reference guide for HBase and the HBase community.

7. What happens when the block size of a column family in a previously populated database is altered?

When you modify the block size of a column family, new data is written with the new block size, while old data remains in the old block size. During compaction, old data adopts the new block size. As new files are flushed, they use the new block size, and existing data continues to be read correctly. After the next major compaction, all data is converted to the new block size.

8. Why would you use HBase?

  • High storage capacity system

  • Distributed layout to accommodate big tables

  • Column-Oriented Stores

  • Horizontally Scalable

  • High performance and availability

  • HBase aims for at least millions of columns, thousands of versions, and billions of rows.

  • Unlike HDFS (Hadoop Distributed File System), it provides random, real-time CRUD operations.

9. What is the HBase standalone mode?

Standalone mode can be used when HBase does not need access to HDFS. It is the default mode in HBase, and users may enable it whenever they choose. In this mode, HBase uses the local file system rather than HDFS.

Using this mode can save a significant amount of time when carrying out certain key activities, such as local development and testing. In this mode, you may also set or remove various time constraints on the data.
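A minimal hbase-site.xml for standalone mode points hbase.rootdir at a local file:/// path instead of an HDFS URI (the directory below is only an example; adjust it for your machine):

```xml
<configuration>
  <!-- Standalone mode: store data on the local file system, not HDFS -->
  <property>
    <name>hbase.rootdir</name>
    <value>file:///home/user/hbase-data</value>
  </property>
  <!-- false (the default) means all daemons run in a single JVM -->
  <property>
    <name>hbase.cluster.distributed</name>
    <value>false</value>
  </property>
</configuration>
```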

10. Contrast HBase and Hive.

Hive enables SQL-savvy users to run MapReduce jobs. Since it is JDBC-compliant, it is also compatible with existing SQL-based applications. Because Hive queries scan all of a table's contents by default, their execution can be time-consuming. Nonetheless, Hive's partitioning feature can restrict the volume of data scanned. Partitioning enables a filter query to run over data stored in separate folders and read only the data that matches the query. It could be used, for instance, to process only files created between certain dates, if the file names include the date.

HBase works by storing data as key/value pairs. It provides four core operations: put adds or updates rows, scan retrieves a range of cells, get returns cells for a specified row, and delete removes rows, columns, or column versions. Versioning is available to retrieve past data values (the history can be purged periodically via HBase compactions to reclaim space). Although HBase contains tables, a schema is required only for tables and column families, not for individual columns, and increment/counter functionality is supported.
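The four core operations can be sketched with a toy key-value store in Python (illustrative names, not the HBase client API). Note how scan returns a sorted, stop-exclusive range of row keys, mirroring how HBase scans work over its sorted row space:

```python
from bisect import bisect_left

class TinyKeyValueStore:
    """Sketch of the four core operations over a sorted row-key space."""

    def __init__(self):
        self.rows = {}  # row key -> value

    def put(self, row, value):
        """Add or update a row."""
        self.rows[row] = value

    def get(self, row):
        """Return the value for one row, or None if absent."""
        return self.rows.get(row)

    def scan(self, start, stop):
        """Return rows in [start, stop), ordered by row key."""
        keys = sorted(self.rows)
        lo = bisect_left(keys, start)
        hi = bisect_left(keys, stop)
        return {k: self.rows[k] for k in keys[lo:hi]}

    def delete(self, row):
        """Remove a row."""
        self.rows.pop(row, None)

store = TinyKeyValueStore()
for i in range(5):
    store.put(f"row{i}", i)
print(store.get("row2"))           # 2
print(store.scan("row1", "row4"))  # {'row1': 1, 'row2': 2, 'row3': 3}
store.delete("row0")
print(store.get("row0"))           # None
```

Designing row keys so that related rows sort next to each other is what makes scans efficient in practice.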

In short, Hive is a SQL-like engine that runs MapReduce jobs on Hadoop, while HBase is a NoSQL key/value database that runs on Hadoop.

Conclusion

This article introduced HBase, a column-oriented non-relational database management system, and covered a variety of interview topics. I hope this information was useful and that you now feel better prepared for upcoming interviews. Here are some of the article's most salient points:

  • What is HBase, and what are its features?
  • The filters and modes available in HBase.
  • Comparisons of HBase with Hive and Cassandra, as well as many other topics at the beginner, intermediate, and advanced levels.

Please share your feedback about the article topic, Apache HBase, in the comments section below. Check out more interview questions articles here.


About the Author

Prashant Sharma

Currently, I am pursuing my Bachelor of Technology (B.Tech) at Vellore Institute of Technology. I am very enthusiastic about programming and its real-world applications, including software development, machine learning, and data science.
