Hadoop Distributed File System (HDFS) Architecture – A Guide to HDFS for Every Data Engineer
- Get familiar with Hadoop Distributed File System (HDFS)
- Understand the Components of HDFS
In contemporary times, it is commonplace to deal with massive amounts of data. From your next WhatsApp message to your next Tweet, you are creating data at every step when you interact with technology. Now multiply that by 4.5 billion people on the internet – the math is simply mind-boggling!
But ever wondered how to handle such data? Is it stored on a single machine? What if the machine fails? Will you lose your lovely 3 AM tweets *cough*?
The answer is No. I am pretty sure you are already thinking about Hadoop. Hadoop is an amazing framework. With Hadoop by your side, you can leverage the powers of the Hadoop Distributed File System (HDFS) – the storage component of Hadoop. It is probably the most important component of Hadoop and demands a detailed explanation.
So, in this article, we will learn what Hadoop Distributed File System (HDFS) really is and about its various components. Also, we will see what makes HDFS tick – that is what makes it so special. Let’s find out!
Table of Contents
- What is Hadoop Distributed File System (HDFS)?
- What are the components of HDFS?
- Blocks in HDFS
- Namenode in HDFS
- Datanodes in HDFS
- Secondary Namenode in HDFS
- Replication Management
- Replication of Blocks
- What is a Rack in Hadoop?
- Rack Awareness
What is Hadoop Distributed File System (HDFS)?
It is difficult to maintain huge volumes of data in a single machine. Therefore, it becomes necessary to break down the data into smaller chunks and store it on multiple machines.
Filesystems that manage the storage across a network of machines are called distributed file systems.
Hadoop Distributed File System (HDFS) is the storage component of Hadoop. All data stored on Hadoop is stored in a distributed manner across a cluster of machines. But it has a few properties that define its existence.
- Huge volumes – Being a distributed file system, it is highly capable of storing petabytes of data without any glitches.
- Data access – It is based on the philosophy that “the most effective data processing pattern is write-once, the read-many-times pattern”.
- Cost-effective – HDFS runs on a cluster of commodity hardware. These are inexpensive machines that can be bought from any vendor.
What are the components of the Hadoop Distributed File System (HDFS)?
Broadly speaking, HDFS has two main components – data blocks and the nodes storing those data blocks. But there is more to it than meets the eye. So, let’s look at these one by one to get a better understanding.
Blocks in HDFS
HDFS breaks down a file into smaller units. Each of these units is stored on a different machine in the cluster. This, however, is transparent to the user working on HDFS. To them, it looks as if all the data is stored on a single machine.
These smaller units are the blocks in HDFS. The size of each of these blocks is 128MB by default, and you can easily change it according to your requirements. So, if you had a file of size 512MB, it would be divided into 4 blocks storing 128MB each.
If, however, you had a file of size 524MB, then, it would be divided into 5 blocks. 4 of these would store 128MB each, amounting to 512MB. And the 5th would store the remaining 12MB. That’s right! This last block won’t take up the complete 128MB on the disk.
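The block arithmetic above can be sketched in a few lines of Python (a conceptual illustration of how HDFS divides a file, not part of any HDFS API):

```python
BLOCK_SIZE_MB = 128  # HDFS default block size

def split_into_blocks(file_size_mb):
    """Return the list of block sizes a file would occupy in HDFS."""
    full_blocks, remainder = divmod(file_size_mb, BLOCK_SIZE_MB)
    blocks = [BLOCK_SIZE_MB] * full_blocks
    if remainder:
        blocks.append(remainder)  # the last block only takes up what it needs
    return blocks

print(split_into_blocks(512))  # [128, 128, 128, 128]
print(split_into_blocks(524))  # [128, 128, 128, 128, 12]
```

Note how the 524MB file ends with a 12MB block – the last block does not occupy the full 128MB on disk.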
But, you must be wondering, why such a huge amount in a single block? Why not multiple blocks of 10KB each? Well, the amount of data we generally deal with in Hadoop is usually in the order of petabytes or higher.
Therefore, if we create blocks of small size, we would end up with a colossal number of blocks. This would mean we would have to deal with equally large metadata regarding the location of the blocks which would just create a lot of overhead. And we don’t really want that!
There are several perks to storing data in blocks rather than saving the complete file.
- The file itself would be too large to store on any single disk alone. Therefore, it is prudent to spread it across different machines on the cluster.
- It would also enable a proper spread of the workload and prevent the choke of a single machine by taking advantage of parallelism.
Now, you must be wondering, what about the machines in the cluster? How do they store the blocks and where is the metadata stored? Let’s find out.
Namenode in HDFS
HDFS operates in a master-worker architecture; this means that there is one master node and several worker nodes in the cluster. The master node is the Namenode.
Namenode is the master node that runs on a separate node in the cluster.
- Manages the filesystem namespace which is the filesystem tree or hierarchy of the files and directories.
- Stores information like owners of files, file permissions, etc for all the files.
- It is also aware of the locations of all the blocks of a file and their size.
All this information is maintained persistently over the local disk in the form of two files: Fsimage and Edit Log.
- Fsimage stores the information about the files and directories in the filesystem. For files, it stores the replication level, modification and access times, access permissions, blocks the file is made up of, and their sizes. For directories, it stores the modification time and permissions.
- Edit log, on the other hand, keeps track of all the write operations that the client performs. These edits are also applied to the in-memory metadata so that read requests are served from an up-to-date view.
Whenever a client wants to write information to HDFS or read information from HDFS, it connects with the Namenode. The Namenode returns the location of the blocks to the client and the operation is carried out.
Yes, that’s right, the Namenode does not store the blocks. For that, we have separate nodes.
Datanodes in HDFS
Datanodes are the worker nodes. They are inexpensive commodity hardware that can be easily added to the cluster.
Datanodes are responsible for storing, retrieving, replicating, and deleting blocks when asked by the Namenode.
They periodically send heartbeats to the Namenode so that it is aware of their health. With that, a DataNode also sends a list of blocks that are stored on it so that the Namenode can maintain the mapping of blocks to Datanodes in its memory.
But in addition to these two types of nodes in the cluster, there is also another node called the Secondary Namenode. Let’s look at what that is.
Secondary Namenode in HDFS
Suppose we need to restart the Namenode, which can happen in case of a failure. This would mean copying the Fsimage from disk to memory. We would also have to apply the transactions from the latest Edit log to the Fsimage to recover all the changes. But if we restart the node after a long time, the Edit log could have grown in size, and applying its transactions would take a lot of time. During this time, the filesystem would be offline. Therefore, to solve this problem, we bring in the Secondary Namenode.
Secondary Namenode is another node present in the cluster whose main task is to regularly merge the Edit log with the Fsimage and produce checkpoints of the primary’s in-memory file system metadata. This is also referred to as Checkpointing.
But the checkpointing procedure is computationally very expensive and requires a lot of memory, which is why the Secondary namenode runs on a separate node on the cluster.
However, despite its name, the Secondary Namenode does not act as a Namenode. It is merely there for Checkpointing and keeping a copy of the latest Fsimage.
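Checkpointing can be pictured as replaying the Edit log onto the last Fsimage to produce a fresh one, after which the old log can be discarded. Again, this is a conceptual sketch, not the real checkpoint protocol:

```python
def checkpoint(fsimage, edit_log):
    """Merge edit-log operations into a copy of the Fsimage."""
    new_image = dict(fsimage)
    for op, path, meta in edit_log:
        if op == "create":
            new_image[path] = meta
        elif op == "delete":
            new_image.pop(path, None)
    return new_image

fsimage = {"/data/a": {"replication": 3}}
edits = [("create", "/data/b", {"replication": 3}),
         ("delete", "/data/a", {})]
new_fsimage = checkpoint(fsimage, edits)
edits = []  # the old edit log can now be discarded
print(sorted(new_fsimage))  # ['/data/b']
```

Because this merge is done on a separate node, the primary Namenode is free to keep serving requests.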
Replication Management in HDFS
Now, one of the best features of HDFS is the replication of blocks which makes it very reliable. But how does it replicate the blocks and where does it store them? Let’s answer those questions now.
Replication of Blocks
HDFS is a reliable storage component of Hadoop. This is because every block stored in the filesystem is replicated on different Data Nodes in the cluster. This makes HDFS fault-tolerant.
The default replication factor in HDFS is 3. This means that every block will have two more copies of it, each stored on separate DataNodes in the cluster. However, this number is configurable.
But you must be wondering – doesn’t that mean we are taking up too much storage? For instance, if we have 5 blocks of 128MB each, that amounts to 5 * 128 * 3 = 1920 MB. True. But then these nodes are commodity hardware. We can easily scale the cluster to add more of these machines. The cost of buying machines is much lower than the cost of losing the data!
Now, you must be wondering, how does Namenode decide which Datanode to store the replicas on? Well, before answering that question, we need to have a look at what is a Rack in Hadoop.
What is a Rack in Hadoop?
A Rack is a collection of machines (30-40 in Hadoop) that are stored in the same physical location. There are multiple racks in a Hadoop cluster, all connected through switches.
Replica storage is a tradeoff between reliability and read/write bandwidth. Storing block replicas on different racks and Datanodes increases fault tolerance, while the write bandwidth cost is lowest when all replicas are stored on the same node. Therefore, Hadoop has a default strategy to deal with this conundrum, known as the Rack Awareness algorithm.
For example, if the replication factor for a block is 3, then the first replica is stored on the same Datanode on which the client writes. The second replica is stored on a randomly chosen Datanode on a different rack. The third replica is stored on the same rack as the second but on a different Datanode, again chosen randomly. If the replication factor were higher, the subsequent replicas would be stored on random Datanodes in the cluster.
I hope by now you have got a solid understanding of what Hadoop Distributed File System (HDFS) is, what its important components are, and how it stores the data. There are, however, still a few more concepts that we need to cover with respect to Hadoop Distributed File System (HDFS), but that is a story for another article.
For now, I recommend you go through the following articles to get a better understanding of Hadoop and this Big Data world!
Last but not least, I recommend reading Hadoop: The Definitive Guide by Tom White, which highly inspired this article.