The Big Picture

This lesson gives the reader a new perspective on HDFS.

In this lesson, we’ll discuss the architecture of HDFS, its goals, and its limitations. The Hadoop Distributed File System (HDFS) was designed with the following goals in mind:

  • Large files: The system should store large files ranging in size from several hundred gigabytes to petabytes.

  • Streaming data access: HDFS is optimized and built for a write-once, read-many-times pattern. The time to read the entire dataset matters more than the latency of reading the first record. HDFS does not support multiple concurrent writers, and existing files can only be appended to at the very end; modifying a file at an arbitrary offset is not possible (see the sketch after this list).

  • Commodity hardware: Hadoop is designed to run on clusters of cheap commodity hardware rather than expensive, specialized machines. On such clusters the chance of hardware failure is high, yet the system is expected to keep working correctly. In keeping with that view, HDFS is highly fault-tolerant and designed to be deployed on low-cost hardware.
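
To make the write-once, append-only model concrete, here is a minimal sketch using the Hadoop FileSystem Java API. The cluster URI and file path are hypothetical placeholders; a real client would normally pick up fs.defaultFS from its core-site.xml.

    // Minimal sketch of HDFS's write-once / append-only access pattern.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    import java.nio.charset.StandardCharsets;

    public class AppendOnlyExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical namenode address; normally read from core-site.xml.
            conf.set("fs.defaultFS", "hdfs://namenode:8020");

            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/user/demo/events.log");  // hypothetical path

            // Write once: create the file and stream records into it.
            try (FSDataOutputStream out = fs.create(file, /* overwrite */ true)) {
                out.write("first record\n".getBytes(StandardCharsets.UTF_8));
            }

            // Later writes can only append to the end of the existing file.
            try (FSDataOutputStream out = fs.append(file)) {
                out.write("appended record\n".getBytes(StandardCharsets.UTF_8));
            }

            // There is no call for modifying bytes at an arbitrary offset, and
            // only one writer may have the file open for writing at a time.
            fs.close();
        }
    }

Note that append() only ever adds bytes to the end of the file, which is exactly the limitation described above.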

Working of HDFS

A filesystem, distributed or local, must know the location of the disk blocks that make up a file; only then can it retrieve those blocks for a client. The filesystem must also return a file's metadata to the client. These requirements are reflected in the two software daemons that make up HDFS (a short client-side sketch follows the list below):

  • Namenode (NN)
  • Datanode (DN)
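
As a rough illustration of this split of responsibilities, the sketch below asks the filesystem client for a file's metadata and block locations, both of which are served by the namenode; the block contents themselves would later be streamed from datanodes. The file path is a hypothetical placeholder.

    // Minimal sketch: fetch file metadata and block locations from the namenode.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationsExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/user/demo/events.log");  // hypothetical path

            // File metadata (size, block size, replication) comes from the namenode.
            FileStatus status = fs.getFileStatus(file);
            System.out.printf("len=%d blockSize=%d replication=%d%n",
                    status.getLen(), status.getBlockSize(), status.getReplication());

            // For each block, the namenode reports which datanodes hold a replica.
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        block.getOffset(), block.getLength(),
                        String.join(",", block.getHosts()));
            }
            fs.close();
        }
    }

Similar information can be inspected from the command line with hdfs fsck <path> -files -blocks -locations.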
