What is Apache Hadoop?

Overview

As data volumes grew, traditional frameworks and hardware proved insufficient to store and process them. Hadoop grew out of work by Doug Cutting and Mike Cafarella in the mid-2000s. Yahoo was an early major contributor and, by 2008, was running Hadoop in production at large scale. Hadoop is an open-source framework built for big data: it stores data in a distributed file system and lets users process that data in parallel across a cluster of machines.

Hadoop has four main modules:

Hadoop Common
Hadoop Distributed File System (HDFS)
Hadoop MapReduce
Hadoop Yet Another Resource Negotiator (YARN)

2. Hadoop Distributed File System (HDFS)

HDFS is Hadoop's storage layer. It presents the disks of many servers as a single file system, so clients can store and fetch large amounts of data without tracking which machine holds what. Each file is split into fixed-size blocks before it is stored across the cluster.

HDFS has two main components: the NameNode and the DataNodes. The NameNode is the master node. It stores the file system metadata, including the locations of the blocks of data held on the different machines. The DataNodes store the blocks themselves. Each block is replicated on several DataNodes so that a copy survives if a node fails.
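The split between metadata and block storage can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: `ToyCluster`, its methods, and the round-robin placement are hypothetical names and choices made for illustration, not the HDFS API. Real HDFS uses much larger blocks (128 MB by default in recent versions), a default replication factor of 3, and rack-aware placement.

```python
class ToyCluster:
    """Toy stand-in for an HDFS cluster (hypothetical; not the real HDFS API)."""

    def __init__(self, datanode_names, replication=3):
        self.names = list(datanode_names)
        self.replication = min(replication, len(self.names))
        # NameNode role: metadata only, (filename, block index) -> DataNode names.
        self.block_map = {}
        # DataNode role: each node holds the bytes of the blocks assigned to it.
        self.datanodes = {name: {} for name in self.names}

    def put(self, filename, data, block_size):
        """Split data into blocks and store each block on several DataNodes."""
        blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
        for index, block in enumerate(blocks):
            # Assumption: simple round-robin placement, chosen for brevity.
            targets = [self.names[(index + r) % len(self.names)]
                       for r in range(self.replication)]
            for name in targets:
                self.datanodes[name][(filename, index)] = block
            self.block_map[(filename, index)] = targets

    def get(self, filename, failed=()):
        """Reassemble a file, skipping any DataNodes listed as failed."""
        parts, index = [], 0
        while (filename, index) in self.block_map:
            live = [n for n in self.block_map[(filename, index)] if n not in failed]
            if not live:
                raise IOError(f"all replicas of block {index} are unavailable")
            parts.append(self.datanodes[live[0]][(filename, index)])
            index += 1
        return b"".join(parts)


cluster = ToyCluster(["dn1", "dn2", "dn3", "dn4"], replication=3)
cluster.put("report.txt", b"hello hadoop world", block_size=4)
# The file is still readable with one node down, because every block has replicas.
print(cluster.get("report.txt", failed={"dn1"}))
```

Keeping `block_map` separate from `datanodes` mirrors the idea that the NameNode knows where blocks live but does not hold the data itself.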

3. MapReduce

MapReduce processes data with two kinds of tasks, map and reduce, which run on worker nodes. A master node sits above the workers and schedules tasks on them. Map tasks fetch raw data from the file system, turn it into key-value pairs, partition those pairs by key, and write them to a buffer. Reduce tasks fetch the mapped data from the buffer and aggregate it using the reduce function that the job defines. By the end of the job, the data is organized into a form that is easier to read and manage.
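The flow described above can be sketched as a single-process word count. This is a minimal illustration, not Hadoop's MapReduce API: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical, and everything runs in memory on one machine, whereas Hadoop distributes the map and reduce tasks across workers.

```python
from collections import defaultdict


def map_phase(line):
    """Map: turn one line of raw input into (word, 1) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group mapped values by key, standing in for the buffer between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: aggregate all values seen for one key."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


print(word_count(["big data big", "data Hadoop"]))
```

Because each call to `map_phase` depends only on its own line and each call to `reduce_phase` depends only on its own key, both phases can in principle run in parallel, which is the property MapReduce relies on.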

4. Hadoop Yet Another Resource Negotiator (YARN)

YARN is Hadoop’s resource manager and job scheduler. It was introduced in Hadoop 2.0 to separate the resource management layer from the processing layer. YARN has three main components: the Client, the Resource Manager, and the Node Manager:

Client forwards jobs to the Resource Manager.
Resource Manager is YARN’s master daemon and governs cluster-wide activity. It schedules tasks and allocates resources across the Node Managers, then accepts applications and monitors their health so that it can restart them if the need arises.
Node Manager takes care of an individual node in the YARN architecture. It monitors the node’s health, manages its resources, logs the necessary metrics, and reports to the Resource Manager.
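The division of labor between the three components can be sketched with a toy model. All names here (`ToyNodeManager`, `ToyResourceManager`, `submit`, `report`) are hypothetical and exist only for illustration. The "most free memory" placement rule is also an assumption: real YARN uses pluggable schedulers and runs a per-application ApplicationMaster, neither of which is modeled.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNodeManager:
    """Tracks the resources of one node and reports them upward."""
    name: str
    free_memory_mb: int
    containers: list = field(default_factory=list)

    def report(self):
        return {"node": self.name,
                "free_memory_mb": self.free_memory_mb,
                "containers": len(self.containers)}


class ToyResourceManager:
    """Accepts jobs from a client and places them on a node with capacity."""

    def __init__(self, node_managers):
        self.node_managers = list(node_managers)

    def submit(self, app_id, memory_mb):
        # Assumed policy: choose the node with the most free memory that fits.
        candidates = [nm for nm in self.node_managers
                      if nm.free_memory_mb >= memory_mb]
        if not candidates:
            return None  # no node can host the request right now
        node = max(candidates, key=lambda nm: nm.free_memory_mb)
        node.free_memory_mb -= memory_mb
        node.containers.append(app_id)
        return node.name


# The "client" side: forward jobs to the Resource Manager.
rm = ToyResourceManager([ToyNodeManager("node-a", 4096),
                         ToyNodeManager("node-b", 2048)])
for app_id, memory_mb in [("app-1", 3000), ("app-2", 2000), ("app-3", 2000)]:
    print(app_id, "->", rm.submit(app_id, memory_mb))
print([nm.report() for nm in rm.node_managers])
```

A request that no node can satisfy returns `None` in this sketch. A real scheduler would typically queue the request until resources free up.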