Hadoop Ecosystem
Explore the Hadoop ecosystem and learn about its core components such as HDFS for distributed storage and YARN for resource management. Understand how tools like Apache Pig, Hive, Mahout, and others support big data processing, querying, and machine learning in large-scale environments.
What is Hadoop?
Hadoop is an open-source software framework for storing and processing big data across large clusters of commodity hardware. Its design was inspired by Google's papers on the Google File System and MapReduce. Hadoop is written mainly in the Java programming language.
Components of Hadoop
While setting up a Hadoop cluster for big data processing, two services are mandatory:
- HDFS (Hadoop Distributed File System) for storing data.
- YARN (Yet Another Resource Negotiator) for managing cluster resources and scheduling the jobs that process the data stored in HDFS.
Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System consists of name nodes, which hold the file system metadata, and data nodes, which store the actual data blocks.
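To make the split between name nodes and data nodes concrete, here is a minimal toy model in Python. It is only a sketch, not how HDFS is implemented: the class names `ToyNameNode` and `ToyDataNode`, the 4-byte block size, and the round-robin placement are all assumptions chosen for illustration. Real HDFS uses much larger blocks (128 MB by default in recent versions) and a rack-aware placement policy.

```python
from collections import defaultdict
from itertools import cycle

BLOCK_SIZE = 4  # bytes; deliberately tiny so the example splits a short file


class ToyDataNode:
    """Hypothetical data node: stores raw block contents keyed by block ID."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}


class ToyNameNode:
    """Hypothetical name node: holds only metadata, never the file contents."""

    def __init__(self, data_nodes, replication=2):
        # Assumes replication <= len(data_nodes) so replicas land on distinct nodes.
        self.data_nodes = data_nodes
        self.replication = replication
        self.file_blocks = defaultdict(list)  # path -> ordered list of block IDs
        self.block_locations = {}             # block ID -> names of nodes holding it
        self._next_start = cycle(range(len(data_nodes)))

    def write(self, path, data):
        """Split data into blocks and place each block on `replication` nodes."""
        for offset in range(0, len(data), BLOCK_SIZE):
            block_id = f"{path}#{offset // BLOCK_SIZE}"
            start = next(self._next_start)  # simple round-robin, an assumption
            targets = [
                self.data_nodes[(start + i) % len(self.data_nodes)]
                for i in range(self.replication)
            ]
            for node in targets:
                node.blocks[block_id] = data[offset:offset + BLOCK_SIZE]
            self.file_blocks[path].append(block_id)
            self.block_locations[block_id] = [node.name for node in targets]

    def read(self, path):
        """Look up block locations in the metadata, then fetch from data nodes."""
        by_name = {node.name: node for node in self.data_nodes}
        parts = []
        for block_id in self.file_blocks[path]:
            first_replica = self.block_locations[block_id][0]
            parts.append(by_name[first_replica].blocks[block_id])
        return b"".join(parts)


nodes = [ToyDataNode(f"dn{i}") for i in range(3)]
name_node = ToyNameNode(nodes, replication=2)
name_node.write("/demo.txt", b"hello world")
print(name_node.block_locations)
print(name_node.read("/demo.txt"))
```

The point of the sketch is the division of labor: the name node object never stores file bytes, only the mapping from files to blocks and from blocks to data nodes, while the data nodes hold the blocks themselves.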
Name node
It is the primary node that keeps track of all the data nodes in the Hadoop cluster. It records the metadata of the ...