Deep Dive into Hadoop and MapReduce
Learn more about the architecture, characteristics, and workflow of Hadoop.
Hadoop architecture
- Master Node: Also called the NameNode, the Master Node manages HDFS metadata, including the mapping of each file to the list of blocks it contains. The NameNode is a single point of failure, which is a significant downside. Beyond metadata management, the NameNode controls client read/write access to files, keeps track of the nodes in the cluster, their available disk space, and whether a node is dead, and uses this information to schedule block replications.
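To make the metadata concrete, here is a minimal sketch of the bookkeeping described above: a file-to-blocks mapping plus block locations built from block reports. The class and method names are illustrative, not Hadoop's actual API.

```python
# Illustrative sketch of NameNode-style metadata (not Hadoop's real API).
class NameNode:
    def __init__(self):
        self.file_to_blocks = {}   # file path -> ordered list of block IDs
        self.block_locations = {}  # block ID -> set of data nodes holding a replica
        self.live_nodes = set()    # data nodes considered alive

    def add_file(self, path, block_ids):
        # Map a file to the blocks it is split into.
        self.file_to_blocks[path] = list(block_ids)

    def record_block_report(self, node, block_ids):
        # A data node periodically reports which blocks it stores;
        # the NameNode uses these reports to build block -> location maps.
        self.live_nodes.add(node)
        for b in block_ids:
            self.block_locations.setdefault(b, set()).add(node)

    def locate(self, path):
        # Resolve a file to the set of nodes holding each of its blocks.
        return [sorted(self.block_locations.get(b, set()))
                for b in self.file_to_blocks[path]]


nn = NameNode()
nn.add_file("/logs/day1", ["blk_1", "blk_2"])
nn.record_block_report("node-a", ["blk_1"])
nn.record_block_report("node-b", ["blk_1", "blk_2"])
print(nn.locate("/logs/day1"))  # [['node-a', 'node-b'], ['node-b']]
```

With this view, scheduling a re-replication is just a scan for blocks whose location set has fallen below the target replica count.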
- Data Servers: Also known as chunk servers, they write data to files after a client has consulted the Master Node about where the write should go. Depending on the file's replication factor, a write pipeline is set up between that many data nodes. A write is considered successful only when all replicas are written, which ensures data consistency. Data nodes also send periodic block reports to the Master Node, which uses them to map each block of a file to its locations.
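The all-replicas-or-nothing write pipeline can be sketched as follows. This is a simplified simulation under assumed names (`DataNode`, `pipelined_write`); real HDFS streams the block through the pipeline in packets rather than node by node.

```python
# Illustrative simulation of a replication write pipeline (not real HDFS code).
class DataNode:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.blocks = []

    def store(self, block):
        # A healthy node accepts the block and acknowledges the write.
        if self.healthy:
            self.blocks.append(block)
            return True
        return False


def pipelined_write(block, pipeline):
    """Send a block down a chain of data nodes; the write succeeds
    only if every replica in the pipeline acknowledges it."""
    for node in pipeline:
        if not node.store(block):
            return False  # one failed replica fails the whole write
    return True


nodes = [DataNode("node-a"), DataNode("node-b"), DataNode("node-c")]
print(pipelined_write("blk_1", nodes))  # True: all three replicas written
```

Requiring acknowledgements from every node in the pipeline is what gives the strong consistency guarantee described above, at the cost of a write stalling if any replica in the chain is slow or down.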
- Client: It talks to the Master Node and fetches files from the data servers to run map and reduce on them. The Master Node responds with multiple locations from which the client can read the data blocks. When more than one replica is available, the client chooses the nearest data node to avoid burdening the network.
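The nearest-replica choice can be sketched as a simple selection over the locations the Master Node returns. The numeric "distance" below is a stand-in for real network-topology distance (same node < same rack < other rack); the function name is illustrative.

```python
# Illustrative sketch of nearest-replica selection (distances are assumed
# stand-ins for network topology distance, not a real Hadoop API).
def choose_replica(distances_from_client):
    """Given {data_node: distance} for the replicas of a block,
    pick the closest node to read from."""
    return min(distances_from_client, key=distances_from_client.get)


# The client reads blk_1 from the same-rack node rather than a remote one.
replicas = {"node-same-rack": 1, "node-other-rack": 2, "node-remote": 4}
print(choose_replica(replicas))  # node-same-rack
```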