MapReduce

MapReduce is a framework developed by Google to handle large amounts of data in a timely and efficient manner. One of the most famous software frameworks that uses MapReduce is Apache Hadoop MapReduce.

MapReduce takes advantage of numerous servers across which data can be distributed and managed. Like every good framework, MapReduce abstracts away the underlying processes that run during the execution of user programs, such as fault tolerance, data partitioning, and data aggregation. These abstractions let the user focus on the high-level logic of the program while trusting the framework to handle everything under the hood.

How it works

The workflow that MapReduce follows is:

  • Partitioning
  • Map
  • Intermediate Files
  • Reduce
  • Aggregate

There are several Map Workers and Reduce Workers, but only one Master Node, which assigns tasks to the Map and Reduce Workers.

Partitioning

The data usually arrives as one large chunk. It must first be partitioned into smaller, more manageable pieces that the Map Workers can process efficiently.
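As an illustration, partitioning can be sketched as distributing input records across map workers. This is a minimal sketch, assuming the input is a list of text records and that a simple round-robin split is acceptable; the `partition` helper name is hypothetical:

```python
def partition(records, num_workers):
    """Distribute input records round-robin across the map workers."""
    return [records[i::num_workers] for i in range(num_workers)]

# Four small "files" split across two map workers.
chunks = partition(["a b", "b c", "c a", "a a"], 2)
# chunks[0] -> ["a b", "c a"], chunks[1] -> ["b c", "a a"]
```

Real systems split by byte size rather than record count, but the idea is the same: each map worker gets an independent slice of the input.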

Map

Map Workers receive the data as <key, value> pairs (the key is the filename and the value is the file's content). Each Map Worker processes its data according to the user-defined Map function to generate intermediate <key, value> pairs.
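As a concrete example, the canonical MapReduce word count uses a Map function that emits a `<word, 1>` pair for every word it sees. A minimal sketch (the name `map_function` is illustrative):

```python
def map_function(filename, content):
    """Emit an intermediate <word, 1> pair for every word in the file."""
    return [(word, 1) for word in content.split()]

pairs = map_function("doc1.txt", "the cat sat on the mat")
# -> [("the", 1), ("cat", 1), ("sat", 1), ("on", 1), ("the", 1), ("mat", 1)]
```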

Intermediate Files

The intermediate pairs are partitioned into R partitions, where R is the number of Reduce Workers. These files are buffered in memory until the Master Node forwards them to the Reduce Workers.
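A common way to split intermediate pairs into R partitions is to hash each key modulo R, which guarantees that every pair with the same key lands in the same reduce partition. A sketch, using a simple deterministic string hash for illustration:

```python
def stable_hash(key):
    """Simple deterministic string hash (illustrative; real systems use stronger hashes)."""
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) % (2 ** 32)
    return h

def partition_for_reducers(pairs, R):
    """Route each intermediate pair to one of R reduce partitions by key hash."""
    buckets = [[] for _ in range(R)]
    for key, value in pairs:
        buckets[stable_hash(key) % R].append((key, value))
    return buckets

buckets = partition_for_reducers([("the", 1), ("cat", 1), ("the", 1)], 2)
# Both ("the", 1) pairs land in the same bucket.
```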

Reduce

Once the Reduce Workers receive the buffered data, they sort it by key and group the pairs that share the same key.
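Continuing the word-count example, a Reduce Worker can sort its pairs by key, group them, and sum the values of each group. A minimal sketch:

```python
from itertools import groupby

def reduce_function(pairs):
    """Sort intermediate pairs by key, then combine the values of each group."""
    pairs = sorted(pairs)
    return {key: sum(value for _, value in group)
            for key, group in groupby(pairs, key=lambda pair: pair[0])}

counts = reduce_function([("the", 1), ("cat", 1), ("the", 1)])
# -> {"cat": 1, "the": 2}
```

Sorting first is what makes grouping possible in a single pass: after the sort, all pairs with the same key are adjacent.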

Aggregate

The Master Node is notified when the Reduce Workers finish their tasks. Finally, the sorted data is aggregated and R output files are generated for the user.
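Because each key is routed to exactly one Reduce Worker, the R output files have disjoint key sets and can simply be merged. A sketch of the final aggregation (the `aggregate` name is illustrative):

```python
def aggregate(reducer_outputs):
    """Merge the R reducer outputs into one final result (keys are disjoint)."""
    final = {}
    for output in reducer_outputs:
        final.update(output)
    return final

result = aggregate([{"cat": 1}, {"mat": 1, "the": 2}])
# -> {"cat": 1, "mat": 1, "the": 2}
```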
