This lesson introduces the implementation of the Reduce phase of a MapReduce job.


Let’s look at the reduce phase of a MapReduce job. The reduce tasks work on the intermediate output produced by the map tasks. Like the map tasks, the reduce tasks are completely independent of one another; they do not communicate. However, each reduce task requires the intermediate key/value pairs produced by the map tasks as input. This communication is facilitated by the Hadoop framework and doesn’t require user intervention.

A reducer is the node that runs a reduce task. Each reducer processes the data in its assigned partition; the map tasks partition their output so that each partition can be assigned to exactly one reduce task.

Note that all records for a given key reside in a single partition, allowing a single reduce task to process all data for a given key.
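The grouping behavior can be sketched in plain Java, without any Hadoop dependency. The class and method names below (`ReduceSketch`, `shuffle`, `reduce`) are hypothetical; the sketch only mimics how the framework collects every value for a key into one place so that a single reduce call can process them, using a word-count-style sum as the reduce logic.

```java
import java.util.*;

// Hypothetical sketch (no Hadoop dependency): the framework hands each
// reduce call one key plus all of that key's values. Here "reduce" sums
// integer counts, as in the classic word-count example.
class ReduceSketch {
    // Simulates the shuffle: group intermediate (key, value) pairs so
    // that every value for a given key lands in a single list.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // The reduce function: invoked once per key, iterating its values.
    static int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }
}
```

In real Hadoop code, the same shape appears as a subclass of `Reducer` whose `reduce` method receives a key and an `Iterable` of values; the shuffle itself is done by the framework, not by user code.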

Moreover, partitions are only created when the number of reducers is greater than one. By default, a hash of the key determines which partition a record goes to, though a custom partitioning scheme can also be specified. A single reduce task invokes the reduce function once for every key in its assigned partition; the function receives the key and an iterator over the list of values associated with that key.
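The default hash-based rule can be illustrated in a few lines of plain Java. The class name `PartitionSketch` is hypothetical, but the formula mirrors the one used by Hadoop's `HashPartitioner`: the key's hash code, masked to keep it non-negative, modulo the number of reduce tasks.

```java
// Hypothetical sketch of hash partitioning, mirroring the formula in
// Hadoop's HashPartitioner: (hash & Integer.MAX_VALUE) % numReducers.
class PartitionSketch {
    static int partition(String key, int numReducers) {
        // Masking with Integer.MAX_VALUE clears the sign bit, so the
        // result of the modulo is always a valid partition index.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
}
```

Because the same key always hashes to the same value, every record for a given key ends up in the same partition, which is exactly the property the reduce phase relies on. With a single reducer, every key maps to partition 0, which is why no partitioning is needed in that case.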
