In this lesson, we will learn about a popular algorithm that is frequently used for batch processing of huge volumes of data. Google published this algorithm in 2004, and it was later adopted by many data processing systems, such as Apache Hadoop and Apache Spark.

The MapReduce algorithm

We’ll first look at this algorithm with an example. First, let’s imagine the following scenario:

  • You have all the text of a piece of classic English literature.
  • You want to count the occurrence of each word in the whole text.
  • The data is stored in some persistent storage.
  • The data is so huge that it cannot be loaded into the memory of a single physical machine. This means you have to use multiple machines.

Given the above, let’s discuss an approach to count the words.

Step 1: Split the data

First, the data must be split into chunks so that multiple machines can each load individual chunks into their own memory. As we mentioned, the data itself is way beyond the memory of a single machine. After the data is split, multiple machines can load the chunks and process them independently. Generally, these machines are the mapper machines; each loads one or more of the individual chunks and runs the next step of the algorithm.

In our example, a chunk from the English literature is just a file containing a portion of the literature.
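As a rough illustration, here is a minimal single-machine sketch of the splitting step. The file path, chunk size, and reading strategy are assumptions for illustration; a real splitter would also take care not to cut a word in half at a chunk boundary.

```python
# A minimal sketch of the splitting step. The file path and chunk size
# are illustrative assumptions, not part of the lesson's setup.

def split_into_chunks(path, chunk_size=64 * 1024 * 1024):
    """Yield successive chunks of roughly chunk_size characters."""
    with open(path, "r", encoding="utf-8") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # Note: this naive reader may cut a word in half at a chunk
            # boundary; a real splitter would read on to the next space.
            yield chunk

# Each yielded chunk would be handed to a different mapper machine.
```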

Step 2: Map

This is the first core part of the MapReduce algorithm. In this step, each mapper machine takes every individual item from its loaded chunks and runs it through a mapper function. The result is generally a key-value pair.

In our case, each mapper machine loads a few chunks, picks the words from each chunk one at a time, and runs each word through a mapper function. But what would the mapper function look like?

The most obvious choice is to create a key-value pair, where the key is the word itself, and the value is 1.

If the input to the mapper is "work", then the mapper will return ("work", 1).
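As a sketch, such a mapper might look like this in Python; the function names are illustrative, not part of any framework API.

```python
# A sketch of the mapper described above: one word in, one
# (word, 1) key-value pair out.

def mapper(word):
    return (word, 1)

# A mapper machine applies it to every word in a loaded chunk:
def map_chunk(chunk):
    return [mapper(word) for word in chunk.split()]

# map_chunk("work play good work")
# -> [('work', 1), ('play', 1), ('good', 1), ('work', 1)]
```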

Step 3: Local aggregation

In this optional step, a mapper machine aggregates its key-value pairs locally. For example, assume we have two mappers, each loaded with one chunk of data.

Mapper 1: work play good work
Mapper 2: good bad play play work

After mapping, we get

Mapper 1: (work, 1), (play, 1), (good, 1), (work, 1)
Mapper 2: (good, 1), (bad, 1), (play, 1), (play, 1), (work, 1)

Now in each mapper, the key-value pairs can be aggregated based on the keys. So we get

Mapper 1: (work, 2), (play, 1), (good, 1)
Mapper 2: (good, 1), (bad, 1), (play, 2), (work, 1)
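In MapReduce frameworks, this local aggregation is usually performed by a "combiner". Here is a minimal sketch, assuming the mapper output is a list of (word, 1) pairs as above:

```python
from collections import Counter

# A sketch of local aggregation (a "combiner"): sum the counts per key
# within one mapper before anything crosses the network.

def combine(pairs):
    counts = Counter()
    for word, count in pairs:
        counts[word] += count
    return list(counts.items())

# combine([('work', 1), ('play', 1), ('good', 1), ('work', 1)])
# -> [('work', 2), ('play', 1), ('good', 1)]
```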

Step 4: Shuffle and sort

This step is the beginning of the reduction part of the MapReduce algorithm.

In this step, the key-value pairs from step 3 are sent to reducer machines based on their keys, typically by hashing the key. As a result, the data is shuffled across different machines, and all pairs with the same key end up on the same reducer. The data received by a reducer is then sorted by key.

For example, assume we have two reducer machines. We shuffle the output from the previous step using the word as the key. After shuffling and sorting, we get

Reducer 1: (play, 1), (play, 2), (work, 2), (work, 1)
Reducer 2: (bad, 1), (good, 1), (good, 1)

Note how the data in each reducer is sorted by the key. For English words, this could be as simple as the lexicographic order.
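One common way to decide which reducer a pair goes to is to hash its key, as in the sketch below; the exact partitioning scheme is framework-specific.

```python
# A sketch of the shuffle: route each pair to a reducer by hashing its
# key, then sort each reducer's data by key. Python's built-in hash()
# is used only for illustration; real systems use a stable hash so the
# same key always lands on the same reducer.

def shuffle(mapper_outputs, num_reducers):
    buckets = [[] for _ in range(num_reducers)]
    for pairs in mapper_outputs:
        for word, count in pairs:
            buckets[hash(word) % num_reducers].append((word, count))
    return [sorted(bucket) for bucket in buckets]
```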

Step 5: Reduce

In this final step, each reducer aggregates the values for each key and produces the final result. For our example, we get

Reducer 1: (play, 3), (work, 3)
Reducer 2: (bad, 1), (good, 2)
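Because each reducer's input is sorted by key, all counts for one word arrive consecutively and can be summed in a single pass, as in this sketch:

```python
from itertools import groupby
from operator import itemgetter

# A sketch of the reduce step: since the pairs are sorted by key,
# consecutive pairs with the same word can be grouped and summed.

def reduce_pairs(sorted_pairs):
    return [(word, sum(count for _, count in group))
            for word, group in groupby(sorted_pairs, key=itemgetter(0))]

# reduce_pairs([('play', 1), ('play', 2), ('work', 2), ('work', 1)])
# -> [('play', 3), ('work', 3)]
```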

MapReduce visualization

The following diagram shows the whole flow.
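To tie the steps together, here is a toy single-machine simulation of the whole flow on the two example chunks. It is only a sketch; which reducer receives which word depends on the hash function, so the grouping may differ from the tables above.

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter

chunks = ["work play good work", "good bad play play work"]
NUM_REDUCERS = 2

# Map + local aggregation on each "mapper".
mapped = [list(Counter(chunk.split()).items()) for chunk in chunks]

# Shuffle: route every pair to a reducer by key, then sort by key.
buckets = [[] for _ in range(NUM_REDUCERS)]
for pairs in mapped:
    for word, count in pairs:
        buckets[hash(word) % NUM_REDUCERS].append((word, count))
buckets = [sorted(bucket) for bucket in buckets]

# Reduce: sum the counts for each key and print the final result.
for i, bucket in enumerate(buckets, start=1):
    totals = [(word, sum(c for _, c in group))
              for word, group in groupby(bucket, key=itemgetter(0))]
    print(f"Reducer {i}: {totals}")
```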
