In this lesson, we will learn about a popular algorithm that is frequently used for batch processing of huge volumes of data. Google published this algorithm in 2004, and it was later adopted by many data processing systems, such as Apache Hadoop and Apache Spark.

The MapReduce algorithm

We’ll first look at this algorithm with an example. First, let’s imagine the following scenario:

  • You have all the text of a piece of classic English literature.
  • You want to count the occurrence of each word in the whole text.
  • The data is stored in some persistent storage.
  • The data is so huge that it cannot be loaded into the memory of a single physical machine. This means you have to use multiple machines.

Given the above, let’s discuss an approach to count the words.

Step 1: Split the data

First, the data must be split into chunks so that multiple machines can each load individual chunks into their own memory. As we mentioned, the data itself is way beyond the memory of a single machine. After the data is split, multiple machines can load the chunks and process them independently. Generally, these machines are the mapper machines; each loads one or more of the individual chunks and runs the next step of the algorithm.

In our example, a chunk from the English literature is just a file containing a portion of the literature.
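As a rough illustration, here is a minimal single-machine sketch of the splitting step. The file path, chunk size, and reading strategy are assumptions for illustration; a real splitter would also take care not to cut a word in half at a chunk boundary.

```python
# A minimal sketch of the splitting step. The file path and chunk size
# are illustrative assumptions, not part of the lesson's setup.

def split_into_chunks(path, chunk_size=64 * 1024 * 1024):
    """Yield successive chunks of roughly chunk_size characters."""
    with open(path, "r", encoding="utf-8") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # Note: this naive reader may cut a word in half at a chunk
            # boundary; a real splitter would read on to the next space.
            yield chunk

# Each yielded chunk would be handed to a different mapper machine.
```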

Step 2: Map

This is the first core part of the MapReduce algorithm. In this step, each mapper machine takes every individual item from its loaded chunks and runs it through a mapper function. The result is generally a key-value pair.

In our case, each mapper machine loads a few chunks, picks the words from each chunk one at a time, and runs each word through a mapper function. But what would the mapper function look like?

The most obvious choice is to create a key-value pair, where the key is the word itself, and the value is 1.

If the input to the mapper is "work", then the mapper will return ("work", 1).
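As a sketch, such a mapper might look like this in Python; the function names are illustrative, not part of any framework API.

```python
# A sketch of the mapper described above: one word in, one
# (word, 1) key-value pair out.

def mapper(word):
    return (word, 1)

# A mapper machine applies it to every word in a loaded chunk:
def map_chunk(chunk):
    return [mapper(word) for word in chunk.split()]

# map_chunk("work play good work")
# -> [('work', 1), ('play', 1), ('good', 1), ('work', 1)]
```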

Step 3: Local aggregation

In this optional step, a mapper machine aggregates its key-value pairs locally. For example, assume we have two mappers, each loaded with one chunk of data.

Mapper 1: work play good work
Mapper 2: good bad play play work

After mapping, we get

Mapper 1: (work, 1), (play, 1), (good, 1), (work, 1)
Mapper 2: (good, 1), (bad, 1), (play, 1), (play, 1), (work, 1)

Now in each mapper, the key-value pairs can be aggregated based on the keys. So we get

Mapper 1: (work, 2), (play, 1), (good, 1)
Mapper 2: (good, 1), (bad, 1), (play, 2), (work, 1)
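In MapReduce frameworks, this local aggregation is usually performed by a "combiner". Here is a minimal sketch, assuming the mapper output is a list of (word, 1) pairs as above:

```python
from collections import Counter

# A sketch of local aggregation (a "combiner"): sum the counts per key
# within one mapper before anything crosses the network.

def combine(pairs):
    counts = Counter()
    for word, count in pairs:
        counts[word] += count
    return list(counts.items())

# combine([('work', 1), ('play', 1), ('good', 1), ('work', 1)])
# -> [('work', 2), ('play', 1), ('good', 1)]
```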

Step 4: Shuffle and sort

This step is the beginning of the reduction part of the MapReduce algorithm.

In this step, the key-value pairs from step 3 are sent to reducer machines based on their keys, typically by hashing the key. As a result, the data is shuffled across different machines, and all pairs with the same key end up on the same reducer. The data received by a reducer is then sorted by key.

For example, assume we have two reducer machines. We shuffle the output from the previous step using the word as the key. After shuffling and sorting, we get

Reducer 1: (play, 1), (play, 2), (work, 2), (work, 1)
Reducer 2: (bad, 1), (good, 1), (good, 1)

Note how the data in each reducer is sorted by the key. For English words, this could be as simple as the lexicographic order.
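One common way to decide which reducer a pair goes to is to hash its key, as in the sketch below; the exact partitioning scheme is framework-specific.

```python
# A sketch of the shuffle: route each pair to a reducer by hashing its
# key, then sort each reducer's data by key. Python's built-in hash()
# is used only for illustration; real systems use a stable hash so the
# same key always lands on the same reducer.

def shuffle(mapper_outputs, num_reducers):
    buckets = [[] for _ in range(num_reducers)]
    for pairs in mapper_outputs:
        for word, count in pairs:
            buckets[hash(word) % num_reducers].append((word, count))
    return [sorted(bucket) for bucket in buckets]
```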

Step 5: Reduce

In this final step, each reducer aggregates the values for each key and produces the final result. For our example, we get

Reducer 1: (play, 3), (work, 3)
Reducer 2: (bad, 1), (good, 2)
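Because each reducer's input is sorted by key, all counts for one word arrive consecutively and can be summed in a single pass, as in this sketch:

```python
from itertools import groupby
from operator import itemgetter

# A sketch of the reduce step: since the pairs are sorted by key,
# consecutive pairs with the same word can be grouped and summed.

def reduce_pairs(sorted_pairs):
    return [(word, sum(count for _, count in group))
            for word, group in groupby(sorted_pairs, key=itemgetter(0))]

# reduce_pairs([('play', 1), ('play', 2), ('work', 2), ('work', 1)])
# -> [('play', 3), ('work', 3)]
```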

MapReduce visualization

The following diagram shows the whole flow.
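To tie the steps together, here is a toy single-machine simulation of the whole flow on the two example chunks. It is only a sketch; which reducer receives which word depends on the hash function, so the grouping may differ from the tables above.

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter

chunks = ["work play good work", "good bad play play work"]
NUM_REDUCERS = 2

# Map + local aggregation on each "mapper".
mapped = [list(Counter(chunk.split()).items()) for chunk in chunks]

# Shuffle: route every pair to a reducer by key, then sort by key.
buckets = [[] for _ in range(NUM_REDUCERS)]
for pairs in mapped:
    for word, count in pairs:
        buckets[hash(word) % NUM_REDUCERS].append((word, count))
buckets = [sorted(bucket) for bucket in buckets]

# Reduce: sum the counts for each key and print the final result.
for i, bucket in enumerate(buckets, start=1):
    totals = [(word, sum(c for _, c in group))
              for word, group in groupby(bucket, key=itemgetter(0))]
    print(f"Reducer {i}: {totals}")
```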
