Basics

This lesson introduces the MapReduce paradigm.

Map and Reduce

The name MapReduce combines “map” and “reduce”, which aptly describes the two phases it comprises. MapReduce is based on a computing model introduced by Google, in which data-parallel computations are executed on clusters of unreliable machines by systems that automatically provide locality-aware scheduling, fault tolerance, and load balancing. In simpler terms, think of MapReduce as a divide-and-conquer strategy. A huge data set is divided among worker machines, and each machine processes its share. Once processing is complete, the results from each machine are aggregated to present a final solution. The data flow in various phases of a MapReduce job is shown below.
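The divide-and-aggregate idea can also be sketched in a few lines of single-process Python. This is only an illustration of the paradigm, not how a real framework is implemented: the word-count task, the sample `chunks`, and the function names `map_phase`, `shuffle`, and `reduce_phase` are all assumptions made for this example, and the list of strings stands in for data that would really be split across worker machines.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group all emitted values by key so each key can be reduced once."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: aggregate the values collected for a single key."""
    return key, sum(values)

# Hypothetical input: each string plays the role of the share of the
# data set handed to one worker machine.
chunks = ["the quick brown fox", "the lazy dog", "the fox"]

# Map every chunk independently (on a cluster this runs in parallel).
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]

# Group the intermediate pairs by word, then reduce each group.
grouped = shuffle(mapped)
word_counts = dict(reduce_phase(key, values) for key, values in grouped.items())

print(word_counts)
```

Because each call to `map_phase` looks only at its own chunk, the calls could run on different machines without coordination. Only the grouping step needs to bring together values that share a key before the reduce step aggregates them.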
