
Basics

Explore the basics of MapReduce, a distributed computing model that processes big data using map and reduce phases. Understand how it divides tasks across clusters for scalable, fault-tolerant batch processing. Gain insight into its key-value input-output model and practical applications in big data analysis.

Map and Reduce

MapReduce is a concatenation of "map" and "reduce," which aptly describes the two phases it comprises. MapReduce is an implementation of a computing model introduced by Google, in which data-parallel computations are executed on clusters of unreliable machines. The underlying system automatically provides locality-aware scheduling, fault tolerance, and load balancing. In simpler terms, think of MapReduce as a divide-and-conquer strategy: a huge data set is divided among worker machines, and once each machine finishes processing its share, the partial results are aggregated into a final solution. The data flow in the various phases of a MapReduce job is shown below.

MapReduce is a programming model used to process large data sets on a cluster of commodity machines by using a distributed algorithm.
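To make the map and reduce phases concrete, here is a minimal single-process sketch of the classic word-count job. The input chunks, function names, and grouping step are illustrative assumptions, not part of any real framework: the "shuffle" that a real MapReduce system performs across the network is simulated here with an in-memory dictionary.

```python
from collections import defaultdict

# Hypothetical input, pre-split into chunks as if assigned to workers.
chunks = [
    "the quick brown fox",
    "the lazy dog",
    "the quick dog",
]

def map_phase(chunk):
    # Map: each worker emits (key, value) pairs -- here, (word, 1).
    return [(word, 1) for word in chunk.split()]

def shuffle(mapped):
    # Shuffle: group all intermediate values by key across workers.
    grouped = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Reduce: aggregate the grouped values for one key.
    return key, sum(values)

mapped = [map_phase(c) for c in chunks]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts["the"])  # 3
```

Because each chunk is mapped independently and each key is reduced independently, both phases parallelize naturally across machines, which is exactly what makes the model scale.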

For all its strengths, MapReduce is fundamentally a batch processing system and is not suitable for interactive analysis. You can’t run a query and get results back quickly. Queries typically take minutes or more, so it’s best for offline use, when there ...