Design Refinements in MapReduce: Part I

Explore advanced refinements in MapReduce design that improve data processing efficiency and scalability. Understand how to customize input/output types, optimize partitioning functions for balanced workloads, use combiners to reduce network bandwidth, and ensure deterministic and idempotent MapReduce tasks to maintain result consistency.

We'll cover the following...

Input and output types
- Input types
- Output types
Partitioning function
- Customization of the partitioning function
The Combiner function
- Comparison of the Combiner function and the Reduce function
Guaranteed ordering
Side effects
- Restrictions on the side-effects

Real-world systems are rarely designed in one go; it often takes many iterations to improve the design. Once the initial versions of a system are deployed in production, we get usage data and possibly new insights. In this lesson and the next, we will refine several aspects of the MapReduce design.

We present the refinements in the order in which they appear in the system's execution flow.

Input and output types

Let’s analyze the input and output types supported by the MapReduce library.

Input types

By default, the MapReduce library supports reading input data in several different formats. Each input type implementation knows how to split its data into meaningful ranges, and each range is then processed by a separate Map task.

Example

As we know, the input is parsed into key-value pairs before the Map tasks process it. In “text” mode, each line of the input is treated as one key-value pair:

- The key is the offset of the line in the input file.
- The value is the content of that line.

This mode ensures that range splits occur only at line boundaries, so no line is divided between two Map tasks.
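The text-mode behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `read_split` and the rule "a line belongs to the split in which it starts" are hypothetical choices made for this sketch, not the library's actual implementation.

```python
def read_split(data: bytes, start: int, end: int):
    """Yield (offset, line) for every line that begins inside [start, end).

    Assumed convention for this sketch: a line belongs to the split in
    which its first byte falls, so a split that starts mid-line skips
    ahead to the next line boundary.
    """
    pos = start
    if start > 0 and data[start - 1:start] != b"\n":
        # We landed in the middle of a line owned by the previous split.
        newline = data.find(b"\n", start)
        pos = len(data) if newline == -1 else newline + 1
    while pos < end and pos < len(data):
        newline = data.find(b"\n", pos)
        line_end = len(data) if newline == -1 else newline
        # Key: byte offset of the line. Value: the line's content.
        yield pos, data[pos:line_end].decode("utf-8")
        pos = line_end + 1


data = b"the quick fox\njumps over\nthe lazy dog\n"

# Cut the 38-byte input at byte 19, which falls inside "jumps over".
first = list(read_split(data, 0, 19))
second = list(read_split(data, 19, len(data)))

print(first)   # [(0, 'the quick fox'), (14, 'jumps over')]
print(second)  # [(25, 'the lazy dog')]
```

Even though the cut point falls inside the second line, that line is handed whole to the first split, and the second split starts at the next line boundary. Every line is therefore processed exactly once.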

Support for new input types

Users can add support for a new input type by providing their own implementation of a simple reader interface. A reader does not have to read from a file. For example, we can define a reader that reads records from a database or from a data structure mapped in memory.
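To make the idea of a reader interface concrete, here is a minimal sketch. The names `Reader`, `InMemoryReader`, `SqliteReader`, and `run_map` are hypothetical and exist only for this illustration; the real library's interface may look quite different. The sketch uses Python's standard `sqlite3` module as a stand-in for "a database".

```python
import sqlite3
from abc import ABC, abstractmethod
from typing import Iterator, Tuple


class Reader(ABC):
    """Hypothetical reader interface: produces key-value records for Map."""

    @abstractmethod
    def records(self) -> Iterator[Tuple[object, object]]:
        ...


class InMemoryReader(Reader):
    """Reads records from a dictionary held in memory."""

    def __init__(self, mapping):
        self._mapping = mapping

    def records(self):
        yield from self._mapping.items()


class SqliteReader(Reader):
    """Reads records from a database query that returns two columns."""

    def __init__(self, conn, query):
        self._conn = conn
        self._query = query

    def records(self):
        for key, value in self._conn.execute(self._query):
            yield key, value


def run_map(reader, map_fn):
    """Apply map_fn to every record; the caller never sees the data source."""
    output = []
    for key, value in reader.records():
        output.extend(map_fn(key, value))
    return output


def word_lengths(key, value):
    # Example Map function: emit (word, length) for each word in the value.
    return [(word, len(word)) for word in value.split()]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany("INSERT INTO docs VALUES (?, ?)", [(1, "map reduce"), (2, "gfs")])

from_db = run_map(SqliteReader(conn, "SELECT id, body FROM docs ORDER BY id"), word_lengths)
from_memory = run_map(InMemoryReader({"a": "map reduce", "b": "gfs"}), word_lengths)
```

The Map function is identical in both runs. Only the reader changes, which is the point of hiding the input source behind an interface.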

Output types

The MapReduce library also supports several output types by default and, as with input types, lets users add support for new ones.

Custom input and output types are a powerful extension because they let application programmers read data from and write data to many different sources and sinks.
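The output side can be sketched in the same way. Again, `Writer`, `TsvWriter`, `DictWriter`, and `emit_all` are hypothetical names used only for illustration, assuming that an output type is anything able to accept key-value pairs produced by the Reduce tasks.

```python
import io
from abc import ABC, abstractmethod


class Writer(ABC):
    """Hypothetical writer interface: consumes key-value pairs from Reduce."""

    @abstractmethod
    def write(self, key, value) -> None:
        ...


class TsvWriter(Writer):
    """Writes each pair as one tab-separated line to a text stream."""

    def __init__(self, stream):
        self._stream = stream

    def write(self, key, value):
        self._stream.write(f"{key}\t{value}\n")


class DictWriter(Writer):
    """Collects pairs in an in-memory dictionary instead of a file."""

    def __init__(self):
        self.results = {}

    def write(self, key, value):
        self.results[key] = value


def emit_all(pairs, writer):
    """Send every Reduce output pair to whichever sink was chosen."""
    for key, value in pairs:
        writer.write(key, value)


reduce_output = [("fox", 2), ("dog", 1)]

buffer = io.StringIO()
emit_all(reduce_output, TsvWriter(buffer))

sink = DictWriter()
emit_all(reduce_output, sink)
```

The same Reduce output ends up either as tab-separated text or as an in-memory structure, depending only on which writer is plugged in.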
