Design Refinements in MapReduce: Part II

Discover how to improve MapReduce systems through refinements like status pages for real-time job monitoring, customizable counters for event tracking, mechanisms to skip bad records that cause crashes, and local execution for debugging and testing. This lesson helps you enhance fault tolerance, error handling, and job transparency in large-scale distributed processing.

We'll cover the following...

Status information
- Status pages
Counters
- Process of accumulating the counter output
- Applying the counters
Skipping bad records
- Process of skipping bad records
Local execution

The following refinements give us insight into our system’s status and performance and add error-handling mechanisms and debugging facilities. They supplement the previously covered refinements and augment the overall efficiency of the design.

Status information

Even with all the distribution and parallelization, a MapReduce job is a time-consuming process. For example, the best performance to date for Hadoop (an open source implementation of Google’s MapReduce library) in processing 102.5 TB of data is 4,328 seconds (about 1.2 hours), achieved by Thomas Graves of Yahoo! Inc. (source: sortbenchmark.org). He used the following configuration for this task: 2,100 nodes, each with two 2.3 GHz hex-core Xeon E5-2630 processors, 64 GB of memory, and 12 x 3 TB disks.

Because jobs run this long, users benefit from being able to check the status of their MapReduce jobs, gain insights from it, and decide whether any modifications are required.

Status pages

The manager houses an internal HTTP server and provides users access to a set of status pages. These status pages present the computation progress, such as the number of completed tasks, the number of in-progress tasks, input data size, intermediate data size, output data size, processing ...
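To make the idea of a manager-hosted status page concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not how any particular MapReduce implementation does it: the `JobStatus` fields, the `/status` path, the JSON format, and the helper names `start_status_server` and `fetch_status` are all hypothetical and chosen only for this example.

```python
import http.client
import json
import threading
from dataclasses import asdict, dataclass
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


@dataclass
class JobStatus:
    """Progress snapshot kept by the manager (field names are hypothetical)."""
    completed_tasks: int = 0
    in_progress_tasks: int = 0
    input_bytes: int = 0
    intermediate_bytes: int = 0
    output_bytes: int = 0


def make_handler(status: JobStatus):
    """Build a request handler class bound to one job's status object."""

    class StatusHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Assumption: a single page at /status; anything else is a 404.
            if self.path != "/status":
                self.send_error(404, "unknown page")
                return
            body = json.dumps(asdict(status)).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # suppress per-request logging to keep the sketch quiet

    return StatusHandler


def start_status_server(status: JobStatus, port: int = 0) -> ThreadingHTTPServer:
    """Serve the status page on a background thread; port 0 picks a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), make_handler(status))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_status(server: ThreadingHTTPServer) -> dict:
    """Read the status page the way a user's script might."""
    host, port = server.server_address[:2]
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", "/status")
        return json.loads(conn.getresponse().read())
    finally:
        conn.close()


if __name__ == "__main__":
    status = JobStatus(completed_tasks=3, in_progress_tasks=2, input_bytes=1024)
    server = start_status_server(status)
    print(fetch_status(server))
    server.shutdown()
    server.server_close()
```

In this sketch the manager updates the `JobStatus` object as tasks finish, and each request reads the current values, so a user who polls the page sees progress change over time. A real manager would also need to guard the shared state against concurrent updates, which is omitted here for brevity.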
