Introduction to Big Data Processing Systems

Discover the fundamentals of big data processing by examining key systems such as MapReduce, Spark, and Kafka. Learn how these systems address large-scale data challenges by enabling efficient batch processing, low-latency operations, and real-time data streaming. Understand their use cases, trade-offs, and the underlying principles that support modern distributed data processing.

Motivation

It is no exaggeration to say that data runs our world. From a map application that calculates accurate travel times by taking dynamic traffic information into account, to personalized recommendations in services such as shopping and music streaming, data must be harnessed to extract the right information.

What we will learn

We have selected three big data processing papers to discuss in the following few chapters:

[MapReduce] Dean, Jeffrey, and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI’04: Sixth Symposium on Operating Systems Design and Implementation (2004): pp. 137-150. ...
