High-level Design of Spark
Understand the high-level design of Spark by exploring its key components: resilient distributed datasets (RDDs), the driver node, and worker nodes. Learn how Spark optimizes data processing through in-memory computation, task scheduling, and the use of lineage graphs to enhance scalability and fault tolerance.
Building blocks
The building blocks of Spark include resilient distributed datasets (RDDs), the driver node, and worker nodes. This lesson briefly describes each of these components.
Resilient distributed datasets (RDDs)
An RDD is an abstraction: a read-only, fault-tolerant collection of objects partitioned across a cluster of machines.
RDDs can be created in two ways: by applying a transformation to an existing RDD or by reading data from a distributed file system.
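As a concrete illustration, here is a minimal Scala sketch of both creation paths. The application name and the input path are hypothetical, and the local master setting is only for experimentation.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddCreationSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-creation-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Way 1: create an RDD by reading data from a (distributed) file system.
    // "hdfs:///data/people.txt" is a hypothetical path.
    val lines = sc.textFile("hdfs:///data/people.txt")

    // Way 2: create an RDD by applying a transformation to an existing RDD.
    val upper = lines.map(_.toUpperCase)

    println(upper.count()) // an action that forces evaluation

    sc.stop()
  }
}
```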
Every RDD is divided into partitions of data, and those partitions are distributed across the machines of the cluster.
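Partitioning can be observed directly. The sketch below, which assumes a `SparkContext` named `sc` as in the previous example, creates an RDD with an explicit partition count; `numSlices` is the name of that argument in Spark's Scala API.

```scala
// Assuming an existing SparkContext `sc` as in the previous sketch.
val numbers = sc.parallelize(1 to 1000, numSlices = 8)
println(numbers.getNumPartitions) // 8: each partition lives on some worker node
```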
For example, suppose an RDD is initially created from a file, a second RDD is then derived from it by a transformation, and so on.
Spark keeps a graph, called the lineage graph, that records how each RDD was derived from its sources. If a partition is lost, Spark can use the lineage graph to recompute just that partition rather than replicating the entire dataset.
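Spark lets us inspect this lineage: calling `toDebugString` on an RDD prints the chain of RDDs it was derived from. Continuing the earlier sketch (the exact output format varies by Spark version):

```scala
// Assuming `upper` from the earlier sketch.
val longLines = upper.filter(_.length > 10)
// Prints the lineage: longLines <- upper <- lines <- the input file.
println(longLines.toDebugString)
```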
RDDs implement a common interface that exposes the following details (sketched in code after this list):
A list of partition objects, each containing its own set of data
An iterator that traverses the data in a partition
A list of worker nodes where the partition's data can be accessed quickly (its preferred locations)
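The following is a simplified Scala sketch of such an interface. It is illustrative only; Spark's real internal `RDD` class uses different, more elaborate signatures.

```scala
// Hypothetical, simplified stand-ins; not Spark's actual internals.
final case class Partition(index: Int)

trait SimpleRDD[T] {
  // Partition objects, each describing one chunk of the dataset.
  def partitions: Seq[Partition]

  // An iterator over the data held in one partition.
  def compute(split: Partition): Iterator[T]

  // Worker nodes where this partition's data can be accessed quickly.
  def preferredLocations(split: Partition): Seq[String]
}
```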