Building blocks

Spark's building blocks include resilient distributed datasets (RDDs), the driver, and worker nodes. This lesson briefly describes each of these components.

Resilient distributed datasets (RDDs)

An RDD is an abstraction: a read-only, fault-tolerant collection of objects stored across a cluster of machines.
RDDs can be created in two ways: by applying a transformation to an existing RDD or by reading data from a distributed file system.
- Whenever an RDD is created, its data is divided into partitions.
- Those partitions are stored across the machines of the cluster.
For example, suppose an RDD is initially created from a file, a second RDD is then created from that RDD, and so on.
Spark keeps a graph, called a lineage graph, that records the source of every RDD.
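
The ideas above (read-only partitions, creating an RDD from a source or from another RDD, and the lineage graph) can be illustrated with a small sketch. This is a single-process toy model, not Spark's API: the class `ToyRDD` and its methods `from_records`, `map`, `filter`, `lineage`, and `collect` are hypothetical names chosen for illustration. It also computes results eagerly and keeps all partitions in one process, whereas Spark evaluates transformations lazily and spreads partitions across machines.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen: instances are read-only once created
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical; not Spark's API)."""
    name: str
    partitions: tuple                  # one inner tuple per partition
    parent: Optional["ToyRDD"] = None  # lineage edge to the source RDD
    op: str = "source"                 # how this RDD was produced

    @staticmethod
    def from_records(name, records, num_partitions):
        # Stand-in for reading a file: deal records round-robin into partitions.
        parts = tuple(
            tuple(records[i::num_partitions]) for i in range(num_partitions)
        )
        return ToyRDD(name, parts)

    def map(self, name, fn):
        # A transformation returns a new RDD and leaves this one untouched.
        parts = tuple(tuple(fn(x) for x in p) for p in self.partitions)
        return ToyRDD(name, parts, parent=self, op="map")

    def filter(self, name, pred):
        parts = tuple(tuple(x for x in p if pred(x)) for p in self.partitions)
        return ToyRDD(name, parts, parent=self, op="filter")

    def lineage(self):
        # Walk the parent links back to the original source, oldest first.
        chain = []
        node = self
        while node is not None:
            chain.append(f"{node.name} ({node.op})")
            node = node.parent
        return list(reversed(chain))

    def collect(self):
        # Flatten all partitions into one list, partition by partition.
        return [x for p in self.partitions for x in p]


# An RDD created from a data source, then two RDDs derived from it.
lines = ToyRDD.from_records("lines", ["a", "bb", "ccc", "dddd"], num_partitions=2)
lengths = lines.map("lengths", len)
long_ones = lengths.filter("long_ones", lambda n: n > 2)

print(lines.partitions)     # (('a', 'ccc'), ('bb', 'dddd'))
print(long_ones.collect())  # [3, 4]
print(long_ones.lineage())  # ['lines (source)', 'lengths (map)', 'long_ones (filter)']
```

Because every derived RDD only records its parent and the operation that produced it, the chain returned by `lineage()` carries enough information to recompute a lost partition from the original source, which is the role the lineage graph plays in this lesson.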
