Shared Variables in Spark

Learn how Spark makes data sharing and information gathering efficient.

In addition to RDDs, Spark provides a second abstraction: distributed shared variables. We might want to send static data to all the workers (driver-to-worker information flow) or collect some state from all the workers (worker-to-driver information flow). Spark's shared variable abstraction helps with both of these scenarios.

Shared variables

Some operations require per-partition setup work, such as creating a random number generator for a specific distribution. Without shared variables, the user has to create that object and ship it to the workers every time a task runs on a partition. Shared variables help avoid this setup overhead: they can be used to aggregate data from all tasks, or to cache a large value on every worker node and reuse it across many Spark jobs without resending it to the whole cluster. Spark offers two types of shared variables: broadcast variables and accumulators.
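
The following PySpark sketch illustrates both kinds of shared variables working together; the lookup table, counts, and names are illustrative assumptions, not part of the lesson. A broadcast lookup table is cached on the workers, and an accumulator carries a count back to the driver.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shared-variables-sketch").getOrCreate()
sc = spark.sparkContext

# Broadcast variable: a read-only lookup table cached once on every worker.
country_names = sc.broadcast({"US": "United States", "DE": "Germany"})

# Accumulator: workers add to it, and the driver reads the final total.
bad_records = sc.accumulator(0)

def to_name(code):
    # Workers read the broadcast value from their local cache (driver-to-worker flow)
    # and bump the accumulator when a code is unknown (worker-to-driver flow).
    if code not in country_names.value:
        bad_records.add(1)
        return None
    return country_names.value[code]

result = sc.parallelize(["US", "DE", "XX"]).map(to_name).collect()
print(result)             # ['United States', 'Germany', None]
print(bad_records.value)  # 1

spark.stop()
```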

Broadcast variables

Normally, a variable used in a driver node's tasks is simply referenced in a closure (a function that can refer to variables in the scope where it was created). This process can be very inefficient in the following cases:

  • If the variable is large, like a machine learning model or a lookup table, because it has to be serialized with every task and deserialized on the worker node each time

  • If the variable is used in multiple jobs, because instead of being sent once, it has to be delivered with each job

This overhead creates the need for broadcast variables.
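
As a hedged sketch (the lookup table and jobs below are made-up examples, not from the lesson), this PySpark snippet contrasts broadcasting a lookup table once with letting it be captured in each task's closure, and then reuses the same broadcast value in a second job.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcast-sketch").getOrCreate()
sc = spark.sparkContext

# Pretend this dict is large, e.g., a trained model or a big lookup table.
zip_to_city = {"10001": "New York", "94105": "San Francisco"}

# Without broadcasting, zip_to_city would be captured in each task's closure
# and serialized with every task of every job that references it.
# Broadcasting ships it to each worker once and caches it there.
zip_lookup = sc.broadcast(zip_to_city)

orders = sc.parallelize([("10001", 20.0), ("94105", 35.5), ("10001", 12.0)])

# Job 1: label each order with its city using the cached broadcast value.
labeled = orders.map(lambda o: (zip_lookup.value.get(o[0], "unknown"), o[1]))
print(labeled.collect())

# Job 2: the same broadcast value is reused without resending the table.
print(labeled.reduceByKey(lambda a, b: a + b).collect())

spark.stop()
```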
