This lesson focuses on how to perform operations on an RDD's partitions in parallel to transform it into another RDD, and how to extract information from these distributed datasets. Spark provides parallel operations precisely for this purpose: users don't have to extract or transform data on each worker separately, because the Spark system applies each function simultaneously across all the workers holding an RDD's partitions. There are generally two types of operations we can perform on RDDs: transformations and actions.
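As a minimal sketch of the idea (assuming the Spark shell, where a SparkContext is already available as sc), the snippet below distributes a local collection across workers, applies one transformation, and runs one action:

```scala
// Minimal sketch, assuming the Spark shell's built-in SparkContext `sc`.
// parallelize() splits the local collection into partitions spread across workers.
val numbers = sc.parallelize(1 to 1000000)

// A transformation: produces a new RDD from the existing one.
val doubled = numbers.map(_ * 2)

// An action: Spark runs it on all partitions in parallel and returns a result.
println(doubled.count()) // 1000000
```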

Transformations

These are the operations applied to an RDD to get a new RDD. Transformations are lazy operations, i.e., they are executed only when an action is called. Instead of modifying the data immediately, Spark waits until an action is called and builds an execution plan so that all the transformations run efficiently when they are finally executed, possibly pipelining many transformations together. Since RDDs are immutable, the input RDD remains unchanged. Spark supports many transformations, such as map(), flatMap(), mapValues(), filter(), groupByKey(), reduceByKey(), union(), join(), cogroup(), crossProduct(), sample(), partitionBy(), and sort(). Applying a transformation to an RDD produces a new RDD, and applying further transformations to that result forms a transformation chain, or pipeline, as sketched below. Spark provides a graph-based representation of RDDs, called a lineage graph, to track the lineage of transformations.
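To illustrate lazy evaluation, here is a hedged sketch (the file path and its contents are hypothetical): the two transformations below return immediately without reading any data, and the whole pipeline runs only when the action collect() is called.

```scala
// Hypothetical log file; nothing is read until an action runs.
val lines = sc.textFile("logs.txt")

// Lazy transformations: Spark only records them in its execution plan.
val errors  = lines.filter(_.contains("ERROR"))
val lengths = errors.map(_.length)

// The action: only now does Spark read the file, pipelining
// filter() and map() over each partition in a single pass.
val result = lengths.collect()
```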

The lineage graph shown below contains a series of transformations on MMA fight data. First, the data is filtered to keep only UFC fights; then the winner of each fight is mapped to the integer 1; finally, the values for each fighter are reduced (summed) to give that fighter's total number of wins.
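The same pipeline can be sketched in code. Assuming a hypothetical input file fights.csv with one "league,winner" record per line, the filter, map, and reduceByKey chain mirrors the lineage graph above:

```scala
// Hypothetical input: one record per line in the form "league,winner",
// e.g. "UFC,Nunes" or "Bellator,Fedor".
val fights = sc.textFile("fights.csv")

// Keep only the UFC fights.
val ufc = fights.filter(_.startsWith("UFC"))

// Map each fight's winner to the integer 1, e.g. ("Nunes", 1).
val wins = ufc.map(line => (line.split(",")(1), 1))

// Reduce (sum) the 1s per fighter to get total wins per fighter.
val totals = wins.reduceByKey(_ + _)

// Action: triggers the whole lineage and prints (fighter, totalWins) pairs.
totals.collect().foreach(println)
```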
