Apache Spark is a data processing system that was initially developed at the University of California, Berkeley by Zaharia et al. (M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: Cluster Computing with Working Sets," Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, 2010) and then donated to the Apache Software Foundation.

Note that Apache Spark was developed in response to some of the limitations of MapReduce, described below.

Limitations of MapReduce

The MapReduce model made it possible to develop and run embarrassingly parallel computations on a large cluster of machines. However, every job had to read its input from disk and write its output back to disk. As a result, job execution latency had a lower bound determined by disk speeds. Consequently, MapReduce was not a good fit for:

  • Iterative computations, where a single job is executed multiple times or data is passed through multiple jobs.
  • Interactive data analysis, where a user wants to run multiple ad hoc queries on the same dataset.

Note that Spark addresses both of the above use cases.

Foundation of Spark

Spark is based on the concept of Resilient Distributed Datasets (RDD).

Resilient Distributed Datasets (RDD)

An RDD is a distributed memory abstraction used to perform in-memory computations on large clusters of machines in a fault-tolerant way. More concretely, an RDD is a read-only, partitioned collection of records.

RDDs can be created through operations on data in stable storage or other RDDs.
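As a minimal illustration, the Scala sketch below (assuming a local Spark installation; the application name and file path are placeholders) creates one RDD from data in stable storage and a second RDD derived from it:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddCreationSketch {
  def main(args: Array[String]): Unit = {
    // Local Spark context; the application name is arbitrary.
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-creation").setMaster("local[*]"))

    // An RDD created from data in stable storage (the path is a placeholder).
    val lines = sc.textFile("/data/events.log")

    // An RDD created from another RDD: filter returns a new, read-only RDD
    // instead of modifying the original one.
    val errors = lines.filter(_.contains("ERROR"))

    sc.stop()
  }
}
```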

Types of operations performed on RDDs

The operations performed on an RDD can be one of the following two types:

Transformations

Transformations are lazy operations that define a new RDD. Some examples of transformations are map, filter, join, and union.
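As a rough sketch of this laziness (again assuming a local Spark setup), the following snippet defines several RDDs through transformations; none of these lines triggers any computation by itself:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TransformationsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("transformations").setMaster("local[*]"))

    val numbers = sc.parallelize(1 to 10)

    // Each transformation only records how the new RDD is derived from its
    // parent (its lineage); no partition is actually computed here.
    val doubled = numbers.map(_ * 2)
    val evens   = numbers.filter(_ % 2 == 0)
    val merged  = doubled.union(evens)

    // Still nothing has executed: no action has been called yet.
    sc.stop()
  }
}
```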

Actions

Actions trigger a computation to return a value to the program or write data to external storage. Some examples of actions are count, collect, reduce, and save.
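The sketch below (same assumptions as before; the output path is a placeholder) shows actions forcing the evaluation of a lazily defined RDD. In the Scala API, saving is exposed through methods such as saveAsTextFile:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ActionsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("actions").setMaster("local[*]"))

    val evens = sc.parallelize(1 to 10).filter(_ % 2 == 0) // lazy so far

    // Each action below triggers the evaluation of the lineage built above.
    val total  = evens.count()        // returns a value to the driver program
    val values = evens.collect()      // materializes the RDD as a local array
    val sum    = evens.reduce(_ + _)  // aggregates the elements
    evens.saveAsTextFile("/tmp/evens-output") // writes to external storage (placeholder path)

    println(s"count = $total, sum = $sum, values = ${values.mkString(", ")}")
    sc.stop()
  }
}
```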
