Apache Spark is a data processing system that was initially developed at the University of California, Berkeley by Zaharia et al. (M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: Cluster Computing with Working Sets," Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, 2010) and then donated to the Apache Software Foundation.

Note that Apache Spark was developed in response to some of the limitations of MapReduce, described below.

Limitations of MapReduce

The MapReduce model made it possible to develop and run embarrassingly parallel computations on a large cluster of machines. However, every job had to read its input from disk and write its output back to disk. As a result, job execution latency had a lower bound determined by disk speeds. Consequently, MapReduce was not a good fit for:

  • Iterative computations, where a single job is executed multiple times or data is passed through multiple jobs.
  • Interactive data analysis, where a user wants to run multiple ad hoc queries on the same dataset.

Note that Spark addresses both of the above use cases.

Foundation of Spark

Spark is based on the concept of Resilient Distributed Datasets (RDD).

Resilient Distributed Datasets (RDD)

An RDD is a distributed memory abstraction used to perform in-memory computations on large clusters of machines in a fault-tolerant way. More concretely, an RDD is a read-only, partitioned collection of records.

RDDs can be created through operations on data in stable storage or other RDDs.
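As a minimal illustration, the Scala sketch below (assuming a local Spark installation; the application name and file path are placeholders) creates one RDD from data in stable storage and a second RDD derived from it:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddCreationSketch {
  def main(args: Array[String]): Unit = {
    // Local Spark context; the application name is arbitrary.
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-creation").setMaster("local[*]"))

    // An RDD created from data in stable storage (the path is a placeholder).
    val lines = sc.textFile("/data/events.log")

    // An RDD created from another RDD: filter returns a new, read-only RDD
    // instead of modifying the original one.
    val errors = lines.filter(_.contains("ERROR"))

    sc.stop()
  }
}
```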

Types of operations performed on RDDs

The operations performed on an RDD can be one of the following two types:

Transformations

Transformations are lazy operations that define a new RDD. Some examples of transformations are map, filter, join, and union.
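As a rough sketch of this laziness (again assuming a local Spark setup), the following snippet defines several RDDs through transformations; none of these lines triggers any computation by itself:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TransformationsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("transformations").setMaster("local[*]"))

    val numbers = sc.parallelize(1 to 10)

    // Each transformation only records how the new RDD is derived from its
    // parent (its lineage); no partition is actually computed here.
    val doubled = numbers.map(_ * 2)
    val evens   = numbers.filter(_ % 2 == 0)
    val merged  = doubled.union(evens)

    // Still nothing has executed: no action has been called yet.
    sc.stop()
  }
}
```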

Actions

Actions trigger a computation to return a value to the program or write data to external storage. Some examples of actions are count, collect, reduce, and save.
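The sketch below (same assumptions as before; the output path is a placeholder) shows actions forcing the evaluation of a lazily defined RDD. In the Scala API, saving is exposed through methods such as saveAsTextFile:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ActionsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("actions").setMaster("local[*]"))

    val evens = sc.parallelize(1 to 10).filter(_ % 2 == 0) // lazy so far

    // Each action below triggers the evaluation of the lineage built above.
    val total  = evens.count()        // returns a value to the driver program
    val values = evens.collect()      // materializes the RDD as a local array
    val sum    = evens.reduce(_ + _)  // aggregates the elements
    evens.saveAsTextFile("/tmp/evens-output") // writes to external storage (placeholder path)

    println(s"count = $total, sum = $sum, values = ${values.mkString(", ")}")
    sc.stop()
  }
}
```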
