In this chapter, we will discuss the architecture of Apache Spark. Spark is a good example of a distributed system: many independent components cooperate to achieve one common goal.

The Spark architecture consists of several components that work together. Each component runs as one or more processes, often spread across multiple nodes. In most cases, Spark is deployed as a multi-node cluster for big data processing. We will discuss the components and show how they interact with each other.

High-level Apache Spark architecture

In order to understand the architecture, we first need to know a few Spark-specific concepts.

Resilient Distributed Dataset (RDD)

An RDD is Spark's abstraction for the data being processed. When we feed data into a Spark program, Spark reads the data and creates an RDD from it. Under the hood, Spark processes the data and produces results by operating on RDDs.

So what is an RDD? Let’s go over each part of the name.

  • Resilient: Fault-tolerant. If a node fails, the lost data can be recomputed from the RDD's lineage, the record of operations used to build it.

  • Distributed: Data is distributed among multiple nodes in a cluster.

  • Dataset: A collection of data split into partitions. The developer can control how the data is partitioned, for example by specifying the number of partitions or a partitioning key.

In short, an RDD is a fault-tolerant collection of data, split into partitions and distributed among the nodes of a cluster. Each partition is assigned to a node, and the nodes can process their partitions independently.
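As a rough illustration, here is a minimal Scala sketch of creating a partitioned RDD. It assumes a SparkContext named `sc` is already available (the driver program sketch later in this section shows where one comes from); the variable names and values are purely illustrative.

```scala
// Assumes an existing SparkContext named sc (see the driver program
// sketch in the "Driver node" section below).
// parallelize() splits the local collection into 4 partitions that
// worker nodes can process independently of each other.
val numbers = sc.parallelize(1 to 1000, numSlices = 4)
println(numbers.getNumPartitions)   // prints 4
```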

RDD is a powerful abstraction. Its greatest benefit is that developers do not need to think of an RDD as a distributed set of data at all. They can write code that looks like ordinary, sequential function calls on a local collection, and under the hood Spark distributes the data and runs the work in parallel.
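Continuing the sketch above, the following lines read like plain collection code, yet Spark executes them across the RDD's partitions in parallel:

```scala
// map() is applied to each partition independently; reduce() combines
// the partial results from the partitions into a single value.
val sumOfSquares = numbers
  .map(n => n * n)
  .reduce(_ + _)
println(sumOfSquares)
```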

Driver node

Every Spark application has a driver node. The driver node runs the driver program.

The code that developers write against the Spark API becomes the driver program, and it contains the instructions for the driver node. From these instructions, the driver node builds an execution plan and commands the other nodes in the cluster to carry out specific tasks.
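For a concrete picture, here is a minimal sketch of a driver program in Scala. The object name, application name, and data are placeholders, and `master("local[*]")` stands in for connecting to a real cluster.

```scala
import org.apache.spark.sql.SparkSession

object WordLengthApp {
  def main(args: Array[String]): Unit = {
    // Everything in main() runs on the driver node. Creating the
    // SparkSession connects the driver to the cluster (here, a local one).
    val spark = SparkSession.builder()
      .appName("word-length-app")
      .master("local[*]")   // placeholder; a real deployment would point at a cluster manager
      .getOrCreate()
    val sc = spark.sparkContext   // the SparkContext used in the sketches above

    // The driver turns these instructions into tasks and schedules them on
    // worker nodes; collect() brings the results back to the driver.
    val lengths = sc.parallelize(Seq("spark", "driver", "worker"))
      .map(word => word.length)
      .collect()
    println(lengths.mkString(", "))

    spark.stop()
  }
}
```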

We’ll go over this in more detail in the next sections.

Worker nodes

There can be anywhere from a few to a few thousand worker nodes in a Spark cluster. The worker nodes are responsible for doing all the heavy lifting on the partitions assigned to them. They simply carry out the tasks the driver sends them, which ultimately come from the instructions the developer wrote in the code.
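The amount of worker-side resources an application uses is typically requested through configuration. The keys below are standard Spark settings (how they are honored depends on the cluster manager in use); the values are purely illustrative.

```scala
import org.apache.spark.sql.SparkSession

// A sketch of requesting worker-side resources; the values are illustrative.
// The cluster manager / master URL is supplied when the application is
// submitted (see the next section).
val spark = SparkSession.builder()
  .appName("resource-config-sketch")
  .config("spark.executor.instances", "4")   // executor processes to run on the workers
  .config("spark.executor.cores", "2")       // CPU cores per executor
  .config("spark.executor.memory", "4g")     // memory per executor
  .getOrCreate()
```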

Cluster manager

The driver program works with a cluster manager, which is responsible for managing the cluster of worker nodes. Spark supports several cluster managers, including its own standalone manager, Hadoop YARN, and Kubernetes. Depending on how the system is configured, the cluster manager may run on the driver node or on a separate node.
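The choice of cluster manager surfaces in the application as the master setting, either in code or when the application is submitted. Here is a sketch of the common forms, with placeholder host names and ports:

```scala
import org.apache.spark.sql.SparkSession

// The master URL tells the driver which cluster manager to contact.
// Host names and ports below are placeholders.
val builder = SparkSession.builder().appName("cluster-manager-sketch")

builder.master("local[*]")                       // no cluster manager: run in a single local JVM
// builder.master("spark://manager-host:7077")   // Spark's standalone cluster manager
// builder.master("yarn")                        // Hadoop YARN
// builder.master("k8s://https://k8s-host:6443") // Kubernetes
```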

How components interact in Spark
