Apache Spark architecture

About Apache Spark

Apache Spark is a distributed, open-source, general-purpose cluster-computing framework. It is one of the largest open-source projects in data processing. Spark is designed for fast, in-memory processing and comes packaged with higher-level libraries for SQL queries, streaming data, machine learning, and graph processing.

Let’s observe how Spark does its job by looking at its architecture.

Abstractions

Before diving into the actual architecture of Spark, it is important to note the two most significant abstractions that Spark uses for data management.

1. Resilient Distributed Datasets (RDDs)

An RDD is a collection of records, split into partitions, that executors use for computations. The records stored in an RDD can be objects of different languages, including Python, Scala, or Java. These datasets are immutable and:

Resilient: fault-tolerant
Distributed: spread over multiple nodes
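
The three properties above can be illustrated with a small toy model in plain Python. This is only a sketch under stated assumptions: ToyRDD, parallelize, and recompute_partition are hypothetical names invented for this example and are not Spark's API, and the toy computes results eagerly, whereas Spark evaluates transformations lazily.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class ToyRDD:
    """Hypothetical stand-in for an RDD; not part of Spark's API."""
    partitions: Tuple[Tuple[int, ...], ...]      # records split into chunks, one per node
    parent: Optional["ToyRDD"] = None            # lineage: the dataset this one came from
    func: Optional[Callable[[int], int]] = None  # lineage: the function that was applied

    def map(self, func: Callable[[int], int]) -> "ToyRDD":
        # Immutable: a transformation builds a new dataset instead of editing this one.
        new_parts = tuple(tuple(func(x) for x in part) for part in self.partitions)
        return ToyRDD(new_parts, parent=self, func=func)

    def recompute_partition(self, index: int) -> Tuple[int, ...]:
        # Resilient: a lost partition is rebuilt from the parent and the recorded function.
        if self.parent is None or self.func is None:
            raise ValueError("a source dataset has no lineage to recompute from")
        return tuple(self.func(x) for x in self.parent.partitions[index])


def parallelize(data, num_partitions):
    # Distributed: split the records into roughly equal chunks.
    items = list(data)
    return ToyRDD(tuple(tuple(items[i::num_partitions]) for i in range(num_partitions)))


base = parallelize(range(1, 7), num_partitions=3)
doubled = base.map(lambda x: x * 2)

print(base.partitions)                 # ((1, 4), (2, 5), (3, 6))
print(doubled.recompute_partition(1))  # (4, 10)
```

The frozen dataclass rejects attribute assignment, which mirrors immutability, and recompute_partition shows why keeping the lineage is enough to recover a lost partition without replicating the data.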

2. Directed Acyclic Graph (DAG)

The DAG allows Spark to arrange tasks into a sequence of steps. A DAG is a set of vertices and edges, where the vertices represent the RDDs and the edges represent the operations to be applied to them. In a Spark DAG, every edge points from an earlier step in the sequence to a later one, so the graph never loops back on itself.
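
As a rough illustration of the idea, the sketch below models a lineage graph with Python's standard graphlib module (Python 3.9 or later). The dataset names and the word-count shape of the job are assumptions made up for this example, and the helper functions are hypothetical; this is not how Spark's own DAGScheduler is implemented.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a dataset (vertex); its set holds the datasets it is derived from.
# The names are hypothetical and stand in for the RDDs of a word-count job.
lineage = {
    "lines": set(),
    "words": {"lines"},    # edge: lines --flatMap--> words
    "pairs": {"words"},    # edge: words --map--> pairs
    "counts": {"pairs"},   # edge: pairs --reduceByKey--> counts
}


def execution_order(graph):
    # A topological sort lists every dataset after the datasets it depends on.
    return list(TopologicalSorter(graph).static_order())


def is_acyclic(graph):
    # The sort only succeeds when no edge leads back to an earlier step.
    try:
        execution_order(graph)
        return True
    except CycleError:
        return False


order = execution_order(lineage)
print(order)  # ['lines', 'words', 'pairs', 'counts']
```

Adding an edge from "counts" back to "lines" would make is_acyclic return False, which is exactly the situation the "acyclic" requirement rules out.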

Spark uses a master/slave architecture, which depends on two daemons:

Master Node
Worker Nodes

A Cluster Manager binds these daemons together.

Master Node

The Master Node is the hub of management for Spark. It runs the Driver Program, whose main() function creates the SparkContext. The SparkContext is your entry point and gives you access to all Spark functionality. The Driver Program contains various components such as the DAGScheduler, TaskScheduler, SchedulerBackend, and BlockManager. It communicates with the Cluster Manager and schedules tasks for the different processes. A job is split into multiple tasks that are then distributed over the worker nodes. Any time an RDD is created in the SparkContext, it can be distributed across various nodes and cached there.
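
To make the "job is split into tasks and distributed" step concrete, here is a minimal sketch in plain Python. The function names, the task naming scheme, and the round-robin policy are all assumptions chosen for illustration; Spark's real scheduler also weighs factors such as data locality and available cores.

```python
from collections import defaultdict


def split_into_tasks(job_name, num_partitions):
    # One task per partition of the data the job operates on.
    return [f"{job_name}-task-{i}" for i in range(num_partitions)]


def assign_round_robin(tasks, workers):
    # Hand tasks out in turn; this is a simplification of real scheduling.
    assignment = defaultdict(list)
    for i, task in enumerate(tasks):
        assignment[workers[i % len(workers)]].append(task)
    return dict(assignment)


tasks = split_into_tasks("job0", num_partitions=5)
plan = assign_round_robin(tasks, ["worker-a", "worker-b"])

for worker, assigned in plan.items():
    print(worker, assigned)
# worker-a ['job0-task-0', 'job0-task-2', 'job0-task-4']
# worker-b ['job0-task-1', 'job0-task-3']
```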

Worker Node

Worker Nodes execute the tasks scheduled to them by the Master Node. There is only one Master Node, but there are multiple Worker Nodes. The Executors on each Worker Node carry out all the computation on the RDD partitions, and the results are returned to the SparkContext.

It is important to note that you can increase the number of workers, which allows you to divide jobs into more partitions and execute them in parallel over multiple systems. As the number of workers increases, the total memory also increases, which allows you to cache more data and execute jobs faster.
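
The relationship between workers, partitions, and results can be sketched with Python's standard thread pool. This is a toy model, not Spark code: threads stand in for executors, and run_job and sum_partition are hypothetical helpers written for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def sum_partition(partition):
    # The work one executor does on one partition.
    return sum(partition)


def run_job(data, num_workers):
    # More workers means the same data is divided into more partitions.
    partitions = [data[i::num_workers] for i in range(num_workers)]
    # Each partition is processed in parallel by its own worker thread.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_sums = list(pool.map(sum_partition, partitions))
    # The partial results come back to the caller, which plays the driver's role.
    return partial_sums, sum(partial_sums)


data = list(range(1, 101))
partials_2, total_2 = run_job(data, num_workers=2)
partials_4, total_4 = run_job(data, num_workers=4)

print(partials_2, total_2)  # [2500, 2550] 5050
print(len(partials_4), total_4)  # 4 5050
```

The final answer is the same for either worker count; only the number of partitions, and therefore how much work can run at once, changes.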