Apache Spark architecture

About Apache Spark

Apache Spark is a distributed, open-source, general-purpose, cluster-computing framework and one of the most active open-source projects in data processing. Spark delivers excellent performance and comes packaged with higher-level libraries, including support for SQL queries, streaming data, machine learning, and graph processing.

Let’s observe how Spark does its job by looking at its architecture.

Abstractions

Before diving into the actual architecture of Spark, it is important to note the two most significant abstractions that Spark uses for data management.

1. Resilient Distributed Datasets (RDDs)

An RDD is a partitioned collection of records that executors use for computation; a minimal example follows the list below. The records stored in an RDD can be objects from any of Spark's supported languages, including Python, Scala, and Java. These datasets are immutable and:

  • Resilient: fault-tolerant
  • Distributed: spread over multiple nodes
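
The sketch below is a minimal PySpark illustration of these properties, assuming a local Spark installation; the app name, data, and partition count are arbitrary. It distributes an in-memory collection across partitions and shows that transformations return new RDDs rather than modifying the original.

```python
from pyspark import SparkContext

# A minimal sketch, assuming a local Spark installation.
# "local[2]" runs Spark with two worker threads on this machine.
sc = SparkContext("local[2]", "rdd-example")

# Distribute an in-memory collection across 2 partitions.
numbers = sc.parallelize([1, 2, 3, 4, 5], numSlices=2)

# RDDs are immutable: map() does not change `numbers`, it returns a new RDD.
squares = numbers.map(lambda x: x * x)

print(squares.collect())  # [1, 4, 9, 16, 25]
sc.stop()
```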

2. Directed Acyclic Graph (DAG)

A DAG lets Spark organize the work of a job as a sequence of tasks. It is a set of vertices and edges, where the vertices represent RDDs and the edges represent the operations applied to them. Because the graph is directed and acyclic, every edge points forward in the sequence, so execution never loops back.
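
As a hedged sketch of how the DAG is built, the PySpark snippet below chains two transformations, which only record lineage, and then triggers execution with an action; toDebugString() prints the lineage that Spark turns into the DAG. Names and data are illustrative.

```python
from pyspark import SparkContext

# A minimal sketch of how transformations build up a DAG lazily.
sc = SparkContext("local[2]", "dag-example")

rdd = sc.parallelize(range(10))

# Each transformation adds to the lineage graph; nothing executes yet.
evens = rdd.filter(lambda x: x % 2 == 0)
doubled = evens.map(lambda x: x * 2)

# toDebugString() shows the RDD lineage the DAG is built from.
print(doubled.toDebugString().decode())

# Only an action (here, reduce) makes Spark turn the DAG into stages and tasks.
total = doubled.reduce(lambda a, b: a + b)
print(total)  # 40

sc.stop()
```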

The architecture

[Figure: Apache Spark architecture]

Spark uses a master/worker architecture, which depends on two daemons:

  1. Master Node
  2. Worker Nodes

A Cluster Manager binds these daemons together.
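
The snippet below is a rough sketch of how an application chooses its cluster manager: the master URL passed to Spark decides who allocates resources. The app name is arbitrary, and "spark://master-host:7077" is a hypothetical standalone master, not a real endpoint.

```python
from pyspark import SparkConf, SparkContext

# A sketch of how an application selects its cluster manager via the master URL.
# "local[*]" runs everything in this process; a real deployment would point at a
# cluster manager instead, e.g. a hypothetical standalone master
# "spark://master-host:7077", or "yarn" on a Hadoop cluster.
conf = (
    SparkConf()
    .setAppName("architecture-demo")
    .setMaster("local[*]")
)

sc = SparkContext(conf=conf)
print(sc.master)  # shows which master / cluster manager this app is bound to
sc.stop()
```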

Master Node

The Master Node is the management hub of Spark. It runs the application's main() through the SparkContext, which is the entry point to all of Spark's functionality. The Driver Program contains components such as the DAGScheduler, TaskScheduler, BackendScheduler, and BlockManager. It communicates with the Cluster Manager and schedules tasks for the different processes. A job is split into multiple tasks, which are then distributed over the worker nodes. Whenever an RDD is created in the SparkContext, it can be distributed across various nodes and cached there.
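
The following sketch shows the driver side of this picture: main() creates a SparkContext, and a single action becomes a job whose tasks, one per partition here, are scheduled onto the workers. The app name, partition count, and data are illustrative, and local threads stand in for real worker nodes.

```python
from pyspark import SparkContext

# A minimal driver program: main() creates a SparkContext, which is the entry
# point to Spark's functionality and schedules work on the workers.
if __name__ == "__main__":
    sc = SparkContext("local[4]", "driver-example")

    # One job: the count() action below is split into one task per partition,
    # and those tasks are distributed over the workers (here, local threads).
    rdd = sc.parallelize(range(1000), numSlices=8)
    print("partitions:", rdd.getNumPartitions())  # 8 -> 8 tasks for this stage
    print("count:", rdd.count())

    sc.stop()
```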

Worker Node

Worker Nodes handle the execution of the tasks scheduled to them by the Master Node. There is only one Master Node, but there can be many Worker Nodes. The Executors running in the Worker Nodes carry out all the computation on the RDD partitions, and the results are returned to the SparkContext.

It is important to note that you can increase the number of workers, which lets you divide jobs into more partitions and execute them in parallel over multiple systems. Adding workers also increases the total memory available to the cluster, which allows you to cache more data and execute jobs faster.
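
The hedged sketch below illustrates that caching behavior: cache() asks the executors to keep computed partitions in memory, so later actions reuse them instead of recomputing the whole lineage. Numbers and names are illustrative.

```python
from pyspark import SparkContext

# A sketch of caching: with more workers there is more aggregate memory,
# so executors can keep RDD partitions cached and reuse them across actions.
sc = SparkContext("local[4]", "cache-example")

rdd = sc.parallelize(range(1_000_000), numSlices=8).map(lambda x: x * x)

rdd.cache()          # ask executors to keep the computed partitions in memory
print(rdd.count())   # first action: computes and caches the partitions
print(rdd.sum())     # second action: served from the cached partitions

sc.stop()
```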
