High-level Design of Spark
Understand the high-level design of Spark by exploring its key components: resilient distributed datasets (RDDs), the driver node, and worker nodes. Learn how Spark optimizes data processing through in-memory computation, task scheduling, and the use of lineage graphs to enhance scalability and fault tolerance.
Building blocks
The building blocks of Spark include resilient distributed datasets (RDDs), the driver node, and worker nodes. This lesson briefly describes each of these components.
Resilient distributed datasets (RDDs)
An RDD is an abstraction: a read-only, fault-tolerant collection of objects partitioned across a cluster of machines.
RDDs can be created in two ways: by applying a transformation to an existing RDD or by reading data from a distributed file system.
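As a concrete illustration, here is a minimal Scala sketch of both creation paths. The application name and the input path are hypothetical, and the local master setting is only for experimentation.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddCreationSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-creation-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Way 1: create an RDD by reading data from a (distributed) file system.
    // "hdfs:///data/people.txt" is a hypothetical path.
    val lines = sc.textFile("hdfs:///data/people.txt")

    // Way 2: create an RDD by applying a transformation to an existing RDD.
    val upper = lines.map(_.toUpperCase)

    println(upper.count()) // an action that forces evaluation

    sc.stop()
  }
}
```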
Every RDD is divided into partitions of data, and those partitions are distributed across the machines of the cluster.
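Partitioning can be observed directly. The sketch below, which assumes a `SparkContext` named `sc` as in the previous example, creates an RDD with an explicit partition count; `numSlices` is the name of that argument in Spark's Scala API.

```scala
// Assuming an existing SparkContext `sc` as in the previous sketch.
val numbers = sc.parallelize(1 to 1000, numSlices = 8)
println(numbers.getNumPartitions) // 8: each partition lives on some worker node
```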
For example, suppose an RDD is initially created from a file, a second RDD is then derived from it by a transformation, and so on.
Spark keeps a graph, called the lineage graph, that records how each RDD was derived from its sources. If a partition is lost, Spark can use the lineage graph to recompute just that partition rather than replicating the entire dataset.
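Spark lets us inspect this lineage: calling `toDebugString` on an RDD prints the chain of RDDs it was derived from. Continuing the earlier sketch (the exact output format varies by Spark version):

```scala
// Assuming `upper` from the earlier sketch.
val longLines = upper.filter(_.length > 10)
// Prints the lineage: longLines <- upper <- lines <- the input file.
println(longLines.toDebugString)
```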
RDDs implement a common interface that exposes the following details (sketched in code after this list):
A list of partition objects, each containing its own set of data
An iterator that traverses the data in a partition
A list of worker nodes where the partition's data can be accessed quickly (its preferred locations)
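The following is a simplified Scala sketch of such an interface. It is illustrative only; Spark's real internal `RDD` class uses different, more elaborate signatures.

```scala
// Hypothetical, simplified stand-ins; not Spark's actual internals.
final case class Partition(index: Int)

trait SimpleRDD[T] {
  // Partition objects, each describing one chunk of the dataset.
  def partitions: Seq[Partition]

  // An iterator over the data held in one partition.
  def compute(split: Partition): Iterator[T]

  // Worker nodes where this partition's data can be accessed quickly.
  def preferredLocations(split: Partition): Seq[String]
}
```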