High-level Design of Spark

Building blocks

The building blocks of Spark include resilient distributed datasets (RDDs), the driver, and worker nodes. This lesson briefly describes each of these components.

Resilient distributed datasets (RDDs)

  • An RDD is an abstraction: a read-only collection of objects, stored across a cluster of machines, that can be rebuilt if part of it is lost.

  • RDDs can be created in two ways: by applying a transformation to an existing RDD or by reading data from a distributed file system.

    • Every RDD is divided into partitions of data when it is created.

    • Those partitions are stored across the machines in the cluster.

  • For example, suppose an RDD is first created from a file, a second RDD is then derived from that RDD, and so on.

  • Spark keeps a graph, called the lineage graph, that records the sources of all the RDDs. If a partition is lost, Spark can use this graph to recompute it from its sources.
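
The ideas above (read-only data, partitions, transformations, and lineage) can be sketched with a small toy model. This is not Spark's API: the ToyRDD class and its from_lines, map, filter, lineage, and collect methods are hypothetical names invented for illustration. The sketch also simplifies in two ways: it computes results eagerly on one machine, and it tracks a single parent per RDD, so its lineage is a chain rather than a general graph.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen: instances are read-only, like an RDD
class ToyRDD:
    """Hypothetical stand-in for an RDD: partitions plus a link to its parent."""
    name: str
    partitions: tuple                   # tuple of tuples; each inner tuple is one partition
    parent: Optional["ToyRDD"] = None   # the RDD this one was derived from, if any
    op: str = "source"                  # how this RDD was produced

    @staticmethod
    def from_lines(name, lines, num_partitions):
        # Stand-in for reading a file: spread the records round-robin over partitions.
        parts = tuple(tuple(lines[i::num_partitions]) for i in range(num_partitions))
        return ToyRDD(name, parts, None, "read from file")

    def map(self, name, fn):
        # A transformation never changes self; it returns a new RDD that remembers its parent.
        parts = tuple(tuple(fn(x) for x in p) for p in self.partitions)
        return ToyRDD(name, parts, self, "map")

    def filter(self, name, pred):
        parts = tuple(tuple(x for x in p if pred(x)) for p in self.partitions)
        return ToyRDD(name, parts, self, "filter")

    def lineage(self):
        # Walk the parent links back to the original source.
        chain, node = [], self
        while node is not None:
            chain.append(f"{node.name} ({node.op})")
            node = node.parent
        return chain

    def collect(self):
        # Gather the records of all partitions into one list.
        return [x for p in self.partitions for x in p]


lines = ToyRDD.from_lines("lines", ["a b", "c", "d e f", "g"], num_partitions=2)
lengths = lines.map("lengths", len)
long_ones = lengths.filter("long_ones", lambda n: n > 1)

print(long_ones.collect())
print(long_ones.lineage())
```

Because every derived ToyRDD keeps a reference to its parent and the operation that produced it, any of its partitions could be recomputed by replaying those operations from the source, which is the role the lineage graph plays in Spark.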
