DAG of Stages in Apache Spark

Understand how Apache Spark builds and executes a DAG of stages for distributed data processing. Learn about narrow and wide dependencies, task scheduling based on data locality, fault tolerance mechanisms, and the role of checkpointing for faster recovery in large-scale cluster computing.

As explained in the previous lesson, the driver examines the lineage graph of the application code and builds a DAG (Directed Acyclic Graph) of stages to execute.

DAG scheduler of stages

The following illustration shows a DAG of stages:

  • Each stage pipelines as many transformations with narrow (one-to-one) dependencies as possible.
  • The boundaries of each stage correspond to
...
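
To make the idea of pipelining narrow transformations into a stage concrete, here is a minimal sketch in plain Python. It is not Spark's scheduler: the `Transformation` class and `split_into_stages` function are hypothetical names introduced only for illustration, and the lineage is assumed to be a simple linear chain rather than a general DAG. The sketch also assumes that a wide (shuffle) dependency starts a new stage, which is how Spark's DAG scheduler is generally described.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transformation:
    """Hypothetical stand-in for one step in an RDD lineage."""
    name: str
    wide: bool  # True if the step needs a shuffle (wide dependency)


def split_into_stages(lineage):
    """Group a linear lineage into stages.

    Narrow transformations are appended to the current stage (pipelined);
    a wide transformation closes the current stage and opens a new one.
    """
    stages = [[]]
    for step in lineage:
        # Start a new stage at a wide dependency, unless the current
        # stage is still empty (nothing to cut off yet).
        if step.wide and stages[-1]:
            stages.append([])
        stages[-1].append(step.name)
    return stages


# Illustrative lineage: two narrow steps, one shuffle, one more narrow step.
lineage = [
    Transformation("map", wide=False),
    Transformation("filter", wide=False),
    Transformation("reduceByKey", wide=True),
    Transformation("mapValues", wide=False),
]

stages = split_into_stages(lineage)
for index, stage in enumerate(stages):
    print(f"Stage {index}: {' -> '.join(stage)}")
```

In this toy lineage, `map` and `filter` land in one stage because each output partition depends on a single input partition, so they can run back to back on the same data without moving it across the cluster. The shuffle required by `reduceByKey` opens a second stage, and `mapValues` is pipelined behind it.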