Anatomy of a Spark Application

Explore the detailed structure of a Spark application by learning how it executes jobs, stages, and tasks in parallel. Understand the difference between transformations and actions, Spark's optimization strategies like pipelining and shuffle persistence, and how the Spark Web UI helps monitor job execution. This lesson equips you to grasp Spark's core execution workflow and performance features in Big Data environments.

Anatomy of a Spark Application

In this lesson, we’ll formally look at the various components of a Spark job. A Spark application consists of one or more jobs, and a Spark job, unlike a MapReduce job, is much broader in scope. Each job is made up of a directed acyclic graph (DAG) of stages. A stage is roughly equivalent to a map or reduce phase in MapReduce. The Spark runtime splits each stage into tasks, which execute in parallel on the partitions of an RDD across the cluster. The relationship among these various concepts is depicted below:
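To see these pieces at runtime, here is a minimal sketch (assuming a spark-shell session, where the SparkSession is predefined as spark and the column names are illustrative). The wide groupBy transformation introduces a shuffle boundary, so the single job triggered by the action is split into two stages, and each stage runs one task per partition:

```scala
// A minimal sketch, assuming a spark-shell session where `spark` is predefined.
import org.apache.spark.sql.functions.col

// A DataFrame of one million integers spread across 8 partitions.
val df = spark.range(0, 1000000, 1, 8)

// groupBy is a wide transformation: it introduces a shuffle boundary, so the job
// triggered by the action below is split into two stages.
val buckets = df.groupBy((col("id") % 10).as("bucket")).count()

// count() is an action: it submits one job; each stage runs one task per partition.
buckets.count()
```

Running this in the spark-shell and opening the Spark Web UI would show one completed job broken into two stages, with eight tasks in the first stage, one per partition.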

A single Spark application can run one or more Spark jobs, either serially or in parallel. RDDs cached by one job can be made available to a second job without any disk I/O in between, which makes certain computations extremely fast. A job always executes in the context of a Spark application. The spark-shell is itself an instance of a Spark application.
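For instance (again a minimal sketch, assuming a spark-shell session where spark is predefined), caching a DataFrame lets a second action reuse the in-memory result instead of recomputing it or reading it back from disk:

```scala
// A minimal sketch, assuming a spark-shell session where `spark` is predefined.
val squares = spark.range(0, 10000).selectExpr("id * id AS square")

// Mark the result for in-memory caching; it is materialized by the first action.
squares.cache()

squares.count()  // first job: computes the squares and fills the cache
squares.count()  // second job: served from the cached partitions, no recomputation
```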

Let’s see an example to better understand jobs, stages and tasks. Consider the example below; it creates two DataFrames, each consisting of integers from 0 to 9. Next, we transform one of the DataFrames to consist of multiples of 3 by multiplying each ...
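The full example continues beyond this excerpt; as a rough sketch of only the setup described so far (assuming a spark-shell session with spark predefined, and illustrative variable names), it might look like this:

```scala
// Hypothetical sketch of the setup described above; names are illustrative.
val df1 = spark.range(0, 10).toDF("value")   // integers 0 through 9
val df2 = spark.range(0, 10).toDF("value")   // a second DataFrame with the same values

// Transform one DataFrame into multiples of 3 by multiplying each value by 3.
val multiplesOfThree = df1.selectExpr("value * 3 AS value")
```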