Spark Application Lifecycle

Learn the various stages through which a Spark application goes through when executed by the Spark processing engine.

We'll cover the following

Application life cycle

In this lesson, we'll track the various phases a Spark application goes through from the time it is submitted to when it is marked as completed.

  1. As a user, the first step is to submit the Spark job to the cluster. Usually, this involves the user running the spark-submit command in a terminal window. The command spawns a process that talks to the cluster manager. If YARN is being used as the cluster management software, then the client process connects to the Resource Manager (RM) daemon. If the job is accepted, the RM will create the Spark driver process on one of the machines in the cluster.

  2. Once the driver process starts running, it executes the user code. The code must establish a SparkSession which in turn sets up the Spark cluster. The driver process and the executor processes are collectively referred to as the Spark cluster. We’ll study SparkSession in-depth in upcoming lessons, but for now it is sufficient to know that SparkSession is a unified single point of entry to interact with underlying Spark functionality and allows for programming Spark with DataFrame and Dataset APIs. The SparkSession will talk to the cluster manager daemon, which will be RM in our case, to launch Spark executor processes on worker nodes.

  3. The RM (cluster manager) will launch Spark executor processes on nodes across the cluster and return the location of executor processes to the driver process. The Spark cluster is setup at this point and the driver can communicate directly with the executor processes.

  4. Once the Spark cluster is set up, the driver assigns tasks to executor processes and job execution begins. Data may be moved around and executors report on their status to the driver.

  5. The driver exits when the Spark job completes and the cluster manager shuts down the executor processes on behalf of the driver. The cluster manager can be queried for the success or failure of the job.

Demonstration

Now that we have learned more about Spark, let’s see how it works in practice with an example. In this course, we’ll be working with datasets related to Bollywood movies. The dataset can be downloaded for free from Kaggle. Execute the commands listed in the widget below and observe the results.

Get hands-on with 1000+ tech skills courses.