Spark has come to dominate the big data processing space in a short span of time since its release and now serves as the de-facto unified big data processing engine in the industry. 

In this course, you will get a complete introduction to the basics of Spark. You will start by learning about the architecture, the application lifecycle, and its API.

From there, you will dive into the DataFrame data structure and its API, as well as the strongly typed Dataset API. Lastly, you'll get into the Spark SQL engine, which allows you to issue queries on structured data that has a schema.

By the end of this course, you will have the confidence to use Spark in any of your big data projects.

An Introduction to Spark

## Spark design 

Spark is a distributed parallel data-processing framework and bears many similarities to the traditional MapReduce framework. Spark has the same leader-worker architecture as MapReduce: a leader process coordinates the work and distributes it among worker processes. These two kinds of processes are formally called the driver and the executor.
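
The leader-worker split described above can be illustrated with a small toy in plain Python. This is only a sketch of the idea, not Spark code: the names `run_job` and `count_words` are hypothetical, and threads stand in for the separate worker processes that a real cluster would use.

```python
from concurrent.futures import ThreadPoolExecutor


def count_words(partition):
    """Work done by one worker on its own slice of the data."""
    return sum(len(line.split()) for line in partition)


def run_job(lines, num_workers=3):
    """Leader: split the input, give each worker one partition, combine the results."""
    # Round-robin partitioning keeps the toy simple (an assumption, not Spark's scheme).
    partitions = [lines[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_counts = list(pool.map(count_words, partitions))
    # The leader merges the partial results reported back by the workers.
    return sum(partial_counts)


lines = ["spark is fast", "the driver schedules work", "executors run tasks"]
print(run_job(lines))  # 10
```

The leader never touches individual lines itself. It only decides how the data is split, hands out the pieces, and merges what comes back, which is the role the driver plays for the executors.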
 

## Driver
The driver is the leader process that manages the execution of a Spark job. It maintains the overall state of the Spark application, responds to the user's program or input, and analyzes, distributes, and schedules work among the executor processes. The driver is in essence the heart of the Spark application and holds all application-related information for the application's lifetime.

The Spark driver converts Spark operations into DAG (directed acyclic graph) computations, then schedules and distributes them as tasks across the Spark executors. The driver accesses the distributed components in the cluster, including the executors and the cluster manager, via the ***SparkSession***. You can think of the SparkSession as a single point of entry to all Spark operations and data. Through the SparkSession we can read from data sources, write DataFrames or Datasets, set runtime configuration parameters, and so on. In essence, the SparkSession is the unified conduit to all of Spark's functionality. In the interactive spark-shell, the Spark driver instantiates the SparkSession for us, whereas in a Spark application we create the SparkSession ourselves. We'll look at examples of both in the lessons ahead.
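
As a rough illustration of creating the SparkSession ourselves in an application, here is a minimal PySpark sketch. The function names `build_session` and `demo`, the application name, the `local[*]` master, and the configuration values are all assumptions chosen for the example; a real deployment would point the master at its own cluster manager.

```python
def build_session(app_name="IntroToSpark", master="local[*]"):
    """Create (or reuse) a SparkSession, as a standalone application would."""
    # Imported here so this module still loads where PySpark is not installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .master(master)  # local mode: everything runs on this machine
        .config("spark.sql.shuffle.partitions", "4")  # illustrative value
        .getOrCreate()
    )


def demo():
    """Use the session as the entry point for configuration and data."""
    spark = build_session()
    try:
        # Runtime configuration is read and set through the session.
        spark.conf.set("spark.sql.shuffle.partitions", "8")
        # A tiny DataFrame defined on the driver and counted by the executors.
        numbers = spark.range(1, 6)
        return numbers.count()
    finally:
        spark.stop()
```

Calling `demo()` requires a working PySpark installation. In the interactive shell none of this setup is needed, because the session is already available there as the variable `spark`.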


## Executor
Executors are the worker processes that run the code the driver assigns to them and report the state of their computation back to the driver. Once resources have been allocated, the driver communicates with the executors directly. In most deployment modes a single executor runs per node. Wherever possible, executors are assigned tasks that work on the subset of data located closest to them in the cluster. Working on nearby data is referred to as ***data locality*** and helps reduce network bandwidth consumption.
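
To make data locality concrete, the following toy in plain Python prefers an executor on the host that already stores a partition and falls back to any executor otherwise. It is a simplified sketch, not Spark's scheduler: `assign_tasks`, the host names, and the executor ids are hypothetical, and the real scheduler weighs more factors than this.

```python
def assign_tasks(partition_hosts, executor_hosts):
    """Map each partition to an executor, preferring one on the data's host.

    partition_hosts: dict of partition id -> host that stores the partition
    executor_hosts:  dict of executor id -> host the executor runs on
    Returns a dict of partition id -> (executor id, "local" or "remote").
    """
    if not executor_hosts:
        raise ValueError("at least one executor is required")

    # Index the executors by the host they run on.
    executors_by_host = {}
    for executor_id, host in executor_hosts.items():
        executors_by_host.setdefault(host, []).append(executor_id)

    all_executors = sorted(executor_hosts)
    assignments = {}
    for index, (partition, host) in enumerate(sorted(partition_hosts.items())):
        local_executors = executors_by_host.get(host)
        if local_executors:
            # The data is already on this host, so nothing crosses the network.
            assignments[partition] = (local_executors[0], "local")
        else:
            # No executor next to the data: pick one round-robin and read remotely.
            fallback = all_executors[index % len(all_executors)]
            assignments[partition] = (fallback, "remote")
    return assignments


partition_hosts = {"p0": "node-a", "p1": "node-b", "p2": "node-c"}
executor_hosts = {"exec-1": "node-a", "exec-2": "node-b"}
plan = assign_tasks(partition_hosts, executor_hosts)
print(plan["p0"])  # ('exec-1', 'local')
print(plan["p2"])  # ('exec-1', 'remote')
```

Only `p2` has to be read over the network, because no executor runs on `node-c`. The other two partitions are processed where they already live.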
