Introduction to Spark

We start the course by looking at Spark fundamentals and then move on to a high-level overview of its architecture. Both topics give us a clear picture of the platform's nature and its main components.

The introduction explains why Spark is natively designed to work in a parallel and distributed manner. These concepts will be used later to understand the structure of a Spark application and how it runs on a cluster.

The Java API and the DataFrame

After the introduction, we use the Spark Java API to work with the central abstraction that represents a distributed collection of data: the DataFrame.

Through code snippets and their analysis, we learn how this abstraction simplifies data processing and encapsulates parallel execution.
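As a taste of what this looks like in practice, here is a minimal sketch of creating a SparkSession and loading a DataFrame through the Java API; the application name, the input file, and its layout are hypothetical.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DataFrameIntro {
    public static void main(String[] args) {
        // The SparkSession is the entry point to the DataFrame API
        SparkSession spark = SparkSession.builder()
                .appName("DataFrameIntro")
                .master("local[*]")              // run locally for this sketch
                .getOrCreate();

        // In the Java API a DataFrame is a Dataset<Row>;
        // "people.csv" is a hypothetical input file with a header row
        Dataset<Row> people = spark.read()
                .option("header", "true")
                .csv("people.csv");

        people.printSchema();   // inspect the inferred schema
        people.show(5);         // display the first five rows

        spark.stop();
    }
}
```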

DataFrame basic operations

Once acquainted with the DataFrame abstraction, we learn about the two types of operations we can apply to it: transformations, which change the data, structure, or schema of a DataFrame, and actions, which return results from it.

We also touch on some theory about the internals of Spark execution and the nature of both operation types: transformations are evaluated lazily, while actions trigger the actual computation.
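As a brief illustration, here is a sketch that builds on the previous snippet; the column names are hypothetical, and the imports shown would sit at the top of the class.

```java
// Assumes the "people" DataFrame and "spark" session from the earlier sketch
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.col;

// Transformations: lazily describe a new DataFrame, nothing runs yet
Dataset<Row> adults = people
        .filter(col("age").geq(18))   // keep rows where age >= 18
        .select("name", "age");       // project two columns

// Actions: trigger the distributed computation and return results
long howMany = adults.count();        // brings a number back to the driver
adults.show();                        // prints the first rows to the console
```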

Advanced DataFrame operations

Next, the course covers advanced DataFrame operations, highlights their use cases, and explains the concept of data shuffling.

This part of the course also explains accumulators and the other shared-variable constructs that are available in Spark to share information in a distributed runtime environment.

To give an idea of practical applications, we teach these somewhat advanced topics as solutions to fictitious problems. To ease understanding, the problems are phrased as business requirements in human-readable terms.
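In that spirit, here is a minimal sketch of the kind of constructs involved; the orders DataFrame, its columns, and the accumulator name are hypothetical, and the imports would sit at the top of the class.

```java
// Assumes the "spark" session from the earlier sketches and a hypothetical
// "orders" DataFrame with "country" and "amount" columns
import org.apache.spark.api.java.function.ForeachFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.util.LongAccumulator;
import static org.apache.spark.sql.functions.sum;

// A wide transformation: groupBy causes a shuffle, moving rows with the same
// key across the cluster so they can be aggregated together
Dataset<Row> totals = orders
        .groupBy("country")
        .agg(sum("amount").alias("total_amount"));

// An accumulator: a shared variable that executors add to and the driver
// reads back, here counting rows with a missing amount
LongAccumulator missingAmounts =
        spark.sparkContext().longAccumulator("missingAmounts");

orders.foreach((ForeachFunction<Row>) row -> {
    if (row.isNullAt(row.fieldIndex("amount"))) {
        missingAmounts.add(1);
    }
});

System.out.println("Rows with a missing amount: " + missingAmounts.value());
```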

Input, output, and SparkSQL

Ingesting and producing information are cornerstone operations of almost every application. Spark, too, comes with out-of-the-box features to process both data input and output, and we learn these in detail.

SparkSQL is a component that allows us to query data as we would in any relational database. This is another handy feature that we learn about.
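To give a flavour of both, here is a minimal sketch that reads a file, queries it with SparkSQL, and writes the result; the file paths, the view name, and the columns are hypothetical.

```java
// Assumes the "spark" session from the earlier sketches
Dataset<Row> sales = spark.read()
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("data/sales.csv");                  // input

sales.createOrReplaceTempView("sales");          // expose the DataFrame to SQL

Dataset<Row> topCountries = spark.sql(
        "SELECT country, SUM(amount) AS total "
      + "FROM sales GROUP BY country ORDER BY total DESC");

topCountries.write()
        .mode("overwrite")
        .parquet("output/top_countries");        // output
```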

Building a big data batch application

The batch application template that we work with in this course binds all the previous parts together. It offers a meaningful understanding of how all the concepts, tools, and techniques we learn work together in real-life scenarios.

We analyze the design, architecture, and implementation of batch jobs, the core components of the batch application template that act as the main processing units of batch processing solutions.

We close this chapter by learning how to test Spark code.
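As a preview, here is a minimal sketch of what such a test might look like, assuming JUnit 5 and a simple filtering step as the hypothetical logic under test.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructType;
import org.junit.jupiter.api.AfterAll;
import org.junit.jupiter.api.BeforeAll;
import org.junit.jupiter.api.Test;

import java.util.Arrays;

import static org.apache.spark.sql.functions.col;
import static org.junit.jupiter.api.Assertions.assertEquals;

class AdultFilterTest {

    private static SparkSession spark;

    @BeforeAll
    static void startSpark() {
        // A local session is enough to exercise DataFrame logic in tests
        spark = SparkSession.builder()
                .appName("AdultFilterTest")
                .master("local[2]")
                .getOrCreate();
    }

    @AfterAll
    static void stopSpark() {
        spark.stop();
    }

    @Test
    void keepsOnlyAdults() {
        // A tiny in-memory DataFrame serves as the test input
        StructType schema = new StructType()
                .add("name", "string")
                .add("age", "integer");
        Dataset<Row> people = spark.createDataFrame(
                Arrays.asList(RowFactory.create("Ana", 34),
                              RowFactory.create("Ben", 12)),
                schema);

        // The hypothetical logic under test: keep only rows with age >= 18
        Dataset<Row> adults = people.filter(col("age").geq(18));

        assertEquals(1, adults.count());
    }
}
```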

Deployment and cluster execution

In this chapter, we describe the basic steps to run a Spark application in local or standalone mode, as well as on a cluster hosted in the cloud. We do this by looking at the required instructions, commands, and the configuration needed by whatever services the application's execution might depend on.

Monitoring and performance fundamentals

The last part of the course covers monitoring and performance fundamentals. It provides recipes for troubleshooting errors by interpreting Spark logs and for avoiding or resolving performance bottlenecks in Spark applications. These recipes offer guidance on efficient application development.

We also look at the SparkUI, a handy tool that allows us to track Spark jobs, application executions, and resources through a user interface.
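As a small taste of that tooling, here is a sketch of how a session might enable Spark's event log so that job and stage information remains available after an application finishes; the application name and the log directory are hypothetical.

```java
// Assumes the chosen event-log directory exists (hypothetical path)
SparkSession spark = SparkSession.builder()
        .appName("MonitoredApp")
        .config("spark.eventLog.enabled", "true")                  // record UI events
        .config("spark.eventLog.dir", "file:///tmp/spark-events")  // where to store them
        .getOrCreate();
```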