Sharing data in a cluster

Sharing data in a distributed environment, regardless of the use case, can be confusing.

Understanding the scope (where the variables “live”) and the lifecycle (how their values change) of shared variables while executing code on a cluster is challenging.

Within the Spark ecosystem, variables defined in the driver program can be passed down to functions that run in a distributed fashion on the executors. However, each task receives its own copy of the variable, and each copy’s state diverges as execution proceeds.

Furthermore, this is one-way communication: the copies are never sent back to the driver program, so whatever updated values they hold on the workers are lost.
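A minimal sketch can make this concrete. Assuming Spark’s Scala API and a local `SparkSession` (the object name `ClosureCopyDemo` is illustrative), a driver-side variable mutated inside a closure is not updated on the driver:

```scala
import org.apache.spark.sql.SparkSession

object ClosureCopyDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("closure-copy-demo")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    var counter = 0

    // Each task receives its own deserialized copy of `counter`;
    // the increments happen on the executors, not on the driver.
    sc.parallelize(1 to 100).foreach(n => counter += n)

    // On a cluster this prints 0: the driver's copy was never updated.
    // (In local mode the exact behavior is not guaranteed either.)
    println(s"counter = $counter")

    spark.stop()
  }
}
```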

To address these limitations, the Spark API provides two kinds of shared variables: accumulators and broadcast variables.

Accumulators

Accumulators are variables that expose only an addition operation: we can add values to them, but we cannot delete or modify existing ones. In other words, accumulators provide a simple way of aggregating values from the worker nodes back to the driver program.
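As a quick sketch before the project, here is how a built-in long accumulator behaves, assuming Spark’s Scala API and a local `SparkSession` (the name `evenCount` is illustrative). Each task adds to its own partial copy, and Spark merges those partials back into the driver’s value:

```scala
import org.apache.spark.sql.SparkSession

object AccumulatorDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("accumulator-demo")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // A LongAccumulator supports only `add`; Spark merges the
    // per-task partial sums back into the driver's copy.
    val evenCount = sc.longAccumulator("evenCount")

    sc.parallelize(1 to 100)
      .foreach(n => if (n % 2 == 0) evenCount.add(1))

    // Reading `.value` is reliable only on the driver.
    println(s"even numbers seen: ${evenCount.value}") // 50

    spark.stop()
  }
}
```

Note that reading an accumulator’s value inside a task is not meaningful; only the driver sees the merged total.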

Let’s now work through a brief project that demonstrates the usage of both accumulators and broadcast variables.
