Deep Dive: Transformations and Data Storage

Get introduced to the mechanics behind a Spark transformation and, in particular, the memory scheme used to store the data that a transformation works on.

Tasks, Partitions, and Transformations

Several steps are triggered when reading from a source, such as a CSV file, or when applying a transformation to the records that have been read.

Behind the scenes, Spark takes a divide-and-conquer approach to significant volumes of data: it splits the data into partitions and processes those partitions in parallel across the cluster.
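To see this partitioning, here is a minimal, self-contained sketch. The session runs locally to stand in for a real cluster, and the file name `records.csv` is a hypothetical input used only for illustration:

```scala
import org.apache.spark.sql.SparkSession

object PartitionDemo {
  def main(args: Array[String]): Unit = {
    // A local session stands in for a real cluster; four worker threads
    // simulate four executor cores.
    val spark = SparkSession.builder()
      .appName("partition-demo")
      .master("local[4]")
      .getOrCreate()

    // "records.csv" is a hypothetical input file.
    val df = spark.read
      .option("header", "true")
      .csv("records.csv")

    // Each partition is the unit of work that Spark hands to a task.
    println(s"Number of partitions: ${df.rdd.getNumPartitions}")

    spark.stop()
  }
}
```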

When we applied a transformation in our previous lesson, we learned that the operation is carried out over the whole dataset. We also noted that this happens in a distributed fashion. So, how does Spark manage this? Let’s take a quick look.

When the first operation (reading from a CSV file) is triggered, the driver program kicks off parallel processing by fanning out the workload to the cluster’s worker nodes.
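Continuing the sketch above, one way to observe this fan-out is to tag each record with the ID of the partition (and therefore the task) that processed it. The `mapPartitions` call below is illustrative, not the lesson’s own example:

```scala
import org.apache.spark.TaskContext

// Reusing `df` from the previous sketch: each partition becomes one task,
// and tasks run in parallel across the worker nodes.
val tagged = df.rdd.mapPartitions { rows =>
  val partitionId = TaskContext.getPartitionId() // which partition this task owns
  rows.map(row => s"partition $partitionId processed: $row")
}

tagged.take(5).foreach(println)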

Let’s imagine a timeline along which a succession of operations takes place: each is first expressed as code in the driver program (static in nature) and then observed on the cluster nodes as running processes (dynamic in nature).
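To make this timeline concrete, here is a minimal sketch that continues the session from the first example. The file name and the `age` column are hypothetical stand-ins, and the comments mark which side of the timeline each line belongs to:

```scala
val df = spark.read
  .option("header", "true")
  .csv("records.csv")                 // first operation: the driver fans the
                                      // read out across the worker nodes

val adults = df.filter("age >= 18")   // transformation: applied to every
                                      // partition in parallel on the workers

adults.show(5)                        // a sample of the results returns
                                      // to the driver
```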
