Data Partitioning and Shuffling

Learn about the important concepts that every Spark developer should be familiar with: 'Partitioning' and 'Shuffling'.


The term “big data” refers to extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations. Unlike ordinary workloads, where the data in flight fits comfortably in the memory of a running JVM application (provided we didn’t leave a nasty footprint of memory leaks) until it is sent to a persistent data store like a DB, big data sets are far too large for a single machine to hold.

When we’re asked to process records by the millions or billions, traditional strategies and systems begin to buckle under the load.

Luckily for us, Spark comes to our aid and allows us to process these humongous volumes. However, one question remains: How does Spark fit massive volumes of information into its nodes?

This lesson intends to provide some insight into this question and touches on some important related concepts.

Note: Spark does have its limitations. Like any technology, it is no silver bullet, but it specializes in dealing with big data. We will learn about Spark Performance and Tuning techniques in an upcoming lesson, which will allow us to use Spark more efficiently with the resources we have at hand.

Data partitioning

In the first chapters, we learned about partitions. If you need a refresher, feel free to review those earlier lessons before proceeding with this one.

Data partitioning is the mechanism by which Spark divides the “to-be-processed” data into partitions and distributes them across multiple cluster nodes. This mechanism is crucial because it affects both performance and the use of available resources.
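To make this concrete, here is a minimal sketch of how we can inspect partitioning from code. It assumes a local SparkSession with four worker threads; the application name and the choice of eight partitions are illustrative, not prescribed by the lesson:

```scala
import org.apache.spark.sql.SparkSession

object PartitionInspection {
  def main(args: Array[String]): Unit = {
    // Local session with 4 worker threads; on a real cluster, .master()
    // would point at the cluster manager instead.
    val spark = SparkSession.builder()
      .appName("partition-inspection")
      .master("local[4]")
      .getOrCreate()

    // Spread one million numbers across 8 partitions and confirm the layout.
    val rdd = spark.sparkContext.parallelize(1 to 1000000, numSlices = 8)
    println(s"Number of partitions: ${rdd.getNumPartitions}")

    // Count the records in each partition without collecting the data itself.
    rdd.mapPartitionsWithIndex { (index, records) =>
      Iterator((index, records.size))
    }.collect().foreach { case (index, count) =>
      println(s"Partition $index holds $count records")
    }

    spark.stop()
  }
}
```

Each partition is processed independently by a task, which is why an even spread of records across partitions matters for performance.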

Why is data partitioning crucial? Let’s go through a scenario to illustrate this.


A company processes sales, each identified by a transaction ID. Some of the standard reports it generates rely on grouping the sales information of different sellers by ID. Sales volumes can reach the millions.


As Spark developers and big data users, we’ll load the sales information from a file. Behind the scenes, Spark will partition the data across the cluster’s nodes.
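Here is a rough sketch of the scenario, assuming a hypothetical sales.csv file with sellerId, transactionId, and amount columns (the file name and schema are our assumptions, not part of the lesson):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object SalesReport {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sales-report")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical input file: one row per transaction with sellerId,
    // transactionId, and amount columns.
    val sales = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("sales.csv")

    // Spark splits the input into partitions as it reads the file.
    println(s"Input partitions: ${sales.rdd.getNumPartitions}")

    // Grouping by sellerId triggers a shuffle: rows that share an ID are
    // moved onto the same partition so they can be aggregated together.
    val totalsBySeller = sales
      .groupBy("sellerId")
      .agg(sum("amount").as("totalSales"))

    totalsBySeller.show()
    spark.stop()
  }
}
```

The groupBy step is where shuffling enters the picture: records with the same sellerId must end up on the same node, so Spark redistributes them across the network. The number of partitions produced by such a shuffle is controlled by the spark.sql.shuffle.partitions setting (200 by default).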
