Dataset: a DataFrame of POJOs

Learn about the Dataset abstraction and its relation to a DataFrame and the map() function.

We'll cover the following...

What is a dataset?
Benefits of using a dataset
The code example

Outline of the project’s flow
Dataset code walkthrough

What is a dataset?

In previous lessons, we showed code snippets where the following was referred to as a DataFrame:

Dataset<Row> df = ...

By convention in the Spark world, a Dataset of rows (Dataset<Row>) is referred to as a DataFrame, whereas Dataset objects typed to any other class, including Plain Old Java Objects (POJOs), are simply called Datasets.

The name isn’t the only difference. A DataFrame in Spark, or “dataset of rows”, comes with a richer API out of the box.

We’ve already used some methods from that API to manipulate the schema, and that’s just the tip of the iceberg.

Benefits of using a dataset

The main benefit of using a Dataset is the ability to type it to a POJO or another object from our business domain. In code, that means using the following syntax:

Dataset<MyClass>

In turn, this means we’re not limited to DataFrames whose columns hold Spark types (Integer, String, Binary, Date, etc.). Instead, we can map the information to a collection of objects from our application’s domain.
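As a minimal sketch of what such a domain class might look like, the POJO below follows the JavaBean conventions that Spark's bean encoder (Encoders.bean) relies on: a public no-argument constructor plus getters and setters. The class name Car matches the example POJO used in this project, but the fields make, model, and year are assumptions chosen for illustration, not the project's actual fields.

```java
import java.io.Serializable;

// Hypothetical domain class: the fields below are assumptions for illustration.
public class Car implements Serializable {
    private String make;
    private String model;
    private int year;

    // Bean-style encoders need a public no-argument constructor.
    public Car() {
    }

    public Car(String make, String model, int year) {
        this.make = make;
        this.model = model;
        this.year = year;
    }

    // Getters and setters define the properties that become Dataset columns.
    public String getMake() {
        return make;
    }

    public void setMake(String make) {
        this.make = make;
    }

    public String getModel() {
        return model;
    }

    public void setModel(String model) {
        this.model = model;
    }

    public int getYear() {
        return year;
    }

    public void setYear(int year) {
        this.year = year;
    }

    @Override
    public String toString() {
        return "Car{make=" + make + ", model=" + model + ", year=" + year + "}";
    }
}
```

With a class like this in place, a declaration such as Dataset<Car> reads as "a distributed collection of Car objects" rather than a collection of generic rows.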

One limitation, though, is that the full DataFrame API (and the methods it exposes) isn’t available in the same form for Datasets typed to our POJOs; for example, column-based operations such as withColumn() return a Dataset<Row> rather than the typed Dataset. However, there are workarounds, such as custom mapping and conversions between the two Spark abstractions.
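The custom mapping and conversion mentioned above could be sketched roughly as follows. This is an illustrative example under stated assumptions, not the project's actual code: the class name DatasetConversionSketch, the helper methods sampleDataFrame and toCars, the two-field Car bean, and the sample rows are all hypothetical. It uses map() with a MapFunction and a bean encoder to go from Dataset<Row> to Dataset<Car>, and toDF() to go back.

```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class DatasetConversionSketch {

    // Hypothetical POJO; the fields are assumptions for illustration.
    public static class Car implements Serializable {
        private String make;
        private int year;

        public Car() {
        }

        public String getMake() { return make; }
        public void setMake(String make) { this.make = make; }
        public int getYear() { return year; }
        public void setYear(int year) { this.year = year; }
    }

    // Builds a tiny in-memory DataFrame so the sketch needs no input file.
    static Dataset<Row> sampleDataFrame(SparkSession spark) {
        StructType schema = new StructType()
                .add("make", DataTypes.StringType)
                .add("year", DataTypes.IntegerType);
        List<Row> rows = Arrays.asList(
                RowFactory.create("Toyota", 2019),
                RowFactory.create("Ford", 2021));
        return spark.createDataFrame(rows, schema);
    }

    // Custom mapping: each Row becomes a Car, encoded with a bean encoder.
    static Dataset<Car> toCars(Dataset<Row> df) {
        return df.map((MapFunction<Row, Car>) row -> {
            Car car = new Car();
            car.setMake(row.<String>getAs("make"));
            car.setYear(row.<Integer>getAs("year"));
            return car;
        }, Encoders.bean(Car.class));
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("dataset-conversion-sketch")
                .master("local[*]")
                .getOrCreate();

        Dataset<Row> df = sampleDataFrame(spark);   // DataFrame
        Dataset<Car> cars = toCars(df);             // typed Dataset
        Dataset<Row> backToDf = cars.toDF();        // DataFrame again

        cars.printSchema();
        backToDf.show();

        spark.stop();
    }
}
```

The cast to MapFunction<Row, Car> tells the Java compiler which overload of map() the lambda targets. When the column names already match the bean's properties, df.as(Encoders.bean(Car.class)) may be enough on its own; an explicit map() is mostly useful when fields need renaming or transforming along the way.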

The code example

It’s time to play around with Datasets. As usual, a diagram of what this Dataset project does helps before diving into the code.

Outline of the project’s flow

The diagram below shows the DataFrame as a Dataset of Row type and its conversion into a Dataset of Car type (Car is the example POJO used in the project).
