Datasets

This lesson examines the concept of Datasets in Spark.

Datasets

The Databricks official definition of a Dataset reads: "A Dataset is a strongly-typed, immutable collection of objects that are mapped to a relational schema." Datasets are a type-safe structured API available in the statically typed, Spark-supported JVM languages Java and Scala. Datasets are strictly a JVM language feature; they aren't supported in R and Python because those languages are dynamically typed. After Spark 2.0, the Dataset replaced the RDD as the primary API: it is strongly typed like an RDD, but with richer optimizations under the hood.

Datasets are made possible by a feature called the encoder. The encoder converts JVM types to Spark SQL's specialized internal (tabular) representation. Encoders are highly specialized and optimized code generators that generate custom bytecode for serialization and deserialization of your data, and they are required by all Datasets. An encoder maps a domain-specific type to Spark's internal representation for that type; domain-specific types are expressed as beans in Java and case classes in Scala. The tabular representation is stored using Spark's internal Tungsten binary format, allowing for operations on serialized data and improved memory utilization.

When using the Dataset API, Spark generates code at runtime to serialize a Java object into this internal binary structure and vice versa. The conversion can cost a little performance, but it brings several benefits. For example, because Spark understands the structure of the data in a Dataset, it can create a more optimal layout in memory when caching Datasets.
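As a quick illustration, here is a minimal Java sketch of an encoder at work. The Car bean, with make and year fields and a (String, int) convenience constructor, is an assumption used throughout the examples in this lesson; a version of it is defined in the Creating Datasets section below.

import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class EncoderExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("encoder-example")
                .master("local[*]")
                .getOrCreate();

        // Encoders.bean generates the serialization/deserialization code that
        // maps Car objects to and from Spark's internal Tungsten format.
        Encoder<Car> carEncoder = Encoders.bean(Car.class);

        Dataset<Car> cars = spark.createDataset(
                Arrays.asList(new Car("mercedes", 2020), new Car("toyota", 2021)),
                carEncoder);

        cars.printSchema(); // the bean's fields show up as a relational schema
        cars.show();

        spark.stop();
    }
}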

Differences with DataFrames

We'll need to contrast DataFrames and Datasets to gain a better understanding of both. Datasets check that types conform to the specification at compile time. DataFrames aren't truly untyped, as their types are maintained by Spark, but the verification that those types conform to the schema happens only at runtime. Said another way, a DataFrame can be thought of as a Dataset of type Row, where Row is Spark's internal, optimized in-memory representation for computation. Having its own internal representation of types allows Spark to avoid JVM objects, which can be slow to instantiate and carry garbage-collection costs.
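The following Java sketch illustrates the difference; the cars.txt file and the Car bean are assumptions carried over from the other examples in this lesson. A misspelled column name on a DataFrame compiles but only fails when the query is analyzed at runtime, while the same mistake against a typed Dataset is rejected by the compiler.

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TypedVersusUntyped {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("typed-versus-untyped")
                .master("local[*]")
                .getOrCreate();

        // A DataFrame is simply a Dataset of Row objects.
        Dataset<Row> carsDf = spark.read()
                .option("inferSchema", "true")
                .option("ignoreLeadingWhiteSpace", "true")
                .csv("cars.txt")
                .toDF("make", "year");

        // carsDf.select("yaer");  // compiles, but throws an AnalysisException at runtime

        // The same data as a strongly typed Dataset of Car beans.
        Dataset<Car> carsDs = carsDf.as(Encoders.bean(Car.class));

        // carsDs.map((MapFunction<Car, Integer>) c -> c.getYaer(), Encoders.INT());
        //     ^ does not compile: Car has no getYaer() method

        carsDs.map((MapFunction<Car, Integer>) c -> c.getYear() + 1, Encoders.INT()).show();

        spark.stop();
    }
}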

Use-cases for Datasets

It may seem redundant to have Datasets when we already have DataFrames. However, there are certain scenarios uniquely addressed by Datasets:

  • Some operations cannot be expressed with DataFrames, only with Datasets.

  • You may want type safety. For example, attempting to multiply two string variables will fail at compile time instead of at runtime. Additionally, development can be easier because IDEs and other tools can provide auto-completion and other hints when objects are strongly typed.

  • If your data and transformations are written against case classes (Scala) or beans (Java), it is trivial to reuse them for both distributed and local workloads, as the sketch after this list shows.
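As a sketch of the last point, the same plain Java predicate can drive both a local stream and a distributed Dataset. The Car bean with its make and year fields and (String, int) constructor is again an assumption for illustration.

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class ReuseExample {
    // Ordinary domain logic written once against the Car bean.
    static boolean isModern(Car car) {
        return car.getYear() >= 2015;
    }

    public static void main(String[] args) {
        List<Car> localCars = Arrays.asList(new Car("mercedes", 2020), new Car("Ford", 1995));

        // Local workload: plain Java streams, no Spark involved.
        List<Car> modernLocally = localCars.stream()
                .filter(ReuseExample::isModern)
                .collect(Collectors.toList());

        // Distributed workload: the same predicate applied to a Dataset.
        SparkSession spark = SparkSession.builder()
                .appName("reuse-example")
                .master("local[*]")
                .getOrCreate();

        Dataset<Car> cars = spark.createDataset(localCars, Encoders.bean(Car.class));
        Dataset<Car> modernDistributed = cars.filter((FilterFunction<Car>) ReuseExample::isModern);

        modernDistributed.show();
        spark.stop();
    }
}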

Creating Datasets

Datasets require the user to know the schema ahead of time. Suppose we have a Java class, Car, that represents a car. We can read the records in the following text file as objects of the Car class. The contents of the text file are as follows:

mercedes, 2020
toyota, 2021
porsche, 2019
Ford, 1995
Tesla, 2099

The Java code to read the above file contents appears as follows:

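The sketch below is one possible version of that code; the file name cars.txt, the application name, and the exact shape of the Car bean are assumptions for illustration. The bean needs a public no-argument constructor and getters/setters so that Encoders.bean can map its fields to a relational schema.

import java.io.Serializable;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class ReadCars {

    // Java bean describing one record of the text file.
    public static class Car implements Serializable {
        private String make;
        private int year;

        public Car() {}  // no-arg constructor required by the bean encoder

        public Car(String make, int year) {
            this.make = make;
            this.year = year;
        }

        public String getMake() { return make; }
        public void setMake(String make) { this.make = make; }
        public int getYear() { return year; }
        public void setYear(int year) { this.year = year; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("read-cars")
                .master("local[*]")
                .getOrCreate();

        // Read the comma-separated file as a DataFrame, name the columns after
        // the bean's fields, then convert it into a typed Dataset<Car>.
        Dataset<Car> cars = spark.read()
                .option("inferSchema", "true")
                .option("ignoreLeadingWhiteSpace", "true")
                .csv("cars.txt")
                .toDF("make", "year")
                .as(Encoders.bean(Car.class));

        cars.printSchema();
        cars.show();

        spark.stop();
    }
}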