Datasets

This lesson examines the concept of Datasets in Spark.

Datasets

The official Databricks definition of a Dataset reads: "A Dataset is a strongly-typed, immutable collection of objects that are mapped to a relational schema." Datasets are a type-safe structured API available in the statically typed languages that Spark supports, Java and Scala. They are strictly a JVM language feature: Datasets aren’t supported in R and Python because those languages are dynamically typed. Since Spark 2.0, the Dataset has replaced the RDD as Spark's main programming interface (RDDs remain available as a lower-level API). A Dataset is strongly typed like an RDD, but with richer optimizations under the hood.
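To make the "strongly typed collection of objects" idea concrete, here is a minimal Scala sketch. The `Person` case class, the sample rows, the application name, and the local master setting are all assumptions made up for illustration; only `SparkSession`, `Dataset`, `toDS`, `filter`, `map`, and `show` come from Spark's API. It needs the Spark SQL library on the classpath.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type, used only for illustration.
case class Person(name: String, age: Int)

object DatasetSketch {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("dataset-sketch")
      .master("local[*]")
      .getOrCreate()

    // Brings in the implicit conversions that enable .toDS() on local collections.
    import spark.implicits._

    // A Dataset[Person]: every element is a Person object.
    val people: Dataset[Person] = Seq(Person("Ana", 34), Person("Raj", 19)).toDS()

    // Typed transformations: the lambdas receive Person objects, so a
    // misspelled field such as p.agee is rejected by the compiler.
    val adults: Dataset[Person] = people.filter(p => p.age >= 21)
    val names: Dataset[String] = adults.map(_.name)

    names.show()
    spark.stop()
  }
}
```

Because the element type is part of the Dataset's type, mistakes such as referring to a field that does not exist surface at compile time rather than when the job runs.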

Datasets are possible because of a feature called the encoder. The encoder converts JVM types to Spark SQL’s specialized internal (tabular) representation. Encoders are ...
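As a rough illustration of the mapping from a JVM type to a tabular representation, the sketch below builds an encoder explicitly and prints the relational schema it derives. The `Movie` case class is a hypothetical type invented for this example; `Encoders.product` and `Encoder.schema` are part of Spark's Scala API. In everyday code the encoder is usually supplied implicitly through `spark.implicits._` rather than created by hand.

```scala
import org.apache.spark.sql.{Encoder, Encoders}

// Hypothetical JVM type, used only for illustration.
case class Movie(title: String, year: Int)

object EncoderSketch {
  def main(args: Array[String]): Unit = {
    // Derive an encoder for the case class (a Scala Product type).
    val movieEncoder: Encoder[Movie] = Encoders.product[Movie]

    // The schema is the relational (column-based) view of the JVM type:
    // one column per case class field.
    movieEncoder.schema.printTreeString()
  }
}
```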