Datasets
Explore Spark Datasets to understand their type-safe, immutable structure and how encoders optimize data serialization. Learn the differences between Datasets and DataFrames, discover their specific use cases, and practice creating Datasets with Java and Scala.
We'll cover the following...
Datasets
The official Databricks definition of a Dataset reads: A Dataset is a strongly-typed, immutable collection of objects that are mapped to a relational schema. Datasets are a type-safe structured API available in the statically typed, Spark-supported languages Java and Scala. Datasets are strictly a JVM language feature; they aren't supported in R and Python because those languages are dynamically typed. Since Spark 2.0, the Dataset has superseded the RDD as the primary API: it is strongly typed like an RDD, but with richer optimizations under the hood.
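To make the type-safety point concrete, here is a minimal sketch of creating a Dataset in Scala. It assumes a Spark 3.x dependency on the classpath; the `Person` case class and the sample data are illustrative, not from the lesson.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative domain type; its fields become the relational schema.
case class Person(name: String, age: Int)

object DatasetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DatasetExample")
      .master("local[*]")  // local mode for experimentation
      .getOrCreate()

    // Brings implicit encoders for case classes into scope.
    import spark.implicits._

    // A strongly-typed Dataset[Person] built from local objects.
    val people = Seq(Person("Ada", 36), Person("Grace", 45)).toDS()

    // Typed transformations are checked at compile time:
    // people.filter(_.age > 40) compiles, whereas a reference to a
    // nonexistent field like _.salary would be a compilation error,
    // not a runtime failure as it would be with a DataFrame column name.
    people.filter(_.age > 40).show()

    spark.stop()
  }
}
```

Because `people` is a `Dataset[Person]` rather than a `DataFrame` (`Dataset[Row]`), the compiler verifies field names and types in every transformation.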
Datasets are possible because of a feature called the encoder. The encoder converts JVM types to Spark SQL’s specialized internal (tabular) representation. Encoders are ...
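The mapping an encoder maintains between a JVM type and the internal tabular representation can be inspected directly. This is a small sketch, again assuming a Spark 3.x dependency; the `Sale` case class is hypothetical.

```scala
import org.apache.spark.sql.{Encoder, Encoders}

// Hypothetical domain type used only to illustrate encoders.
case class Sale(item: String, amount: Double)

object EncoderExample {
  def main(args: Array[String]): Unit = {
    // Encoders.product derives an encoder for any case class.
    val saleEncoder: Encoder[Sale] = Encoders.product[Sale]

    // The encoder carries the relational schema Spark uses for its
    // internal binary (Tungsten) row format.
    println(saleEncoder.schema)
  }
}
```

Printing the schema shows how each case-class field is mapped to a Spark SQL column type, which is exactly the relational schema the definition above refers to.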