Datasets

Get an introduction to the strongly typed Datasets API available in Spark.


Below is the definition of a Dataset from the official Databricks documentation:

“A Dataset is a strongly-typed, immutable collection of objects that are mapped to a relational schema. Datasets are a type-safe structured API available in statically typed, Spark supported languages Java and Scala. Datasets are strictly a JVM language feature. Datasets aren’t supported in R and Python since these languages are dynamically typed languages”.
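To make the definition concrete, here is a minimal Scala sketch. It assumes Spark 2.x or later is on the classpath and that the job runs in local mode. The Employee case class, its fields, the helper adultNames, and the sample rows are all hypothetical, introduced only for illustration.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type for illustration; its fields become the columns
// of the relational schema that the Dataset is mapped to.
final case class Employee(name: String, age: Int)

object TypedDatasetSketch {
  // A typed transformation: the lambdas receive Employee objects, so field
  // access is checked by the Scala compiler rather than at runtime.
  def adultNames(people: Dataset[Employee]): Dataset[String] = {
    import people.sparkSession.implicits._ // brings the needed encoders into scope
    people.filter((e: Employee) => e.age >= 18).map(e => e.name)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("typed-dataset-sketch")
      .master("local[*]") // assumption: local mode, for experimentation only
      .getOrCreate()
    import spark.implicits._

    // Build a Dataset[Employee] from an in-memory collection.
    val people: Dataset[Employee] =
      Seq(Employee("Ana", 34), Employee("Ben", 16)).toDS()

    people.printSchema()       // schema columns are derived from the case class fields
    adultNames(people).show()  // runs the typed filter and map

    // Employee has no `salary` field, so the line below would be rejected
    // at compile time instead of failing while the job runs:
    // people.map(e => e.salary)

    spark.stop()
  }
}
```

Because the element type is part of the Dataset's type, a mistake such as reading a field that does not exist surfaces when the program is compiled, which is the practical meaning of "type-safe" in the definition above.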

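Starting with Spark 2.0, the Dataset replaced the RDD as Spark's primary programming interface. Like an RDD, a Dataset is strongly typed, but it benefits from richer optimizations under the hood. The RDD API is still supported, so existing RDD code continues to work.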
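The relationship between the two abstractions can be sketched as follows. This is an illustrative example, not a benchmark: it assumes Spark 2.x or later in local mode, and the Reading case class, the helper highReadings, and the sample values are hypothetical. The same filter is expressed once on an RDD and once on a Dataset, and explain() prints the plan Spark builds for the Dataset version.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type for illustration only.
final case class Reading(sensor: String, value: Double)

object RddVersusDatasetSketch {
  // Column-based filter on a Dataset. Spark sees the expression itself
  // (column, operator, literal), so it can plan and optimize the query.
  def highReadings(ds: Dataset[Reading], threshold: Double): Dataset[Reading] =
    ds.filter(ds("value") > threshold)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-vs-dataset-sketch")
      .master("local[*]") // assumption: local mode, for experimentation only
      .getOrCreate()
    import spark.implicits._

    val rows = Seq(Reading("a", 1.5), Reading("b", 9.0))

    // RDD version: strongly typed, but the lambda is opaque to Spark.
    val rdd: RDD[Reading] = spark.sparkContext.parallelize(rows)
    val highFromRdd: RDD[Reading] = rdd.filter(_.value > 5.0)

    // Dataset version: the same data, converted from the RDD.
    val ds: Dataset[Reading] = rdd.toDS()
    val highFromDs: Dataset[Reading] = highReadings(ds, 5.0)

    highFromDs.explain() // prints the physical plan for the Dataset query

    // Both versions select the same records.
    println(highFromRdd.count())
    println(highFromDs.count())

    spark.stop()
  }
}
```

Both versions keep the Reading type, but only the Dataset query passes through Spark's query planner, which is where the additional optimizations come from.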
