Datasets with Scala Case Class and Java Bean Class
Explore how to create Spark Datasets from Scala case classes and Java bean classes, including schema definition and the use of encoders. Understand how to apply higher-order functions like filter and map to manipulate datasets. Gain practical knowledge on using lambda expressions in Scala and equivalent functional interfaces in Java for dataset operations.
Generating data using SparkSession
We can also create a Dataset from a collection of objects using a SparkSession object's createDataset method, as demonstrated below.
// define a case class that specifies the Dataset's schema
scala> case class MovieDetailShort(imdbID: String, rating: Int)
// define a random number generator
scala> val rnd = new scala.util.Random(9)
// create some data
scala> val data = for (i <- 0 to 100) yield MovieDetailShort("movie-" + i, rnd.nextInt(10))
// use the SparkSession to generate a Dataset consisting of the objects created in the previous step
scala> val datasetMovies = spark.createDataset(data)
// display three rows from the Dataset
scala> datasetMovies.show(3)
+-------+------+
| imdbID|rating|
+-------+------+
|movie-0| 0|
|movie-1| 3|
|movie-2| 8|
+-------+------+
only showing top 3 rows
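
Because datasetMovies is a typed Dataset, we can manipulate it with higher-order functions such as filter and map, passing Scala lambda expressions directly. The sketch below (variable names are illustrative) keeps only highly rated movies and then projects their IDs:

// retain movies rated 7 or above, then extract just the imdbID field
scala> val topRated = datasetMovies.filter(m => m.rating >= 7).map(m => m.imdbID)
// display three of the matching IDs
scala> topRated.show(3)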
When working with Scala, we didn't have to specify an encoder explicitly because Spark resolves one for us through implicits (imported automatically in the spark-shell). This is not the case in Java, where we have to supply the encoder ourselves. The equivalent Java bean class for MovieDetailShort needs a public no-argument constructor and getter/setter pairs, and should implement Serializable:
import java.io.Serializable;

public class MovieDetailShort implements Serializable {
    private String imdbID;
    private int rating;

    // the bean encoder requires a public no-argument constructor
    public MovieDetailShort() {}

    public MovieDetailShort(String imdbID, int rating) {
        this.imdbID = imdbID;
        this.rating = rating;
    }

    // Spark infers the schema from the bean's getter/setter pairs
    public String getImdbID() { return imdbID; }
    public void setImdbID(String imdbID) { this.imdbID = imdbID; }

    public int getRating() { return rating; }
    public void setRating(int rating) { this.rating = rating; }
}
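
With the bean class in place, the Java equivalent of the Scala snippet passes the encoder explicitly via Encoders.bean. A minimal sketch, assuming a SparkSession instance named spark is already available:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;

// generate the same sample data as the Scala snippet
Random rnd = new Random(9);
List<MovieDetailShort> data = new ArrayList<>();
for (int i = 0; i <= 100; i++) {
    data.add(new MovieDetailShort("movie-" + i, rnd.nextInt(10)));
}

// in Java, the bean encoder must be supplied explicitly
Dataset<MovieDetailShort> datasetMovies =
    spark.createDataset(data, Encoders.bean(MovieDetailShort.class));

// display three rows from the Dataset
datasetMovies.show(3);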