Datasets with Scala Case Class and Java Bean Class
Explore how to create Spark Datasets from Scala case classes and Java bean classes, including schema definition and the use of encoders. Understand how to apply higher-order functions like filter and map to manipulate datasets. Gain practical knowledge on using lambda expressions in Scala and equivalent functional interfaces in Java for dataset operations.
Generating data using SparkSession
We can also create a Dataset from a collection of objects using a SparkSession object's createDataset method, as demonstrated below.
// define a case class that specifies the Dataset's schema
scala> case class MovieDetailShort(imdbID: String, rating: Int)
// define a random number generator
scala> val rnd = new scala.util.Random(9)
// create some data
scala> val data = for (i <- 0 to 100) yield MovieDetailShort("movie-" + i, rnd.nextInt(10))
// use the SparkSession to generate a Dataset consisting of the objects created in the previous step
scala> val datasetMovies = spark.createDataset(data)
// display three rows from the Dataset
scala> datasetMovies.show(3)
+-------+------+
| imdbID|rating|
+-------+------+
|movie-0| 0|
|movie-1| 3|
|movie-2| 8|
+-------+------+
only showing top 3 rows
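
Because datasetMovies is a typed Dataset, we can manipulate it with higher-order functions such as filter and map, passing Scala lambda expressions directly. The sketch below (variable names are illustrative) keeps only highly rated movies and then projects their IDs:

// retain movies rated 7 or above, then extract just the imdbID field
scala> val topRated = datasetMovies.filter(m => m.rating >= 7).map(m => m.imdbID)
// display three of the matching IDs
scala> topRated.show(3)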
When working with Scala, we didn't have to specify an encoder explicitly because Spark resolves one for us through implicits (imported automatically in the spark-shell). This is not the case in Java, where we have to supply the encoder ourselves. The equivalent Java bean class for MovieDetailShort needs a public no-argument constructor and getter/setter pairs, and should implement Serializable:
import java.io.Serializable;

public class MovieDetailShort implements Serializable {
    private String imdbID;
    private int rating;

    // the bean encoder requires a public no-argument constructor
    public MovieDetailShort() {}

    public MovieDetailShort(String imdbID, int rating) {
        this.imdbID = imdbID;
        this.rating = rating;
    }

    // Spark infers the schema from the bean's getter/setter pairs
    public String getImdbID() { return imdbID; }
    public void setImdbID(String imdbID) { this.imdbID = imdbID; }

    public int getRating() { return rating; }
    public void setRating(int rating) { this.rating = rating; }
}
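
With the bean class in place, the Java equivalent of the Scala snippet passes the encoder explicitly via Encoders.bean. A minimal sketch, assuming a SparkSession instance named spark is already available:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;

// generate the same sample data as the Scala snippet
Random rnd = new Random(9);
List<MovieDetailShort> data = new ArrayList<>();
for (int i = 0; i <= 100; i++) {
    data.add(new MovieDetailShort("movie-" + i, rnd.nextInt(10)));
}

// in Java, the bean encoder must be supplied explicitly
Dataset<MovieDetailShort> datasetMovies =
    spark.createDataset(data, Encoders.bean(MovieDetailShort.class));

// display three rows from the Dataset
datasetMovies.show(3);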