Spark SQL Data Source

Learn about the various sources and formats of data that can be read and written using Spark SQL.

We'll cover the following...

Reading data into DataFrames
DataFrameReader
DataFrameWriter
Formats
Parquet
JSON
CSV
Other formats

Reading data into DataFrames

Once data has been ingested, processed, and loaded into Spark SQL databases and tables, it can be read as DataFrames. An example is shown below:

scala> val movies = spark.read.format("csv").
                              option("header", "true").
                              option("samplingRatio", 0.001).
                              option("inferSchema", "true").
                              load("/data/BollywoodMovieDetail.csv")

scala> movies.write.saveAsTable("movieData")

scala> val movieTitles = spark.sql("SELECT title FROM movieData")

scala> movieTitles.show(3, false)
+---------------------------------+
|title                            |
+---------------------------------+
|Albela                           |
|Lagaan: Once Upon a Time in India|
|Meri Biwi Ka Jawab Nahin         |
+---------------------------------+
only showing top 3 rows

In the example above, we read the CSV file into a DataFrame, save it as the Spark SQL table movieData, and then run a Spark SQL query that returns only the movie titles as a new DataFrame.
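As a complement to the query above, the following is a minimal sketch of reading a saved table back as a DataFrame in two ways: with spark.sql and with spark.table. It is an illustration under stated assumptions, not part of the original lesson. The table name movie_data_sketch, the id column, the temporary warehouse directory, and the local master are all hypothetical choices made so the example runs on its own; the two titles are taken from the output shown earlier.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object ReadTableSketch {
  // Returns the titles read through a SQL query and through spark.table.
  def run(): (Seq[String], Seq[String]) = {
    val spark = SparkSession.builder()
      .appName("ReadTableSketch")
      .master("local[*]") // assumption: local mode, for illustration only
      // Assumption: a throwaway warehouse directory so reruns start clean.
      .config("spark.sql.warehouse.dir",
        Files.createTempDirectory("warehouse-sketch").toString)
      .getOrCreate()
    import spark.implicits._

    try {
      // Hypothetical stand-in for the movie data; the id column is made up.
      Seq((1, "Albela"), (2, "Lagaan: Once Upon a Time in India"))
        .toDF("id", "title")
        .write.mode("overwrite").saveAsTable("movie_data_sketch")

      // Option 1: a SQL query returns a DataFrame.
      val viaSql = spark.sql("SELECT title FROM movie_data_sketch")
        .as[String].collect().toSeq.sorted

      // Option 2: spark.table returns the whole table as a DataFrame.
      val viaApi = spark.table("movie_data_sketch").select("title")
        .as[String].collect().toSeq.sorted

      (viaSql, viaApi)
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (viaSql, viaApi) = run()
    println(s"SQL query:   $viaSql")
    println(s"spark.table: $viaApi")
  }
}
```

Both routes should yield the same rows; which one to use is largely a matter of whether the rest of the code is written in SQL or with DataFrame methods.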

DataFrameReader

We touched on DataFrameReader briefly in an earlier lesson. It is the core construct for reading data from a source into a DataFrame. Chaining method calls is a common pattern in Spark, and it is the recommended way to use DataFrameReader. The general usage pattern is:

DataFrameReader.format(args).option("key", "value").schema(args).load()
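The pattern above can be sketched concretely as follows. This is a hedged illustration rather than a definitive recipe: it obtains the reader through spark.read, as the earlier example does, and supplies an explicit schema instead of inferring one. The temporary CSV file, its two placeholder rows, the releaseYear column, and the local master are assumptions introduced only to make the example self-contained.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object ReaderTemplateSketch {
  // Returns the column names of the DataFrame produced by the reader chain.
  def run(): Seq[String] = {
    val spark = SparkSession.builder()
      .appName("ReaderTemplateSketch")
      .master("local[*]") // assumption: local mode, for illustration only
      .getOrCreate()

    try {
      // Hypothetical input: a tiny CSV file with placeholder rows.
      val csvPath = Files.createTempFile("movies-sketch", ".csv")
      Files.write(
        csvPath,
        "title,releaseYear\nMovie A,2000\nMovie B,2001\n"
          .getBytes(StandardCharsets.UTF_8))

      // An explicit schema avoids the extra pass over the data that
      // inferSchema needs.
      val schema = StructType(Seq(
        StructField("title", StringType, nullable = true),
        StructField("releaseYear", IntegerType, nullable = true)))

      // format -> option -> schema -> load, mirroring the template.
      val movies = spark.read
        .format("csv")
        .option("header", "true")
        .schema(schema)
        .load(csvPath.toString)

      movies.printSchema()
      movies.columns.toSeq
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    println(run())
  }
}
```

Each call in the chain returns the reader itself until load, which returns the DataFrame, so the steps can be added or omitted independently.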

We can’t instantiate the DataFrameReader ...
