Working with DataFrames

Work through an example to learn how to read and write Spark DataFrames.

We'll cover the following

Reading and writing data in Spark is very convenient given the high-level abstractions available to connect to a variety of external data sources, such as Kafka, RDBMSs, or NoSQL stores. Spark provides an interface, DataFrameReader, that allows us to read data into a DataFrame from various sources and in a number of formats such as JSON, CSV, Parquet, or Text.

For any meaningful data analysis, we’ll be creating DataFrames from data files. When doing so, we can either instruct Spark to infer the schema of the data itself or specify it for Spark. Let’s see examples of both below:

Creating DataFrames

The following snippet reads the data file BollywoodMovieDetail.csv from the location /data/BollywoodMovieDetail.csv.

val movies = spark.read.format("csv")
  .option("header","true")
  .option("inferSchema","true")
  .load("/data/BollywoodMovieDetail.csv")

The Spark inferred schema can be examined as follows:

movies.schema

The output of the above command is shown in the widget below:

Get hands-on with 1000+ tech skills courses.