Working with DataFrames

Explore how to create and manipulate Spark DataFrames by reading data from various formats and sources. Understand schema inference and how to define schemas explicitly using both programmatic and DDL methods. Learn how to write DataFrames to external storage in different formats, focusing on structured data analysis and preparation.

We'll cover the following...

Creating DataFrames
Writing DataFrames

Spark's high-level abstractions make it straightforward to read from and write to a variety of external data sources, such as Kafka, RDBMSs, or NoSQL stores. The DataFrameReader interface, reached through spark.read, loads data into a DataFrame from these sources in formats such as JSON, CSV, Parquet, or plain text.
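As a minimal sketch of how DataFrameReader handles different formats, the example below parses JSON and CSV content held in small in-memory datasets, so it runs without any external files. The object name ReaderFormatsSketch, the column names, and the sample row are all made up for illustration, and a local SparkSession is assumed.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ReaderFormatsSketch {
  // Returns one DataFrame parsed from JSON lines and one parsed from CSV lines.
  // The rows and column names are illustrative only.
  def load(spark: SparkSession): (DataFrame, DataFrame) = {
    import spark.implicits._

    // One JSON document per line; Spark infers the columns from the keys.
    val jsonLines = Seq("""{"title": "Example Movie", "releaseYear": 2001}""").toDS()
    val fromJson = spark.read.json(jsonLines)

    // CSV text whose first line is a header row.
    val csvLines = Seq("title,releaseYear", "Example Movie,2001").toDS()
    val fromCsv = spark.read.option("header", "true").csv(csvLines)

    (fromJson, fromCsv)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("ReaderFormatsSketch")
      .getOrCreate()

    val (fromJson, fromCsv) = load(spark)
    fromJson.show()
    fromCsv.show()

    spark.stop()
  }
}
```

File-based sources follow the same pattern: choose the format with format(...) and pass a path to load(...).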

For any meaningful data analysis, we’ll be creating DataFrames from data files. When doing so, we can either instruct Spark to infer the schema of the data itself or specify it for Spark. Let’s see examples of both below:

Creating DataFrames

The following snippet reads the CSV file /data/BollywoodMovieDetail.csv, treating its first row as column names and asking Spark to infer each column's type.

val movies = spark.read.format("csv")
  .option("header", "true")       // treat the first row as column names
  .option("inferSchema", "true")  // let Spark infer each column's type from the data
  .load("/data/BollywoodMovieDetail.csv")

The Spark-inferred schema can be examined as follows:

...
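Instead of inferring the schema, we can specify it ourselves, either programmatically or as a DDL string. The sketch below shows both styles side by side. The three columns (title, releaseYear, genre) and the sample row are hypothetical, since the actual layout of the movie file is not listed here, and the object name ExplicitSchemaSketch is made up for this example. The CSV content is held in memory so the code runs without the data file.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object ExplicitSchemaSketch {
  // Programmatic definition: one StructField per column.
  // Column names and types are hypothetical.
  val programmaticSchema: StructType = StructType(Seq(
    StructField("title", StringType, nullable = true),
    StructField("releaseYear", IntegerType, nullable = true),
    StructField("genre", StringType, nullable = true)
  ))

  // The same hypothetical schema written as a DDL string.
  val ddlSchema: String = "title STRING, releaseYear INT, genre STRING"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("ExplicitSchemaSketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up CSV content: a header line followed by a single row.
    val lines = Seq("title,releaseYear,genre", "Example Movie,2001,Drama").toDS()

    // schema(...) accepts either a StructType or a DDL string.
    val viaStructType = spark.read.option("header", "true").schema(programmaticSchema).csv(lines)
    val viaDdl = spark.read.option("header", "true").schema(ddlSchema).csv(lines)

    viaStructType.printSchema()
    viaDdl.printSchema()

    spark.stop()
  }
}
```

When reading a real file, the same schema(...) call slots into the format("csv") and load(...) chain in place of the inferSchema option. The schema should then list every column in the order in which it appears in the file.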
