Schema Manipulation

Let’s focus on expanding the concept and use cases for the schema of a DataFrame.

Schemas

In previous lessons, we interchangeably used the terms “schema” and “structure” when learning how to manipulate a DataFrame, an abstraction with a well-defined representation composed of columns and rows.

However, now that we are using SQL syntax in Spark alongside the DataFrame API, there is an important distinction to make:


A schema in Spark defines the number and type of columns that a DataFrame comprises. It should not be confused with an RDBMS schema, which usually refers to a grouping of tables that serves as the logical construct for a data model.
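To make the definition concrete, here is a minimal sketch of a schema as a standalone object, assuming the `spark-sql` dependency is on the classpath. The class name and the three columns (`ride`, `capacity`, `avg_wait_minutes`) are hypothetical and chosen only for illustration. They are not taken from the lesson's dataset.

```java
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class SchemaDefinitionSketch {

    // Builds a schema with three hypothetical columns (illustration only).
    static StructType buildSchema() {
        return DataTypes.createStructType(new StructField[] {
            // name, data type, nullable
            DataTypes.createStructField("ride", DataTypes.StringType, false),
            DataTypes.createStructField("capacity", DataTypes.IntegerType, true),
            DataTypes.createStructField("avg_wait_minutes", DataTypes.DoubleType, true)
        });
    }

    public static void main(String[] args) {
        StructType schema = buildSchema();

        // The schema carries the number of columns...
        System.out.println("Number of columns: " + schema.fields().length);

        // ...and the name, type, and nullability of each one.
        for (StructField field : schema.fields()) {
            System.out.println(field.name() + " -> " + field.dataType().simpleString()
                + " (nullable: " + field.nullable() + ")");
        }
    }
}
```

Note that nothing here groups tables together: the schema only describes columns, which is the difference from the RDBMS meaning of the word.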


Every time a DataFrame is created (from a file, as the result of transformations applied to another DataFrame, and so on), a schema is attached to it that specifies its representational structure. This schema can also be created and manipulated programmatically; we learned how to manipulate it in previous lessons.

Let’s learn how to create a schema and a DataFrame from a collection of elements directly in our Java code.
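As a rough preview, the sketch below shows one way this can look with Spark's Java API: a schema is defined, a small list of rows is built by hand, and both are passed to `createDataFrame`. The class name, column names, and sample values are assumptions made for illustration, and the local `SparkSession` settings may differ from the lesson's project.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class DataFrameFromCollectionSketch {

    static Dataset<Row> buildDataFrame(SparkSession spark) {
        // Hypothetical two-column schema (not the lesson's dataset).
        StructType schema = DataTypes.createStructType(new StructField[] {
            DataTypes.createStructField("ride", DataTypes.StringType, false),
            DataTypes.createStructField("capacity", DataTypes.IntegerType, true)
        });

        // A plain Java collection of rows; values must match the schema's order and types.
        List<Row> rows = Arrays.asList(
            RowFactory.create("Carousel", 40),
            RowFactory.create("Ferris wheel", 64));

        // The schema is attached to the DataFrame at creation time.
        return spark.createDataFrame(rows, schema);
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("schema-sketch")   // assumed application name
            .master("local[*]")         // assumed local execution
            .getOrCreate();

        Dataset<Row> df = buildDataFrame(spark);
        df.printSchema();
        df.show();

        spark.stop();
    }
}
```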

Working with schemas programmatically

For this lesson, the codebase looks like this:

mvn install exec:exec
Project with the codebase to manipulate DataFrame's schema

We can reuse the theme park CSV file from the previous lesson. Once it is loaded, we print the first record to remind ourselves of the information structure by invoking the first() method on the df variable.

This method returns a Row object, which allows for schema inspection because every row belonging to a DataFrame shares its schema. ...
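The idea of inspecting the schema through a single row can be sketched as follows. Because the CSV file's path and columns are not shown here, this example uses a tiny in-memory DataFrame as a stand-in. The class name, column names, and values are hypothetical.

```java
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class RowSchemaSketch {

    // Stand-in for the DataFrame loaded from the CSV file (hypothetical columns).
    static Dataset<Row> sampleDataFrame(SparkSession spark) {
        StructType schema = new StructType()
            .add("ride", DataTypes.StringType)
            .add("capacity", DataTypes.IntegerType);
        return spark.createDataFrame(
            Arrays.asList(RowFactory.create("Carousel", 40)), schema);
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("row-schema-sketch")  // assumed application name
            .master("local[*]")            // assumed local execution
            .getOrCreate();

        Dataset<Row> df = sampleDataFrame(spark);

        // first() returns the first Row of the DataFrame.
        Row first = df.first();
        System.out.println(first);

        // The row exposes the schema it shares with the DataFrame.
        first.schema().printTreeString();

        // The schema also lets us look up a column's position by name.
        int capacityIndex = first.fieldIndex("capacity");
        System.out.println("capacity = " + first.getInt(capacityIndex));

        spark.stop();
    }
}
```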