Schema Manipulation

Let’s focus on expanding the concept and use cases for the schema of a DataFrame.

We'll cover the following...

- Schemas
- Working with schemas programmatically
- Plain Old Java Objects (POJOs) and schemas

Schemas

In previous lessons, we interchangeably used the terms “schema” and “structure” when learning how to manipulate a DataFrame, an abstraction with a well-defined representation composed of columns and rows.

Now that we are using Spark's SQL syntax alongside the DataFrame API, however, one distinction needs to be made clear:

A schema in Spark defines the number, names, and types of the columns that a DataFrame comprises. It should not be confused with an RDBMS schema, which usually refers to a grouping of tables that serves as the logical construct for a data model.

Every time a DataFrame is created (from a file, from another DataFrame as the result of transformations, and so on), a schema is attached to it that specifies its representational structure. This schema can also be created and manipulated programmatically; we covered manipulation in previous lessons.

Let’s learn how to create a schema and a DataFrame from a collection of elements in our Java program by simply writing code.

Working with schemas programmatically

For this lesson, the codebase looks like this:
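
The course project files are not reproduced in this text. As a rough stand-in, the sketch below shows one way to build a schema by hand and attach it to a DataFrame created from an in-memory collection. It assumes the `spark-sql` dependency is on the classpath and that a local master is acceptable for experimentation. The class name `SchemaFromCollection`, the column names `name` and `age`, and the sample rows are all hypothetical and do not come from the course codebase.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Hypothetical example class; not part of the course project.
public class SchemaFromCollection {
    public static void main(String[] args) {
        // Local session for experimentation only (assumption: no cluster is needed here).
        SparkSession spark = SparkSession.builder()
                .appName("schema-from-collection")
                .master("local[*]")
                .getOrCreate();

        // The schema: two columns, each with a name, a data type, and a nullable flag.
        StructType schema = DataTypes.createStructType(new StructField[] {
                DataTypes.createStructField("name", DataTypes.StringType, false),
                DataTypes.createStructField("age", DataTypes.IntegerType, true)
        });

        // Made-up in-memory rows; values must follow the schema's column order and types.
        List<Row> rows = Arrays.asList(
                RowFactory.create("Ana", 34),
                RowFactory.create("Luis", null));

        // Attach the hand-built schema to the collection to obtain a DataFrame.
        Dataset<Row> df = spark.createDataFrame(rows, schema);

        // Print the column names, types, and nullability, then the rows themselves.
        df.printSchema();
        df.show();

        spark.stop();
    }
}
```

Calling `printSchema()` is a quick way to confirm that the DataFrame carries the schema we defined rather than one Spark inferred on its own.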
