
Working with DataFrame's Schemas

Explore how to inspect and transform DataFrame schemas in Spark using Java. Understand methods to drop and rename columns, apply normalization, and prepare data to match database tables while handling CSV input that diverges from source models. This lesson helps you align data formats with normalized database structures for better integration.

Let’s imagine a scenario where the requirement is to ingest a CSV file whose format doesn’t match our DataSource model. Changing an existing data model can be costly and may impact other applications that feed off the database.

Fortunately, we already know how to modify data once it resides in a Spark DataFrame and change its structure (for example, by adding a column, as we did in one of our previous lessons). The API also offers the possibility of removing columns, among other useful operations. Let’s see how we can achieve this.

Working on ingested data

The previous hypothetical requirement can be defined as follows:


A client is sending data, to be stored in our DataSource (DB), in a CSV format that doesn’t fit our normalized data model.

Our application should perform the necessary transformations to persist that information to the DB with a matching structure.


The following widget contains the codebase for this lesson:

mvn install exec:exec
Project to ingest from CSV and work on DataFrame schemas

To ease our way into understanding the code snippet and the application logic, let’s illustrate the application’s flow with a simple diagram.

Note: If curiosity “sparks” within the reader, the following link can serve as a gentle introduction to what data normalization is, although it’s technically not necessary for this course.

The application ingests a CSV file (on the left of the diagram) using the same API methods we’ve used before. Right after this, it calls two transformations on the DataFrame.
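To make this concrete, here is a minimal sketch of that ingestion step in Java. The file path data/authors.csv and the reader options are assumptions for illustration; the project’s actual file name and options may differ.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Start (or reuse) a local Spark session.
SparkSession spark = SparkSession.builder()
        .appName("CSV ingestion and schema transformations")
        .master("local[*]")
        .getOrCreate();

// Read the client's CSV into a DataFrame.
// Assumption: the file has a header row and lives at data/authors.csv.
Dataset<Row> df = spark.read()
        .format("csv")
        .option("header", "true")       // first line holds column names
        .option("inferSchema", "true")  // let Spark guess column types
        .load("data/authors.csv");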

But first, we need to learn some new API invocations to inspect the schema, or structure, of the DataFrame. This is necessary because our database has been normalized to split information into related tables, rather than keeping a single massive, hard-to-maintain table.
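Spark exposes a few simple calls for exactly that. A short sketch, continuing with the df from the ingestion step above:

import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Print the schema as a readable tree: column names, types, and nullability.
df.printSchema();

// Or inspect the structure programmatically.
StructType schema = df.schema();
for (StructField field : schema.fields()) {
    System.out.println(field.name() + " -> " + field.dataType());
}

// The plain column names are also available as an array.
String[] columnNames = df.columns();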

Afterwards, we can apply transformations to drop columns and separate the information into two DataFrames that resemble the two tables in the database. ...
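For illustration only: assuming the ingested CSV carries hypothetical columns name, link, title, and publication_date (the real column and table names come from the project’s data model), dropping columns and splitting the data into two table-shaped DataFrames could look like this:

import static org.apache.spark.sql.functions.col;

// A DataFrame shaped like a hypothetical "authors" table: remove the
// book-related columns and rename "link" to a normalized column name.
Dataset<Row> authorsDf = df
        .drop("title", "publication_date")
        .withColumnRenamed("link", "wikipedia_link");

// A second DataFrame shaped like a hypothetical "books" table.
Dataset<Row> booksDf = df.select(col("title"), col("publication_date"));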