Working with DataFrame Schemas

Learn about a DataFrame's structure, or schema, through practical examples from a new project.

We'll cover the following

Working on ingested data
Inspecting a DataFrame’s schema
Transforming the schema
Applying normalization
- Removing duplicates from Subjects table DataFrame
Books DataFrame: Adding ID and foreign key columns

Let’s imagine a scenario where the requirement is to ingest one CSV file whose format doesn’t match our DataSource model. Changing an existing data model can be too costly, and it can affect other applications that rely on the database.

Fortunately, we already know how to modify data once it resides in a Spark DataFrame and how to change its structure, as we did when adding a column in a previous lesson. The API also supports removing columns and other useful operations. Let’s see how we can apply them.

Working on ingested data

The previous hypothetical requirement can be defined as follows:

A client sends data to be stored in our DataSource (DB) in a CSV format that doesn’t fit our normalized data model.

Our application should perform the necessary transformations to persist said information to the DB, with a matching structure.
