
Working with DataFrame's Schemas

Explore how to inspect and transform DataFrame schemas in Spark using Java. Understand methods to drop and rename columns, apply normalization, and prepare data to match database tables while handling CSV input that diverges from source models. This lesson helps you align data formats with normalized database structures for better integration.

Let’s imagine a scenario where the requirement is to ingest a CSV file whose format doesn’t match our DataSource model. Changing an existing data model can be costly and may impact other applications that feed off the database.

Fortunately, we already know how to modify data once it resides in a Spark DataFrame and change its structure (for example, by adding a column, as we did in one of our previous lessons). The API also offers the possibility of removing columns, among other useful operations. Let’s see how we can achieve this.

Working on ingested data

The previous hypothetical requirement can be defined as follows:


A client is sending data, to be stored in our DataSource (DB), in a CSV format that doesn’t fit our normalized data model.

Our application should perform the necessary transformations to persist that information to the DB with a matching structure.


The following widget contains the codebase for this lesson:

mvn install exec:exec
Project to ingest from CSV and work on DataFrame schemas

To ease our way into understanding the code snippet and the application logic, let’s illustrate the application’s flow with a simple diagram.

Note: If curiosity “sparks” within the reader, the following link can serve as a gentle introduction to what data normalization is, although it’s technically not necessary for this course.

The application ingests a CSV file (on the left of the diagram) using the same API methods we’ve used before. Right after this, it calls two transformations on the DataFrame.
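To make this concrete, here is a minimal sketch of that ingestion step in Java. The file path data/authors.csv and the reader options are assumptions for illustration; the project’s actual file name and options may differ.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Start (or reuse) a local Spark session.
SparkSession spark = SparkSession.builder()
        .appName("CSV ingestion and schema transformations")
        .master("local[*]")
        .getOrCreate();

// Read the client's CSV into a DataFrame.
// Assumption: the file has a header row and lives at data/authors.csv.
Dataset<Row> df = spark.read()
        .format("csv")
        .option("header", "true")       // first line holds column names
        .option("inferSchema", "true")  // let Spark guess column types
        .load("data/authors.csv");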

But first, we need to learn some new API invocations to inspect the schema, or structure, of the DataFrame. This is necessary because our database has been normalized to split information into related tables, rather than keeping a single massive, hard-to-maintain table.
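Spark exposes a few simple calls for exactly that. A short sketch, continuing with the df from the ingestion step above:

import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Print the schema as a readable tree: column names, types, and nullability.
df.printSchema();

// Or inspect the structure programmatically.
StructType schema = df.schema();
for (StructField field : schema.fields()) {
    System.out.println(field.name() + " -> " + field.dataType());
}

// The plain column names are also available as an array.
String[] columnNames = df.columns();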

Afterwards, we can apply transformations to drop columns and separate the information into two DataFrames that resemble the two tables in the database. ...
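For illustration only: assuming the ingested CSV carries hypothetical columns name, link, title, and publication_date (the real column and table names come from the project’s data model), dropping columns and splitting the data into two table-shaped DataFrames could look like this:

import static org.apache.spark.sql.functions.col;

// A DataFrame shaped like a hypothetical "authors" table: remove the
// book-related columns and rename "link" to a normalized column name.
Dataset<Row> authorsDf = df
        .drop("title", "publication_date")
        .withColumnRenamed("link", "wikipedia_link");

// A second DataFrame shaped like a hypothetical "books" table.
Dataset<Row> booksDf = df.select(col("title"), col("publication_date"));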