Transforming DataFrames in Databricks

Explore how to transform DataFrames in Databricks by selecting relevant columns, filtering rows, adding new columns, renaming fields, sorting data, and performing aggregations. Understand Spark's lazy evaluation concept and learn to chain transformations for efficient data processing in PySpark notebooks.

In the previous lessons, you learned how to create and inspect a DataFrame. But real-world data work does not stop at viewing data. Most datasets are messy, oversized, or contain irrelevant information. Before analysis or reporting, you must transform the data.

Data transformation includes:

  • Selecting specific columns

  • Filtering rows

  • Creating new columns

  • Renaming columns

  • Sorting data

  • Dropping unnecessary fields

This lesson will walk you through each transformation step inside your Databricks notebook.

Create a working DataFrame

Before transforming data, we need a dataset. In production, data may come from Delta tables or cloud storage. For learning, we will manually create a structured dataset.

Add a new notebook cell and run the following code:

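# Sample rows: (name, age, city, salary)
data = [
    ("Ali", 25, "Lahore", 50000),
    ("Sara", 30, "Karachi", 60000),
    ("Ahmed", 35, "Islamabad", 70000),
    ("Fatima", 28, "Lahore", 65000),
    ("Usman", 40, "Karachi", 80000),
]

# Column names, in the same order as the fields in each tuple
columns = ["name", "age", "city", "salary"]

# `spark` is the SparkSession that Databricks notebooks provide automatically
df = spark.createDataFrame(data, columns)

# Print the rows as a text table (show() displays up to 20 rows by default)
df.show()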
Creating and displaying a sample DataFrame inside Databricks

After you run this cell, Databricks executes the Spark job and prints the DataFrame as a table below the cell.

Output showing the full DataFrame
...