Transforming DataFrames in Databricks

Explore how to transform DataFrames in Databricks using PySpark basics. Learn to select relevant columns, filter rows, add and rename columns, sort and drop data, and perform aggregations. Understand how lazy evaluation in Spark helps keep your data workflows efficient.

In the previous lessons, you learned how to create and inspect a DataFrame. But real-world data work does not stop at viewing data. Most datasets are messy, oversized, or contain irrelevant information. Before analysis or reporting, you must transform the data.

Data transformation includes:

  • Selecting specific columns

  • Filtering rows

  • Creating new columns

  • Renaming columns

  • Sorting data

  • Dropping unnecessary fields

This lesson will walk you through each transformation step inside your Databricks notebook.

Create a working DataFrame

Before transforming data, we need a dataset. In production, data may come from Delta tables or cloud storage. For learning, we will manually create a structured dataset.

Add a new notebook cell and run the following code:

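# Sample rows: (name, age, city, salary)
data = [
    ("Ali", 25, "Lahore", 50000),
    ("Sara", 30, "Karachi", 60000),
    ("Ahmed", 35, "Islamabad", 70000),
    ("Fatima", 28, "Lahore", 65000),
    ("Usman", 40, "Karachi", 80000)
]
# Column names, in the same order as the values in each tuple
columns = ["name", "age", "city", "salary"]
# Build the DataFrame; Spark infers each column's type from the Python values
df = spark.createDataFrame(data, columns)
# Print the rows as a text table
df.show()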
Creating and displaying a sample DataFrame inside Databricks

After running this cell, Databricks will execute the Spark job and display the DataFrame in tabular format below the cell.

Output showing the full DataFrame
...