Transforming DataFrames in Databricks
Explore how to transform DataFrames in Databricks using PySpark basics. Learn to select relevant columns, filter rows, add and rename columns, sort and drop data, and perform aggregations. Understand lazy evaluation in Spark to optimize your data workflows efficiently.
In the previous lessons, you learned how to create and inspect a DataFrame. Real-world data work does not stop at viewing data, though. Datasets are often messy, oversized, or full of irrelevant fields, so you must transform the data before analysis or reporting.
Data transformation includes:
Selecting specific columns
Filtering rows
Creating new columns
Renaming columns
Sorting data
Dropping unnecessary fields
This lesson will walk you through each transformation step inside your Databricks notebook.
Create a working DataFrame
Before transforming data, we need a dataset. In production, data may come from Delta tables or cloud storage. For learning, we will manually create a structured dataset.
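Add a new notebook cell and run the following code. Databricks notebooks provide the spark session object automatically, so no extra setup is required: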
data = [
    ("Ali", 25, "Lahore", 50000),
    ("Sara", 30, "Karachi", 60000),
    ("Ahmed", 35, "Islamabad", 70000),
    ("Fatima", 28, "Lahore", 65000),
    ("Usman", 40, "Karachi", 80000),
]
columns = ["name", "age", "city", "salary"]
df = spark.createDataFrame(data, columns)
df.show()
When you run this cell, df.show() triggers a Spark job and prints the five rows as a text table below the cell.