Transforming DataFrames in Databricks
Explore how to transform DataFrames in Databricks by selecting relevant columns, filtering rows, adding new columns, renaming fields, sorting data, and performing aggregations. Understand Spark's lazy evaluation concept and learn to chain transformations for efficient data processing in PySpark notebooks.
In the previous lessons, you learned how to create and inspect a DataFrame. But real-world data work does not stop at viewing data. Most datasets are messy, oversized, or contain irrelevant information. Before analysis or reporting, you must transform the data.
Data transformation includes:
Selecting specific columns
Filtering rows
Creating new columns
Renaming columns
Sorting data
Dropping unnecessary fields
This lesson will walk you through each transformation step inside your Databricks notebook.
Create a working DataFrame
Before transforming data, we need a dataset. In production, data may come from Delta tables or cloud storage. For learning, we will manually create a structured dataset.
Add a new notebook cell and run the following code:
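# Sample employee records: (name, age, city, salary)
data = [
    ("Ali", 25, "Lahore", 50000),
    ("Sara", 30, "Karachi", 60000),
    ("Ahmed", 35, "Islamabad", 70000),
    ("Fatima", 28, "Lahore", 65000),
    ("Usman", 40, "Karachi", 80000),
]

# Column names, in the same order as the fields in each tuple
columns = ["name", "age", "city", "salary"]

# spark is the SparkSession that Databricks notebooks provide automatically
df = spark.createDataFrame(data, columns)

# Print the rows as a table
df.show()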
After running this cell, Databricks will execute the Spark job and display the DataFrame in tabular format below the cell.