
Introduction to Data Transformation

Explore essential data transformation operations in both PySpark and pandas to enhance data processing skills. Understand how to aggregate data, compute statistical summaries, work with date and time, and perform SQL-like joins and pivots. Gain practical insights using PySpark's functions module applied to real datasets, preparing you to handle common tasks confidently.

Overview

PySpark’s and pandas’ native APIs provide almost all of the commonly used data transformation techniques as functions or methods. Because the list of objects, functions, and methods in these APIs is so extensive, we’ll explore only a few of them in this course.

Data transformation operations

The commonly used data transformation operations are as follows:

  • Aggregate a data set using single or multiple conditions.

  • Compute a statistical summary report (mean, median, etc.).

  • Work with dates, times, and timestamps.

  • Use SQL-like expressions in PySpark with the expr function.

  • Perform SQL-like joins to combine multiple DataFrames.

  • Perform Excel-like pivots in DataFrames.
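As a quick preview, the operations above can be sketched in pandas. This is a minimal, self-contained sketch using a hypothetical sales DataFrame (the column names and values are made up for illustration, not taken from the course dataset):

```python
import pandas as pd

# Hypothetical sales data for illustration only.
sales = pd.DataFrame({
    "category": ["Toys", "Games", "Toys", "Games"],
    "units": [10, 5, 7, 3],
    "sold_on": pd.to_datetime(
        ["2023-01-05", "2023-01-06", "2023-02-10", "2023-02-11"]
    ),
})

# Aggregate a data set by one or more keys (here: total units per category).
per_category = sales.groupby("category")["units"].sum()

# Compute a statistical summary (count, mean, std, quartiles, ...).
stats = sales["units"].describe()

# Work with dates: extract the month from a datetime column.
sales["month"] = sales["sold_on"].dt.month

# Excel-like pivot: categories as rows, months as columns.
pivot = sales.pivot_table(
    index="category", columns="month", values="units", aggfunc="sum"
)

# SQL-like join: combine with a lookup table on a shared key.
margins = pd.DataFrame({"category": ["Toys", "Games"], "margin": [0.30, 0.45]})
joined = sales.merge(margins, on="category", how="left")
```

Each pandas method here has a close PySpark counterpart (`groupBy`/`agg`, `describe`, `join`, and `groupBy().pivot()`), which we’ll see applied to real data later in the course.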

Note:

Most of the techniques we use in this chapter are part of the pyspark.sql.functions module. The functions module of pyspark.sql has all the built-in functions available to apply on a DataFrame. We might only need a few of these functions to accomplish any particular project, so a minimal learning approach is preferred. We’ll cover the functions we need for our exploratory data analysis of the “Toys and Games” dataset. The usage of other functions is very similar, which means you’ll understand how to use those functions even if we don’t cover them in our course.