UDF in Action

Learn how to create and apply user-defined functions (UDFs) in PySpark for custom transformations on DataFrames. This lesson covers defining a Python function, wrapping it as a UDF with a complex return type, adding the result to a DataFrame as a new structured column, and accessing the nested fields of that column.

Create an in-memory DataFrame

To illustrate a more complex application of UDFs, let’s first create an in-memory DataFrame as follows:

wget http://deepyeti.ucsd.edu/jianmo/amazon/sample/sample_Home_and_Kitchen_5.json
wget http://deepyeti.ucsd.edu/jianmo/amazon/sample/sample_meta_Home_and_Kitchen.json
wget http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Toys_and_Games_5.json.gz

Example of user-defined functions with an in-memory DataFrame

After successful code execution, we’ll see the message “Code Executed Successfully” in the terminal.
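
As a rough illustration of what creating an in-memory DataFrame can look like, here is a minimal sketch. The column names, the sample rows, and the helper names `build_dataframe` and `main` are all hypothetical and chosen only for this example; the lesson's actual data may differ. The sketch assumes a local Spark installation.

```python
# Hypothetical sample data, invented for illustration only.
COLUMNS = ["Name", "Address"]
ROWS = [
    ("Alice", "12 Oak Street, Springfield 62704"),
    ("Bob", "7 Hill Road, Shelbyville"),
    ("Carol", None),  # a missing address, to exercise null handling later
]


def build_dataframe(spark):
    """Create an in-memory DataFrame from a list of Python tuples."""
    return spark.createDataFrame(ROWS, schema=COLUMNS)


def main():
    # Imported here so the sample data can be inspected without starting Spark.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[1]")          # assumption: run Spark locally on one core
        .appName("udf-in-action")    # hypothetical application name
        .getOrCreate()
    )
    try:
        df = build_dataframe(spark)
        df.printSchema()
        df.show(truncate=False)
    finally:
        spark.stop()
```

Calling `main()` starts a local session, builds the DataFrame, and prints its schema and rows. Passing only column names as the schema lets Spark infer the column types from the data.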

Adding a new column to DataFrame

Let’s say we want to add a new column that summarizes the Address column of the DataFrame. The new summary column should have three fields: the length of the full address, a boolean that indicates whether a postcode is present, and the postcode itself. We can write the UDF as follows:

Apply a UDF to a DataFrame

After successful code execution, we’ll see the message “Code Executed Successfully” in the terminal.
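
The following is a minimal sketch of a UDF that returns the three-field summary described above as a struct. The function name `make_summary` comes from the lesson, but its body here is an assumption, as are the helper `add_summary_column`, the field names `length`, `has_postcode`, and `postcode`, the output column name `Summary`, and the rule that a postcode is a standalone run of five digits.

```python
import re

# Assumption: a postcode is a standalone run of exactly five digits.
POSTCODE_PATTERN = re.compile(r"\b\d{5}\b")


def make_summary(address):
    """Return (length, has_postcode, postcode) for one address string."""
    if address is None:
        return (0, False, None)
    match = POSTCODE_PATTERN.search(address)
    postcode = match.group(0) if match else None
    return (len(address), match is not None, postcode)


def add_summary_column(df):
    """Wrap make_summary as a UDF and append its result as a struct column."""
    # Imported here so make_summary stays usable as plain Python.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import (
        BooleanType,
        IntegerType,
        StringType,
        StructField,
        StructType,
    )

    # The UDF's return type: one struct with three named fields,
    # in the same order as the tuple returned by make_summary.
    summary_schema = StructType([
        StructField("length", IntegerType()),
        StructField("has_postcode", BooleanType()),
        StructField("postcode", StringType()),
    ])
    make_summary_udf = udf(make_summary, summary_schema)
    return df.withColumn("Summary", make_summary_udf(col("Address")))
```

Given a DataFrame `df` with a string column named `Address`, `add_summary_column(df)` returns a new DataFrame with the extra `Summary` column. Its nested fields can then be read with dot notation, for example `select("Summary.postcode", "Summary.has_postcode")`.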

We first define a Python function, make_summary. This function ...