UDF in Action
Explore how to create and apply user-defined functions (UDFs) in PySpark for custom transformations on DataFrames. Understand defining Python functions wrapped as UDFs with complex return types, adding new structured columns, and accessing nested fields. This lesson enables handling sophisticated data summaries within PySpark efficiently.
Create an in-memory DataFrame
To illustrate a more complex application of UDFs, let’s first create an in-memory DataFrame as follows:
wget http://deepyeti.ucsd.edu/jianmo/amazon/sample/sample_Home_and_Kitchen_5.json
wget http://deepyeti.ucsd.edu/jianmo/amazon/sample/sample_meta_Home_and_Kitchen.json
wget http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Toys_and_Games_5.json.gz
After successful code execution, we’ll see the message “Code Executed Successfully” in the terminal.
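As a minimal sketch of such an in-memory DataFrame: only the Address column is named in this lesson, so the Id and Name columns, the sample rows, and the helper names `get_spark` and `build_dataframe` are hypothetical and chosen purely for illustration.

```python
# Hypothetical sample rows; the names and addresses are invented for illustration.
ROWS = [
    (1, "Alice", "12 High Street, London SW1A 1AA"),
    (2, "Bob", "5 Mill Lane, Leeds"),  # no postcode in this address
    (3, "Carol", "8 Park Road, Manchester M1 1AE"),
]

# Only "Address" comes from the lesson; "Id" and "Name" are assumed columns.
COLUMNS = ["Id", "Name", "Address"]


def get_spark():
    """Start (or reuse) a local Spark session."""
    # Imported lazily so the sample data above can be inspected without Spark.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master("local[1]")
        .appName("udf-in-action")
        .getOrCreate()
    )


def build_dataframe(spark):
    """Create a DataFrame from local Python rows; Spark infers the column types."""
    return spark.createDataFrame(ROWS, schema=COLUMNS)
```

With PySpark installed, `df = build_dataframe(get_spark())` followed by `df.show(truncate=False)` should display the three rows.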
Adding a new column to DataFrame
Let’s say we want to add a new column that summarizes the Address column of the DataFrame. The new summary column should have three fields: the length of the full address, a Boolean indicating whether a postcode is present, and the postcode itself. We can write the UDF as follows:
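One possible way to write this is sketched below. It is not necessarily the lesson's own implementation: the UK-style postcode pattern, the field names `length`, `has_postcode`, and `postcode`, the output column name `Summary`, and the helper `add_summary_column` are all assumptions made for illustration.

```python
import re
from typing import Optional, Tuple

# Assumption: postcodes look like UK postcodes, e.g. "SW1A 1AA" or "M1 1AE".
POSTCODE_RE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}\b")


def make_summary(address: Optional[str]) -> Tuple[int, bool, Optional[str]]:
    """Return (address length, whether a postcode was found, the postcode)."""
    if address is None:
        # Design choice for this sketch: treat a missing address as empty.
        return (0, False, None)
    match = POSTCODE_RE.search(address)
    postcode = match.group(0) if match else None
    return (len(address), postcode is not None, postcode)


def add_summary_column(df):
    """Wrap make_summary as a UDF with a struct return type and apply it."""
    # Imported lazily so make_summary can be tested as plain Python.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import (
        BooleanType, IntegerType, StringType, StructField, StructType,
    )

    # The returned tuple is mapped onto these fields by position.
    summary_schema = StructType([
        StructField("length", IntegerType()),
        StructField("has_postcode", BooleanType()),
        StructField("postcode", StringType()),
    ])
    make_summary_udf = udf(make_summary, summary_schema)
    return df.withColumn("Summary", make_summary_udf(col("Address")))
```

Given a DataFrame `df` with an Address column, `add_summary_column(df).select("Address", "Summary.length", "Summary.postcode")` reads the nested fields using dot notation.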
We first define a Python function, make_summary. This function ...