
Introduction to User-defined Functions

Explore how to write user-defined functions in PySpark to extend data transformations beyond the built-in methods. Learn how to translate Python data structures into PySpark types, decorate functions with type annotations, and apply them efficiently to PySpark DataFrames.


Overview

The majority of the use cases we encounter in day-to-day analysis or data engineering work can be solved with the methods and functions provided by PySpark's SQL or DataFrame API. When the built-in methods are not enough, we can write our own function and use it as a custom transformation. Writing user-defined functions requires a deeper understanding of how a pure Python data structure is represented in PySpark. The return type of a user-defined function (UDF) must be static, so we have to declare the return data structure ourselves as a PySpark type. Moreover, UDFs are among the most expensive (least optimized) operations in Spark, so we use them only when we have no other choice.
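As a rough illustration of this trade-off, the sketch below contrasts a built-in column function with a UDF whose return type is declared explicitly as a PySpark type. It assumes a local SparkSession and a small toy DataFrame; the column names and the double_score function are hypothetical, not part of this lesson's dataset.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# Hypothetical session and data, used only for illustration.
spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([("alice", 3), ("bob", 12)], ["name", "score"])

# Preferred: a built-in function, which Spark can fully optimize.
df_builtin = df.withColumn("name_upper", F.upper("name"))

# Fallback: a UDF. The return type must be declared as a PySpark type
# (IntegerType here) because Spark cannot infer it from the Python code.
def double_score(score):
    return score * 2

double_score_udf = F.udf(double_score, returnType=IntegerType())
df_udf = df.withColumn("score_doubled", double_score_udf("score"))
```

Both calls produce a new column, but the built-in version stays inside Spark's optimized execution engine, while the UDF has to ship each value to a Python process and back.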

User-defined functions

PySpark is the Python API for Apache Spark, a large-scale data processing engine written in Scala. Before we define any UDFs, we need to go over the types PySpark provides and how they map to the underlying Scala/JVM objects. A UDF written in PySpark runs as Python code, but the values it receives and returns have to be converted between Python objects and Spark's JVM representation, which is why every UDF must declare its return type using PySpark types. By the end of this chapter, we should be able to do the following:

  • Translate a Python data structure into a data structure that is compatible with a PySpark DataFrame.

  • Decorate a pure Python function with the proper PySpark type annotation.

  • Apply a UDF to a PySpark DataFrame, as sketched in the example after this list.
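To preview how these three steps fit together, here is a minimal sketch, again assuming a local SparkSession; the split_names function and the sample data are hypothetical. It maps a Python list of strings to ArrayType(StringType()), attaches that PySpark type with the @udf decorator, and applies the result to a DataFrame column.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([("alice bob",), ("carol",)], ["names"])

# 1. A Python list of strings corresponds to ArrayType(StringType()) in PySpark.
# 2. The @udf decorator attaches that PySpark return type to the function.
@udf(returnType=ArrayType(StringType()))
def split_names(names):
    return names.split(" ")

# 3. Apply the UDF to a DataFrame column like any built-in function.
df.withColumn("name_list", split_names("names")).show(truncate=False)
```

The rest of the chapter walks through each of these steps in detail.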