UDFs: User-defined Functions

Get introduced to User-Defined Functions, which are used to extend Spark's already extensive functionality.

Extending Spark functionality

User-Defined Functions, or UDFs, allow the Spark developer to expand the possibilities of what Spark already offers in the DataFrame API.

It is also possible to integrate different, custom-made libraries with the Spark API and use them in conjunction with UDFs.

A helpful analogy for UDFs in this context is to think of them as plugins: a core functionality is in place, and interfaces allow extra pieces of functionality to be "plugged" onto it. The core functionality is only aware of the plugins on a surface level and "sees" them as "black boxes."

Naturally, for this to work, the UDFs have to abide by the Spark API contracts exposed through its interfaces, but the dependency and coupling of code ends there.

UDFs and the Spark ecosystem

As the name implies, UDFs are functions: they can receive anywhere from zero arguments up to the current limit of 22, and they produce a return value as their output.
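As a minimal sketch, the snippet below defines a zero-argument UDF and a two-argument UDF in Scala; the function names and return values are illustrative, not part of the lesson.

```scala
import org.apache.spark.sql.functions.udf

// Zero-argument UDF: returns the same constant value for every row.
// (The tag "orders-v1" is a made-up example value.)
val appTag = udf(() => "orders-v1")

// Two-argument UDF: combines two inputs into a single return value.
val fullName = udf((first: String, last: String) => s"$first $last")
```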

By declaring and registering UDFs with Spark, we are able to work on the columns of a DataFrame and apply custom functionality to them.
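The sketch below shows both ways of using a UDF on DataFrame columns: applying it directly through the DataFrame API, and registering it so it can be called from Spark SQL. The DataFrame, column names, and the century function are hypothetical examples, assuming a local SparkSession.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder()
  .appName("udf-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Illustrative DataFrame with two columns.
val users = Seq(("Ada", 1815), ("Alan", 1912)).toDF("name", "birthYear")

// Wrap ordinary Scala logic in a UDF and apply it to a column.
val century = udf((year: Int) => (year - 1) / 100 + 1)
val withCentury = users.withColumn("century", century(col("birthYear")))

// Registering the UDF also makes it callable from Spark SQL.
spark.udf.register("century", (year: Int) => (year - 1) / 100 + 1)
users.createOrReplaceTempView("users")
spark.sql("SELECT name, century(birthYear) AS century FROM users").show()
```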

Let’s reiterate the Spark code execution dynamics. The driver program executes (or delegates) most of the code associated with declaring, defining, and triggering transformations: the plumbing in our code, to use a simple analogy.

The cluster nodes do the heavy lifting and process the volumes of information by performing the actual transformations themselves.

In this context, UDFs are sent as code to the worker nodes, as with any Spark application. Fortunately, Spark takes care of the code transport for us.

However, any custom libraries (jars) that the UDFs use have to be available on the cluster’s nodes beforehand.
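One way to make such jars available is to list them when building the SparkSession, as in the sketch below. The jar path and name are hypothetical; the spark.jars setting distributes the listed jars to the driver and executor classpaths so that UDFs depending on them can run on the worker nodes.

```scala
import org.apache.spark.sql.SparkSession

// Assumes a custom library packaged as /opt/libs/text-tools.jar
// (an illustrative path, not from the lesson).
val spark = SparkSession.builder()
  .appName("udf-with-custom-lib")
  .config("spark.jars", "/opt/libs/text-tools.jar")
  .getOrCreate()
```

The same effect can be achieved at submission time, for example with the --jars option of spark-submit.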

The diagram below provides a visual interpretation:
