UDFs: User-defined Functions

Get introduced to User-Defined Functions, which are used to extend Spark's already extensive functionality.

Extending Spark functionality

User-Defined Functions, or UDFs, allow the Spark developer to expand the possibilities of what Spark already offers in the DataFrame API.

It is also possible to integrate different, custom-made libraries with the Spark API and use them in conjunction with UDFs.

A helpful analogy for UDFs in this context is to think of them as plugins: a core functionality is in place, and interfaces allow extra pieces of functionality to be "plugged" onto it. The core functionality is only aware of the plugins on a surface level and "sees" them as "black boxes."

Naturally, for this to work, the UDFs have to abide by the Spark API contracts exposed through its interfaces, but the dependency and coupling of code ends there.

UDFs and the Spark ecosystem

As the name implies, UDFs are functions: they can receive anywhere from zero arguments up to the current limit of 22, and they produce a return value as their output.
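As a minimal sketch, the snippet below defines a zero-argument UDF and a two-argument UDF in Scala; the function names and return values are illustrative, not part of the lesson.

```scala
import org.apache.spark.sql.functions.udf

// Zero-argument UDF: returns the same constant value for every row.
// (The tag "orders-v1" is a made-up example value.)
val appTag = udf(() => "orders-v1")

// Two-argument UDF: combines two inputs into a single return value.
val fullName = udf((first: String, last: String) => s"$first $last")
```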

By declaring and registering UDFs with Spark, we are able to work on the columns of a DataFrame and apply custom functionality to them.
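The sketch below shows both ways of using a UDF on DataFrame columns: applying it directly through the DataFrame API, and registering it so it can be called from Spark SQL. The DataFrame, column names, and the century function are hypothetical examples, assuming a local SparkSession.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder()
  .appName("udf-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Illustrative DataFrame with two columns.
val users = Seq(("Ada", 1815), ("Alan", 1912)).toDF("name", "birthYear")

// Wrap ordinary Scala logic in a UDF and apply it to a column.
val century = udf((year: Int) => (year - 1) / 100 + 1)
val withCentury = users.withColumn("century", century(col("birthYear")))

// Registering the UDF also makes it callable from Spark SQL.
spark.udf.register("century", (year: Int) => (year - 1) / 100 + 1)
users.createOrReplaceTempView("users")
spark.sql("SELECT name, century(birthYear) AS century FROM users").show()
```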

Let’s reiterate the Spark code execution dynamics. The driver program executes (or delegates) most of the code associated with declaring, defining, and triggering transformations: the plumbing in our code, to use a simple analogy.

The cluster nodes do the heavy lifting and process the volumes of information by performing the actual transformations themselves.

In this context, UDFs are sent as code to the worker nodes, as with any Spark application. Fortunately, Spark takes care of the code transport for us.

However, any custom libraries (jars) that the UDFs use have to be available on the cluster’s nodes beforehand.
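One way to make such jars available is to list them when building the SparkSession, as in the sketch below. The jar path and name are hypothetical; the spark.jars setting distributes the listed jars to the driver and executor classpaths so that UDFs depending on them can run on the worker nodes.

```scala
import org.apache.spark.sql.SparkSession

// Assumes a custom library packaged as /opt/libs/text-tools.jar
// (an illustrative path, not from the lesson).
val spark = SparkSession.builder()
  .appName("udf-with-custom-lib")
  .config("spark.jars", "/opt/libs/text-tools.jar")
  .getOrCreate()
```

The same effect can be achieved at submission time, for example with the --jars option of spark-submit.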

The diagram below provides a visual interpretation:
