Spark User Defined Functions
Explore how to define and apply Spark User Defined Functions (UDFs) to extend Spark SQL's capabilities with custom code. Understand the benefits of UDFs, work through examples in Scala and PySpark, including vectorized Pandas UDFs, and see key built-in functions for manipulating data.
We have previously seen and worked with Spark’s built-in functions, but Spark also lets users define their own functionality, wrapped inside user-defined functions (UDFs), that can be invoked in Spark SQL. The major benefit of UDFs is reusability. UDFs exist per session and don’t persist in the underlying metastore. Let’s consider a simple function that returns the last two digits of the releaseYear value, e.g., if the function is passed 2021, it returns 21. The function definition and its use are presented below:
val movies = spark.read.format("csv")
.option("header", "true")
.option("samplingRatio", 0.001)
.option("inferSchema", "true")
.load("/data/BollywoodMovieDetail. ...
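Once the DataFrame is loaded, the last-two-digits function can be defined, registered, and invoked from Spark SQL. The sketch below is illustrative: the UDF name `lastTwoDigits`, the view name `movies`, and the `title` column are our assumptions, not from the original; only `releaseYear` appears in the text above.

```scala
// Hedged sketch: register and use a UDF in Spark SQL.
// Assumes a SparkSession named `spark` and the `movies` DataFrame loaded above.

// Ordinary Scala function: 2021 -> 21
val lastTwoDigits = (year: Int) => year % 100

// Register it under a name of our choosing so SQL queries can call it.
spark.udf.register("lastTwoDigits", lastTwoDigits)

// Expose the DataFrame as a temporary view and invoke the UDF in SQL.
// The `title` column is an assumption about the dataset's schema.
movies.createOrReplaceTempView("movies")
spark.sql(
  "SELECT title, lastTwoDigits(releaseYear) AS shortYear FROM movies"
).show(5)
```

Note that `spark.udf.register` makes the function visible only within the current session, which is why, as mentioned above, UDFs don't persist in the metastore.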