- Pandas UDFs

Introduction to Pandas UDFs and its example.

We'll cover the following

What are Pandas UDFs?
Example

Code
Results

While PySpark provides a great deal of functionality for working with dataframes, it often lacks core functionality provided in Python libraries, such as the curve-fitting functions in SciPy. While it’s possible to use the toPandas function to convert dataframes into the Pandas format for Python libraries, this approach breaks when using large datasets.

What are Pandas UDFs?

Pandas UDFs are a newer feature in PySpark that help data scientists work around this problem, by distributing the Pandas conversion across the worker nodes in a Spark cluster. With a Pandas UDF, you define a group by an operation to partition the dataset into dataframes that are small enough to fit into the memory of worker nodes. Then, you author a function that takes a Pandas data frame as the input parameter and returns a transformed Pandas data frame as the result.

Behind the scenes, PySpark uses the PyArrow library to efficiently translate dataframes from Spark to Pandas and back from Pandas to Spark. This approach enables Python libraries, such as Keras, to be scaled up to a large cluster of machines.

Get hands-on with 1200+ tech skills courses.

Introduction to Building Scalable Model Pipelines

Models as Web Endpoints

Models as Serverless Functions

Create an Echo Function in Lambda

Working with S3 in Lambda

Working with API in Lambda

Containers for Reproducible Models

Working with AWS Container Registry

Workflow Tools for Model Pipelines

PySpark for Batch Pipelines

Cloud Dataflow for Batch Modeling

Streaming Model Workflows

Course Conclusion

- Pandas UDFs

What are Pandas UDFs?