Search⌘ K
AI Features

- Sklearn Streaming 2

Explore how to build and run scalable streaming pipelines that apply machine learning model predictions in real time using PySpark and Kafka. Learn to define Python UDFs for streaming data, set up Kafka streams, and test message consumption to achieve near real-time processing and scalable deployment.

To implement the streaming model pipeline, we’ll use PySpark with a Python UDF to apply model predictions as new elements arrive.

Example

A Python UDF operates on a single row, while a Pandas UDF operates on a partition of rows. The code for this pipeline is shown in the PySpark snippet below, which first trains a model on the driver node, sets up a data sink for a Kafka stream, defines a UDF for applying an ML model, and then publishes the scores to a new topic as a pipeline output.

... ...