- Sklearn Streaming 2

Build a streaming pipeline that applies a sklearn model in the second part of streaming in sklearn.

To implement the streaming model pipeline, we’ll use PySpark with a Python UDF to apply model predictions as new elements arrive.


A Python UDF operates on a single row, while a Pandas UDF operates on a partition of rows. The code for this pipeline is shown in the PySpark snippet below, which first trains a model on the driver node, sets up a data sink for a Kafka stream, defines a UDF for applying an ML model, and then publishes the scores to a new topic as a pipeline output.

Replace {external IP} with the public IP of your machine or EC2 instance.

Get hands-on with 1200+ tech skills courses.