Sklearn Streaming 2
Explore how to build and run scalable streaming pipelines that apply machine learning model predictions in real time using PySpark and Kafka. Learn to define Python UDFs for streaming data, set up Kafka streams, and test message consumption to achieve near real-time processing and scalable deployment.
To implement the streaming model pipeline, we’ll use PySpark with a Python UDF to apply model predictions as new elements arrive.
Example
A Python UDF operates on a single row, while a Pandas UDF operates on a partition of rows. The code for this pipeline is shown in the PySpark snippet below, which first trains a model on the driver node, reads messages from a Kafka stream, defines a UDF that applies the ML model to each message, and then publishes the resulting scores to a new topic as the pipeline output.
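The row-versus-partition distinction can be seen without a Spark cluster. The sketch below, with a toy model and made-up data, shows the two function shapes: the first is the logic a plain Python UDF would wrap (one row per call), the second is the logic a Pandas UDF would wrap (a whole batch of rows per call).

```python
# Sketch contrasting row-at-a-time vs. partition-at-a-time prediction logic.
# The model, feature values, and labels are toy examples for illustration.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Train a toy model on the driver node.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = [0, 0, 1, 1]
model = LogisticRegression().fit(X, y)

def predict_row(x1: float, x2: float) -> float:
    # Row-at-a-time logic: what a plain Python UDF wraps,
    # e.g. udf(predict_row, DoubleType()) in PySpark.
    return float(model.predict_proba([[x1, x2]])[0, 1])

def predict_batch(x1: pd.Series, x2: pd.Series) -> pd.Series:
    # Vectorized logic over a partition of rows: what a Pandas UDF wraps,
    # e.g. pandas_udf("double")(predict_batch) in PySpark.
    features = np.column_stack([x1.to_numpy(), x2.to_numpy()])
    return pd.Series(model.predict_proba(features)[:, 1])
```

Both functions return the same scores; the Pandas version amortizes serialization and interpreter overhead across the whole batch, which is why it is usually much faster for model scoring.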
Replace `{external IP}` with the public IP address of your machine or EC2 instance.
The script first trains a logistic regression model using data fetched from GitHub. The model object is created on the driver node, but is copied to the worker nodes when used by the UDF.
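The copy to the workers happens through serialization: Spark pickles the UDF's closure, including the captured model object, and ships it with each task. The sketch below, using a toy model, checks that a scikit-learn model survives this pickle round trip unchanged.

```python
# A driver-side model reaches workers via serialization: Spark pickles the
# UDF's closure (including the captured model) and sends it to executors.
# This sketch verifies the round trip with a toy model (data is made up).
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = [0, 0, 1, 1]
model = LogisticRegression().fit(X, y)

payload = pickle.dumps(model)         # what gets shipped to a worker
remote_model = pickle.loads(payload)  # what the worker deserializes

# The deserialized copy produces identical predictions.
assert np.allclose(model.predict_proba(X), remote_model.predict_proba(X))
```

For large models, `spark.sparkContext.broadcast(model)` is a common refinement: it ships the object once per executor rather than once per task, and the UDF reads it back through the broadcast variable's `.value` attribute.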
Defining UDF
The next step is to ...