Sklearn Streaming 1
Discover how to set up an end-to-end streaming pipeline that consumes data from Kafka, applies a sklearn model using PySpark structured streaming, and outputs results to a new Kafka topic. Understand the configuration steps for connecting Kafka with Databricks, learn how to use UDFs to validate data transmission, and build modular, scalable workflows ideal for real-time model deployment.
To build an end-to-end streaming pipeline with Kafka, we'll leverage Spark streaming to process and transform data as it arrives. The structured streaming enhancements introduced in Spark 2.3 let us work with DataFrames and Spark SQL while abstracting away many of the complexities of batching and processing datasets.
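To get a feel for this abstraction, here is a minimal sketch that uses Spark's built-in `rate` source so it runs without any external system; the source, bucket logic, and console sink are illustrative assumptions rather than part of our pipeline:

```python
# A minimal sketch of the structured streaming DataFrame API (Spark 2.3+),
# using the built-in "rate" source so no external system is required.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# readStream returns a streaming DataFrame; the same DataFrame/Spark SQL
# operations used for batch data apply, and Spark handles the micro-batching.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# A toy aggregation over the generated "value" column.
counts = stream.groupBy((stream.value % 10).alias("bucket")).count()

query = (counts.writeStream
    .format("console")
    .outputMode("complete")   # aggregations require complete/update output mode
    .start())
```

The key point is that the transformation is declared once and Spark keeps applying it as new micro-batches of data arrive.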
In this and the next lesson, we'll set up a PySpark streaming pipeline that fetches data from a Kafka topic, applies a sklearn model, and writes the output to a new topic. The entire workflow runs as a single, continuously executing Spark application.
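The skeleton below sketches the shape of that pipeline. It is not the final implementation we'll build; the broker address, topic names (`input_topic`, `preds`), model path, and feature names are all placeholders:

```python
# Hedged sketch: consume from Kafka, score with a sklearn model, publish to a new topic.
import json
import pickle

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("sklearn-streaming").getOrCreate()

# Load a pre-trained sklearn model on the driver; it is shipped to executors
# as part of the UDF closure. The path is a placeholder.
model = pickle.load(open("/dbfs/tmp/model.pkl", "rb"))

@udf(returnType=StringType())
def score(value):
    """Parse an incoming JSON record, apply the sklearn model, return a JSON string."""
    record = json.loads(value)
    features = [[record["x1"], record["x2"]]]   # hypothetical feature names
    prediction = float(model.predict_proba(features)[0][1])
    return json.dumps({"id": record.get("id"), "pred": prediction})

# Read from the input topic as a streaming DataFrame.
stream = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "<broker-external-ip>:9092")  # placeholder
    .option("subscribe", "input_topic")
    .load())

# Kafka values arrive as bytes; cast to string, then score each record.
scored = (stream.selectExpr("CAST(value AS STRING) AS value")
                .withColumn("value", score("value")))

# Write the predictions to a new Kafka topic; the sink requires a checkpoint location.
query = (scored.writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "<broker-external-ip>:9092")  # placeholder
    .option("topic", "preds")
    .option("checkpointLocation", "/dbfs/tmp/checkpoint")
    .start())
```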
Kafka with Databricks
In order to get Kafka to work with Databricks, we'll need to edit the Kafka configuration to allow external connections, because Databricks runs in a separate VPC, and potentially a separate cloud, from the Kafka service. Previously, we used the bootstrap approach to refer to brokers using localhost as the IP. On AWS, the Kafka startup script will use the internal IP to listen for ...
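As a sketch of the kind of change involved, the broker's listener settings in `server.properties` typically need to advertise an externally reachable address so that clients outside the VPC, such as a Databricks cluster, can connect; the host name and port below are placeholders, not values from our setup:

```properties
# config/server.properties (sketch; host and port values are placeholders)
# Bind to all interfaces so the broker accepts traffic on the external address.
listeners=PLAINTEXT://0.0.0.0:9092
# Advertise the externally reachable address that Databricks clients should use.
advertised.listeners=PLAINTEXT://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:9092
```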