Sklearn Streaming 1

In this first part of streaming with sklearn, we'll consume messages from Kafka in PySpark.

To build an end-to-end streaming pipeline with Kafka, we’ll leverage Spark Streaming to process and transform data as it arrives. The Structured Streaming enhancements introduced in Spark 2.3 let us work with dataframes and Spark SQL while abstracting away many of the complexities of batching and processing data sets.
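As a rough sketch of what this looks like, the snippet below subscribes to a Kafka topic as a streaming dataframe. The broker address and topic name are placeholders, and outside of Databricks this also requires the spark-sql-kafka connector package on the classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sklearn-streaming").getOrCreate()

# Subscribe to the input topic; Kafka delivers each record as binary
# key/value columns plus metadata (topic, partition, offset, timestamp).
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "<external-ip>:9092")  # placeholder broker
      .option("subscribe", "dsp")                                # placeholder topic
      .load())

# Kafka values arrive as bytes; cast them to strings before parsing.
messages = df.selectExpr("CAST(value AS STRING) AS value")
```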

In this and the next lesson, we’ll set up a PySpark streaming pipeline that fetches data from a Kafka topic, applies a sklearn model, and writes the output to a new topic. The entire workflow is a single DAG that runs continuously and processes messages from a Kafka service.
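A minimal sketch of how these pieces fit together is shown below, building on the `messages` streaming dataframe from the previous snippet. The model, feature logic, output topic, and checkpoint path are all placeholders; in the real pipeline the model would be a trained sklearn estimator loaded from storage rather than fit inline.

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf, struct, to_json
from pyspark.sql.types import DoubleType
from sklearn.linear_model import LogisticRegression

# Placeholder model: stands in for a trained sklearn model that would
# normally be loaded from storage (e.g. with joblib).
model = LogisticRegression().fit([[0.0], [1.0]], [0, 1])

@pandas_udf(DoubleType())
def predict(value):
    # Placeholder feature logic: treat each message value as a single
    # numeric feature and return the model's prediction per row.
    features = pd.DataFrame({"x": value.astype(float)})
    return pd.Series(model.predict(features).astype(float))

# Score the streaming dataframe of Kafka messages.
scored = messages.withColumn("prediction", predict(messages["value"]))

# Serialize each row to JSON and continuously publish it to an output topic.
query = (scored
         .select(to_json(struct("value", "prediction")).alias("value"))
         .writeStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "<external-ip>:9092")  # placeholder broker
         .option("topic", "preds")                                  # placeholder output topic
         .option("checkpointLocation", "/tmp/kafka_checkpoint")     # placeholder path
         .start())
```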

Kafka with Databricks

In order to get Kafka to work with Databricks, we’ll need to edit the Kafka configuration to allow external connections, because Databricks runs in a separate VPC, and potentially a different cloud, than the Kafka service. Previously, we used the bootstrap approach to refer to brokers using localhost as the IP. On AWS, the Kafka startup script uses the internal IP to listen for connections, so to enable connections from remote machines we’ll need to update the configuration to use the external IP, as shown below.
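As a sketch, the relevant entries in Kafka’s `config/server.properties` might look like the following. This assumes a plaintext listener on the default port 9092, and the external IP is a placeholder for your instance’s public address.

```
# Bind on all interfaces so remote clients can reach the broker.
listeners=PLAINTEXT://0.0.0.0:9092

# Advertise the instance's external IP so remote clients connect back correctly.
advertised.listeners=PLAINTEXT://<external-ip>:9092
```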
