Sklearn Streaming 1
Discover how to set up an end-to-end streaming pipeline that consumes data from Kafka, applies a sklearn model using PySpark structured streaming, and outputs results to a new Kafka topic. Understand the configuration steps for connecting Kafka with Databricks, learn how to use UDFs to validate data transmission, and build modular, scalable workflows ideal for real-time model deployment.
To build an end-to-end streaming pipeline with Kafka, we'll leverage Spark streaming to process and transform data as it arrives. The structured streaming enhancements introduced in Spark 2.3 let us work with DataFrames and Spark SQL while abstracting away many of the complexities of batching and processing datasets.
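To get a feel for this abstraction, here is a minimal sketch that uses Spark's built-in `rate` source so it runs without any external system; the source, bucket logic, and console sink are illustrative assumptions rather than part of our pipeline:

```python
# A minimal sketch of the structured streaming DataFrame API (Spark 2.3+),
# using the built-in "rate" source so no external system is required.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# readStream returns a streaming DataFrame; the same DataFrame/Spark SQL
# operations used for batch data apply, and Spark handles the micro-batching.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# A toy aggregation over the generated "value" column.
counts = stream.groupBy((stream.value % 10).alias("bucket")).count()

query = (counts.writeStream
    .format("console")
    .outputMode("complete")   # aggregations require complete/update output mode
    .start())
```

The key point is that the transformation is declared once and Spark keeps applying it as new micro-batches of data arrive.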
In this and the next lesson, we'll set up a PySpark streaming pipeline that fetches data from a Kafka topic, applies a sklearn model, and writes the output to a new topic. The entire workflow runs as a single, continuously executing Spark application.
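The skeleton below sketches the shape of that pipeline. It is not the final implementation we'll build; the broker address, topic names (`input_topic`, `preds`), model path, and feature names are all placeholders:

```python
# Hedged sketch: consume from Kafka, score with a sklearn model, publish to a new topic.
import json
import pickle

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("sklearn-streaming").getOrCreate()

# Load a pre-trained sklearn model on the driver; it is shipped to executors
# as part of the UDF closure. The path is a placeholder.
model = pickle.load(open("/dbfs/tmp/model.pkl", "rb"))

@udf(returnType=StringType())
def score(value):
    """Parse an incoming JSON record, apply the sklearn model, return a JSON string."""
    record = json.loads(value)
    features = [[record["x1"], record["x2"]]]   # hypothetical feature names
    prediction = float(model.predict_proba(features)[0][1])
    return json.dumps({"id": record.get("id"), "pred": prediction})

# Read from the input topic as a streaming DataFrame.
stream = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "<broker-external-ip>:9092")  # placeholder
    .option("subscribe", "input_topic")
    .load())

# Kafka values arrive as bytes; cast to string, then score each record.
scored = (stream.selectExpr("CAST(value AS STRING) AS value")
                .withColumn("value", score("value")))

# Write the predictions to a new Kafka topic; the sink requires a checkpoint location.
query = (scored.writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "<broker-external-ip>:9092")  # placeholder
    .option("topic", "preds")
    .option("checkpointLocation", "/dbfs/tmp/checkpoint")
    .start())
```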
Kafka with Databricks
In order to get Kafka to work with Databricks, we'll need to edit the Kafka configuration to allow external connections, because Databricks runs in a separate VPC, and potentially a separate cloud, from the Kafka service. Previously, we used the bootstrap approach to refer to brokers using localhost as the IP. On AWS, the Kafka startup script will use the internal IP to listen for ...
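As a sketch of the kind of change involved, the broker's listener settings in `server.properties` typically need to advertise an externally reachable address so that clients outside the VPC, such as a Databricks cluster, can connect; the host name and port below are placeholders, not values from our setup:

```properties
# config/server.properties (sketch; host and port values are placeholders)
# Bind to all interfaces so the broker accepts traffic on the external address.
listeners=PLAINTEXT://0.0.0.0:9092
# Advertise the externally reachable address that Databricks clients should use.
advertised.listeners=PLAINTEXT://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:9092
```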