Productizing PySpark
Explore how to productize PySpark pipelines by scheduling batch jobs with workflow, cloud, and vendor tools. Understand the differences between ephemeral and persistent clusters and how to implement quality checks and monitoring to ensure reliable, scalable model pipelines in production environments.
Scheduling
Once you’ve tested a batch model pipeline in a notebook environment, there are a few different ways to schedule the pipeline to run on a recurring basis.
For example, you may want a churn prediction model for a mobile game to run every morning and publish its scores to an application database. As with the workflow tools we covered in the previous chapter, a PySpark pipeline should have monitoring in place to catch any failures that occur.
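To make the churn example concrete, the daily scoring step itself might look like the following minimal sketch. It assumes a hypothetical player-features table, a previously saved Spark ML pipeline model, and a Postgres application database reachable over JDBC; all names, paths, and connection details are placeholders, not part of the original lesson.

```python
# Minimal sketch of a daily batch scoring job (hypothetical names throughout).
from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel
from pyspark.ml.functions import vector_to_array

spark = SparkSession.builder.appName("churn_scoring").getOrCreate()

# Load the features for the players to score (hypothetical table name).
players = spark.read.table("game_analytics.daily_player_features")

# Load a previously trained Spark ML pipeline model (hypothetical path).
model = PipelineModel.load("s3://models/churn/latest")

# Apply the model and keep only the columns the application needs.
# vector_to_array converts the probability vector so it can be written out.
scores = (model.transform(players)
          .withColumn("churn_probability", vector_to_array("probability")[1])
          .select("player_id", "churn_probability"))

# Publish the scores to the application database over JDBC
# (placeholder URL and credentials; use a secrets manager in practice).
(scores.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/app")
    .option("dbtable", "churn_scores")
    .option("user", "spark_writer")
    .option("password", "***")
    .mode("overwrite")
    .save())
```

A script like this is what the scheduling approaches below would launch each morning, typically via `spark-submit` or a managed job submission API.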
Techniques
There are a few different approaches to scheduling PySpark jobs: ...