
Introduction to PySpark for Batch Pipelines

Explore how to use PySpark to build large-scale batch model pipelines that process billions of records and millions of users. Understand DataFrame operations, batch predictions on cloud storage in both AWS (S3) and GCP, and how to leverage Pandas UDFs for distributed feature engineering and deep learning in scalable production workflows.

Overview

Spark is a general-purpose computing framework that can scale to massive data volumes. It builds on prior big data tools such as Hadoop and MapReduce while providing significant improvements in the expressivity of the languages it supports. One of the core components of Spark is the resilient distributed dataset (RDD), which enables a cluster of machines to perform workloads in a coordinated and fault-tolerant manner. In more recent versions of Spark, the DataFrame API provides an abstraction on top of RDDs that resembles the same data structure in R and Pandas. PySpark is the Python interface to Spark, and it provides an API for working with large-scale datasets in a distributed computing environment.
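As a minimal sketch of what the DataFrame API looks like from Python, the snippet below starts a Spark session, loads a CSV file, and runs a couple of distributed operations. The file path and column names (`games.csv`, `plays`, `user_id`) are hypothetical placeholders, not part of this chapter's datasets.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session, the entry point for DataFrame operations
spark = SparkSession.builder.appName("intro_example").getOrCreate()

# Read a CSV file into a distributed DataFrame, inferring column types
games_df = spark.read.csv("data/games.csv", header=True, inferSchema=True)

# DataFrame operations look similar to Pandas/R, but execute across the cluster
summary_df = (
    games_df
    .filter(games_df["plays"] > 0)
    .groupBy("user_id")
    .count()
)
summary_df.show(5)
```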

PySpark for data science

PySpark is an extremely valuable tool for data scientists because it can streamline the process of translating prototype models into production-grade model workflows. At Zynga, our data science team owns a number of production-grade systems that provide useful signals to our game and marketing teams. By using PySpark, we’ve been able to reduce the amount of support we need from engineering teams to scale up models from concept to production.

Improving scalability

Up until now in this course, all of the models we've built and deployed have targeted single machines. While we were able to scale models across multiple machines using Lambda, ECS, and GKE, those containers worked in isolation, with no coordination among the nodes in these environments. With PySpark, we can build model workflows that are designed to operate in cluster environments for both model training and model serving.

This means that data scientists can now tackle much larger problems than were previously possible with earlier Python tools. PySpark strikes a nice balance between an expressive programming language and API versus more legacy options such as MapReduce. A general trend is that Hadoop usage is declining as more data science and engineering teams switch to the Spark ecosystem.

In the next chapter, we’ll explore another distributed computing ecosystem for data science called Cloud Dataflow, but for now, Spark is the open-source leader in this space. PySpark was one of my main motivations for switching from R to Python for data science workflows.

Goals

The goal of this chapter is to provide an introduction to PySpark for Python programmers that shows how to build large-scale model pipelines for batch scoring applications, where you may have billions of records and millions of users that need to be scored.

While production-grade systems will typically push results to application databases, in this chapter we'll focus on batch processes that pull data from a data lake and push results back to it for other systems to use. We'll explore pipelines that apply models on both AWS and GCP.

While the datasets used in this chapter rely on AWS and GCP for storage, the Spark environment does not have to run on either of these platforms and instead can run on Azure, other clouds, or on-prem Spark clusters.

Chapter learning outcomes

We’ll cover a variety of different topics in this chapter to show different use cases of PySpark for scalable model pipelines.

After showing how to make data available to Spark on S3, we'll cover some of the basics of PySpark, focusing on DataFrame operations. Next, we'll build out a predictive model pipeline that reads in data from S3, performs batch model predictions, and writes the results back to S3, following the pattern sketched below.
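The sketch below illustrates that read-score-write pattern with Spark MLlib. It assumes a cluster with the S3A connector and credentials configured, and the bucket paths, feature columns, and label column are hypothetical; the chapter's actual pipeline may use a different model and dataset.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("batch_scoring").getOrCreate()

# Read training and scoring data from S3 (Parquet preserves schemas)
train_df = spark.read.parquet("s3a://my-bucket/training/")
score_df = spark.read.parquet("s3a://my-bucket/to_score/")

# Assemble feature columns into the single vector column MLlib expects
assembler = VectorAssembler(inputCols=["feature_1", "feature_2"],
                            outputCol="features")

# Fit a simple model and apply it to the unscored records
model = LogisticRegression(labelCol="label").fit(assembler.transform(train_df))
preds = model.transform(assembler.transform(score_df))

# Write the predictions back to S3 for downstream systems to consume
preds.select("user_id", "prediction").write.mode("overwrite").parquet(
    "s3a://my-bucket/predictions/")
```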

We’ll follow this by demonstrating how a newer feature called Pandas UDFs can be used with PySpark to perform distributed deep learning and feature engineering.
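To preview the idea, here is a hedged sketch of the grouped Pandas UDF pattern (exposed as `applyInPandas` in recent Spark versions): Spark hands each group to an ordinary Python function as a Pandas DataFrame, so per-group feature engineering or model inference can run with familiar libraries. The toy data and column names are made up for illustration.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("pandas_udf_example").getOrCreate()

# Toy events data: one row per (user, amount)
events_df = spark.createDataFrame(
    [("u1", 1.0), ("u1", 3.0), ("u2", 5.0)], ["user_id", "amount"])

result_schema = StructType([
    StructField("user_id", StringType()),
    StructField("total", DoubleType()),
])

# Each user's rows arrive as a regular Pandas DataFrame, so ordinary
# Python/Pandas code (or a deep learning model) can run per group
def aggregate_user(pdf: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame({"user_id": [pdf["user_id"].iloc[0]],
                         "total": [pdf["amount"].sum()]})

# Spark distributes the groups across the cluster and collects the results
features_df = events_df.groupBy("user_id").applyInPandas(aggregate_user,
                                                         result_schema)
features_df.show()
```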

To conclude the chapter, we'll build another batch model pipeline, this time using GCP, and discuss how to productize workflows in a Spark ecosystem.