Introduction to PySpark for Batch Pipelines
Explore how to use PySpark to build large-scale batch model pipelines that process billions of records and millions of users. Understand DataFrame operations, batch predictions on cloud storage such as S3 and GCS, and how to leverage Pandas UDFs for distributed feature engineering and deep learning in scalable production workflows.
Overview
Spark is a general-purpose computing framework that can scale to massive data volumes. It builds upon prior big data tools such as Hadoop and MapReduce, while providing significant improvements in the expressivity of the languages it supports. One of Spark's core components is the resilient distributed dataset (RDD), which enables clusters of machines to perform workloads in a coordinated and fault-tolerant manner. In more recent versions of Spark, the DataFrame API provides an abstraction on top of RDDs that resembles the data frames found in R and Pandas. PySpark is the Python interface to Spark, and it provides an API for working with large-scale datasets in a distributed computing environment.
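As a minimal sketch of what the DataFrame API looks like in PySpark, the snippet below creates a Spark session, loads a CSV into a distributed DataFrame, and runs a simple aggregation. The bucket path and column names (user_id, session_length) are placeholders for illustration, not values from this lesson.

# Minimal PySpark DataFrame sketch; path and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Create (or reuse) a Spark session for this application.
spark = SparkSession.builder.appName("batch_pipeline_intro").getOrCreate()

# Load a CSV into a distributed DataFrame, letting Spark infer the schema.
plays_df = spark.read.csv("s3://example-bucket/game_plays.csv",
                          header=True, inferSchema=True)

# DataFrame operations resemble Pandas/R, but execute across the cluster.
summary_df = (plays_df
              .groupBy("user_id")
              .agg(F.count("*").alias("plays"),
                   F.avg("session_length").alias("avg_session_length")))

summary_df.show(5)

Because the operations are declarative, Spark can plan and distribute the work across the cluster rather than executing it row by row on a single machine.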
PySpark for data science
PySpark is an extremely ...