
Introduction to PySpark for Batch Pipelines

Explore how to use PySpark to build large-scale batch model pipelines that process billions of records and millions of users. Understand DataFrame operations, batch predictions on cloud storage in both AWS (S3) and GCP, and how to leverage Pandas UDFs for distributed feature engineering and deep learning in scalable production workflows.

Overview

Spark is a general-purpose computing framework that can scale to massive data volumes. It builds on prior big data tools such as Hadoop and MapReduce while providing significant improvements in the expressivity of the languages it supports. One of the core components of Spark is the resilient distributed dataset (RDD), which enables a cluster of machines to perform workloads in a coordinated and fault-tolerant manner. In more recent versions of Spark, the DataFrame API provides an abstraction on top of RDDs that resembles the same data structure in R and Pandas. PySpark is the Python interface to Spark, and it provides an API for working with large-scale datasets in a distributed computing environment.
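As a minimal sketch of what the DataFrame API looks like from Python, the snippet below starts a Spark session, loads a CSV file, and runs a couple of distributed operations. The file path and column names (`games.csv`, `plays`, `user_id`) are hypothetical placeholders, not part of this chapter's datasets.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session, the entry point for DataFrame operations
spark = SparkSession.builder.appName("intro_example").getOrCreate()

# Read a CSV file into a distributed DataFrame, inferring column types
games_df = spark.read.csv("data/games.csv", header=True, inferSchema=True)

# DataFrame operations look similar to Pandas/R, but execute across the cluster
summary_df = (
    games_df
    .filter(games_df["plays"] > 0)
    .groupBy("user_id")
    .count()
)
summary_df.show(5)
```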

PySpark for data science

PySpark is an extremely valuable tool for data scientists because it can streamline the process of translating prototype models into production-grade model workflows. At Zynga, our data science team owns a number of production-grade systems that provide useful signals to our game and marketing teams. By using PySpark, we’ve been able to reduce the amount of support we need from engineering teams to scale up models from concept to production.

Improving scalability

Up until now in this course, all of the models we've built and deployed have targeted single machines. While we were able to scale models across multiple machines using Lambda, ECS, and GKE, those containers worked in isolation, with no coordination among the nodes in these environments. With PySpark, we can build model workflows that are designed to operate in cluster environments for both model training and model serving.

This means that data scientists can now tackle much larger problems than were previously possible with earlier Python tools. PySpark strikes a nice balance between an expressive programming language and API versus more legacy options such as MapReduce. A general trend is that Hadoop usage is declining as more data science and engineering teams switch to the Spark ecosystem.

In the next chapter, we’ll explore another distributed computing ecosystem for data science called Cloud Dataflow, but for now, Spark is the open-source leader in this space. PySpark was one of my main motivations for switching from R to Python for data science workflows.

Goals

The goal of this chapter is to provide an introduction to PySpark for Python programmers that shows how to build large-scale model pipelines for batch scoring applications, where you may have billions of records and millions of users that need to be scored.

While production-grade systems will typically push results to application databases, in this chapter we'll focus on batch processes that pull data from a data lake and push results back to it for other systems to use. We'll explore pipelines that apply models on both AWS and GCP.

While the datasets used in this chapter rely on AWS and GCP for storage, the Spark environment does not have to run on either of these platforms and instead can run on Azure, other clouds, or on-prem Spark clusters.

Chapter learning outcomes

We’ll cover a variety of different topics in this chapter to show different use cases of PySpark for scalable model pipelines.

After showing how to make data available to Spark on S3, we'll cover some of the basics of PySpark, focusing on DataFrame operations. Next, we'll build out a predictive model pipeline that reads in data from S3, performs batch model predictions, and writes the results back to S3, following the pattern sketched below.
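The sketch below illustrates that read-score-write pattern with Spark MLlib. It assumes a cluster with the S3A connector and credentials configured, and the bucket paths, feature columns, and label column are hypothetical; the chapter's actual pipeline may use a different model and dataset.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("batch_scoring").getOrCreate()

# Read training and scoring data from S3 (Parquet preserves schemas)
train_df = spark.read.parquet("s3a://my-bucket/training/")
score_df = spark.read.parquet("s3a://my-bucket/to_score/")

# Assemble feature columns into the single vector column MLlib expects
assembler = VectorAssembler(inputCols=["feature_1", "feature_2"],
                            outputCol="features")

# Fit a simple model and apply it to the unscored records
model = LogisticRegression(labelCol="label").fit(assembler.transform(train_df))
preds = model.transform(assembler.transform(score_df))

# Write the predictions back to S3 for downstream systems to consume
preds.select("user_id", "prediction").write.mode("overwrite").parquet(
    "s3a://my-bucket/predictions/")
```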

We’ll follow this by demonstrating how a newer feature called Pandas UDFs can be used with PySpark to perform distributed deep learning and feature engineering.
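To preview the idea, here is a hedged sketch of the grouped Pandas UDF pattern (exposed as `applyInPandas` in recent Spark versions): Spark hands each group to an ordinary Python function as a Pandas DataFrame, so per-group feature engineering or model inference can run with familiar libraries. The toy data and column names are made up for illustration.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("pandas_udf_example").getOrCreate()

# Toy events data: one row per (user, amount)
events_df = spark.createDataFrame(
    [("u1", 1.0), ("u1", 3.0), ("u2", 5.0)], ["user_id", "amount"])

result_schema = StructType([
    StructField("user_id", StringType()),
    StructField("total", DoubleType()),
])

# Each user's rows arrive as a regular Pandas DataFrame, so ordinary
# Python/Pandas code (or a deep learning model) can run per group
def aggregate_user(pdf: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame({"user_id": [pdf["user_id"].iloc[0]],
                         "total": [pdf["amount"].sum()]})

# Spark distributes the groups across the cluster and collects the results
features_df = events_df.groupBy("user_id").applyInPandas(aggregate_user,
                                                         result_schema)
features_df.show()
```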

To conclude the chapter, we'll build another batch model pipeline, this time using GCP, and discuss how to productize workflows in a Spark ecosystem.