Transforming Data

The PySpark Dataframe API provides a variety of useful functions for aggregating, filtering, pivoting, and summarizing data. While some of these functionalities map well to Pandas operations, my recommendation for quickly getting up and running with data munging in PySpark is to use the SQL interface to dataframes in Spark, called Spark SQL. If you’re already using the pandasql or framequery libraries, then Spark SQL should provide a familiar interface.

If you’re new to these libraries, the SQL interface still provides an approachable way of working with the Spark ecosystem. We’ll cover the Dataframe API later; first, we’ll start with the SQL interface to get up and running.

Exploratory data analysis (EDA)

Exploratory data analysis (EDA) is one of the key steps in a data science workflow for understanding the shape of a dataset. To work through this process in PySpark, we’ll load the NHL stats dataset into a dataframe, expose it as a view to Spark, and calculate summary statistics with a SQL query. The aggregated dataframe is then visualized using the display command in Databricks.
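
As a rough illustration of that workflow, here is a minimal sketch. The file path and the column names (player_id, goals) are assumptions for the sake of the example, not details taken from the original snippet:

```python
from pyspark.sql import SparkSession

# In Databricks a SparkSession is already available as `spark`;
# this call is a no-op there and creates a local session elsewhere.
spark = SparkSession.builder.appName("nhl-eda").getOrCreate()

# Hypothetical path to the NHL stats CSV; adjust to wherever the data lives.
stats_df = spark.read.csv("game_skater_stats.csv",
                          header=True, inferSchema=True)

# Expose the dataframe as a temporary view so it can be queried with Spark SQL.
stats_df.createOrReplaceTempView("stats")

# Calculate summary statistics with a SQL query against the view.
summary_df = spark.sql("""
    SELECT player_id,
           COUNT(*) AS games,
           SUM(goals) AS total_goals
    FROM stats
    GROUP BY player_id
    ORDER BY total_goals DESC
""")

# In Databricks, display(summary_df) renders the result as an interactive
# table or chart; outside Databricks, show() prints it to the console.
summary_df.show()
```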
