Overview

Feature engineering is a key step in a data science workflow, and it is sometimes necessary to use Python libraries to implement it. For example, the AutoModel system at Zynga uses the Featuretools library to generate hundreds of features from raw tracking events, which are then used as input to classification models. To scale up the automated feature engineering approach that we first explored in Automated Feature Engineering, we can use Pandas UDFs to distribute the feature application process. As in the prior section, we need to sample data when determining which transformations to perform, but once they are chosen, we can apply them to massive datasets.
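The split between choosing features on a sample and applying them at scale can be sketched locally with plain pandas. This is only an illustration under stated assumptions: the `events` data, its column names, and the `apply_features` helper are hypothetical, simple event counts stand in for the features Featuretools would generate, and a modulo split stands in for Spark's partitioning.

```python
import pandas as pd

# Toy stand-in for raw tracking events; column names are hypothetical.
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "event": ["login", "purchase", "login", "chat", "login", "purchase"],
})

# Step 1: choose the feature set from a small sample (users 1 and 2 here).
sample = events[events["user_id"].isin([1, 2])]
feature_names = sorted(sample["event"].unique())

# Step 2: a function that applies the chosen features to one partition.
def apply_features(pdf: pd.DataFrame) -> pd.DataFrame:
    counts = pd.crosstab(pdf["user_id"], pdf["event"])
    # Align every partition to the same columns, filling missing ones with 0.
    counts = counts.reindex(columns=feature_names, fill_value=0)
    counts.columns = [f"count_{name}" for name in feature_names]
    return counts.reset_index()

# Emulate distribution locally: split the data, apply per partition, combine.
partitions = [part for _, part in events.groupby(events["user_id"] % 2)]
features = (
    pd.concat([apply_features(part) for part in partitions], ignore_index=True)
    .sort_values("user_id")
    .reset_index(drop=True)
)
print(features)
```

Because `apply_features` takes a pandas dataframe and returns a pandas dataframe with a fixed set of columns, a function of this shape is the kind that could be handed to PySpark's grouped `applyInPandas` along with an output schema, so that each worker processes its own group of rows.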

For this lesson, we’ll use the game plays dataset from the NHL Kaggle example, which includes detailed play-by-play descriptions of the events that occurred during each match. Our goal is to transform the deep and narrow dataframe into a shallow and wide dataframe that summarizes each game as a single record with hundreds of columns. An example of loading this data in PySpark and selecting the relevant columns is shown in the snippet below. Before calling toPandas, we use ...
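As a separate illustration of the deep-and-narrow to shallow-and-wide reshaping described above (not the PySpark loading snippet the lesson refers to), here is a minimal pandas sketch. The `plays` rows and the `game_id`, `event`, and `period_time` column names are assumptions made for the example, and the two aggregations are arbitrary choices rather than the lesson's feature set.

```python
import pandas as pd

# Hypothetical play-by-play rows: one row per event (deep and narrow).
plays = pd.DataFrame({
    "game_id": [1, 1, 1, 2, 2],
    "event": ["Shot", "Shot", "Goal", "Shot", "Hit"],
    "period_time": [30, 95, 410, 12, 600],
})

# Aggregate per game and event type, then move event types into columns.
wide = (
    plays.groupby(["game_id", "event"])["period_time"]
    .agg(["count", "max"])
    .unstack(fill_value=0)
)

# Flatten the (aggregation, event) column pairs into single names.
wide.columns = [f"{agg}_{event}" for agg, event in wide.columns]
wide = wide.reset_index()  # one record per game (shallow and wide)
print(wide)
```

Each additional event type or aggregation adds more columns per game, which is how a narrow event log can grow into a record with hundreds of columns.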
