Distributed Feature Engineering

Overview

Feature engineering is a key step in a data science workflow, and sometimes, it is necessary to use Python libraries to implement this functionality. For example, the AutoModel system at Zynga uses the Featuretools library to generate hundreds of features from raw tracking events, which are then used as input to classification models. To scale up the automated feature engineering approach that we first explored in Automated Feature Engineering, we can use Pandas UDFs to distribute the feature application process. Like the prior section, we need to sample data when determining which transformation to perform, but when applying the transformation we can scale it to massive datasets.

For this lesson, we’ll use the game plays dataset from the NHL Kaggle example, which includes detailed play-by-play descriptions of the events that occurred during each match. Our goal is to transform the deep and narrow dataframe into a shallow and wide dataframe that summarizes each game as a single record with hundreds of columns. An example of loading this data in PySpark and selecting the relevant columns is shown in the snippet below. Before calling toPandas, we use the filter function to sample 0.3%0.3\% of the records and then cast the result to a Pandas frame, which has a shape of 10,71710,717 rows and 1616 columns.

Get hands-on with 1200+ tech skills courses.