Data Drift

Learn the various types of data drift and how to remedy them.

Data drift is a general term for several situations in which the underlying data changes (or drifts) over time. Almost all real datasets exhibit some kind of drift, so it’s critical to determine whether a pipeline is susceptible to it before deployment. Consider the following problem statement: How can a model that learns from past and current data perform equally well on future data if that future data looks nothing like the past?

Data drift is potentially dangerous to a pipeline. A toy example we’ll refer to throughout this lesson is a simple computer vision model that detects whether an image contains a dog. The inputs to the model are images (some with dogs, some without), and the output is a 0 or a 1 marking the absence or presence of a dog. Let’s assume that this model will be deployed for centuries.

Types of drift

There are many ways a model can go “out of style.” We consider shifts in the inputs, shifts in the outputs, and more.

Covariate shift

A covariate shift occurs when the distribution of the input data changes over time. Specifically, the independent variables (e.g., the images with and without dogs) look fundamentally different at deployment time than they did during training. Our toy example illustrates this as follows:

The training data for the model is sourced from 2000–2023. The model is deployed for an incredibly long time, but photography changes fundamentally! Images from 2060 are far clearer and more detailed than images from 2023. ...
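
One way to check for the kind of input change described above is to compare the distribution of a simple input statistic between the training data and recent data. The following is a minimal sketch, not a method prescribed by this lesson. It assumes a hypothetical per-image "sharpness" score, uses synthetic numbers in place of real images, and measures the gap with the two-sample Kolmogorov-Smirnov statistic. The names `ks_statistic` and `DRIFT_THRESHOLD`, and the threshold value itself, are illustrative assumptions.

```python
import bisect
import random


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of the two samples (0 = identical, 1 = disjoint)."""
    ref = sorted(reference)
    cur = sorted(current)
    largest_gap = 0.0
    # The gap between two step functions can only change at observed values,
    # so it is enough to evaluate both CDFs at every observed value.
    for value in ref + cur:
        ref_cdf = bisect.bisect_right(ref, value) / len(ref)
        cur_cdf = bisect.bisect_right(cur, value) / len(cur)
        largest_gap = max(largest_gap, abs(ref_cdf - cur_cdf))
    return largest_gap


# Hypothetical per-image sharpness scores (synthetic, for illustration only).
rng = random.Random(0)
train_sharpness = [rng.gauss(0.50, 0.10) for _ in range(1000)]     # training-era images
same_era_sharpness = [rng.gauss(0.50, 0.10) for _ in range(1000)]  # new images, same era
future_sharpness = [rng.gauss(0.80, 0.10) for _ in range(1000)]    # much sharper images

DRIFT_THRESHOLD = 0.2  # assumed cutoff; would need tuning for a real pipeline

for name, sample in [("same era", same_era_sharpness), ("future", future_sharpness)]:
    stat = ks_statistic(train_sharpness, sample)
    print(f"{name}: KS = {stat:.3f}, drift flagged = {stat > DRIFT_THRESHOLD}")
```

In this sketch, the same-era sample should stay well under the threshold, while the sharper "future" sample should exceed it. In a real pipeline, the statistic would be computed on features extracted from actual images, and the threshold would be chosen from historical variation rather than fixed in advance.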
