Data Drift

Learn the various types of data drift and how to remedy them.

Data drift is a general term for situations where the underlying data changes (or drifts) over time. Almost all real datasets exhibit some kind of drift, so it’s critical to determine whether a pipeline is susceptible to it before deployment. Consider the following problem statement: How can a model that learns from past and current data still perform well on future data if that future data looks nothing like the past?

Data drift is potentially dangerous to a pipeline. A toy example we’ll refer to throughout this lesson is a simple computer vision model that recognizes and classifies dogs. The inputs to the model are images (some with dogs, some without), and the output is a 0 or a 1 marking the absence or presence of a dog. Let’s assume that this model will be deployed for centuries.

Types of drift

There are many ways a model can go “out of style.” We consider shifts in the inputs, the outputs, and more.

Covariate shift

A covariate shift occurs when the distribution of the input data changes over time. Specifically, the independent variables (e.g., the images with and without dogs) look different at deployment time than they did during training. Our toy example illustrates this:

The training data for the model is sourced from 2000–2023. The model is deployed for a very long time, and photography changes in the meantime: images from 2060 are much clearer and more detailed than images from 2023. The model might therefore lose accuracy because its inputs have changed, even though the images still contain dogs (or no dogs, for negative examples).
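One way to make this scenario concrete is to compare a single input statistic between the training period and the deployment period. The sketch below is only an illustration under stated assumptions: the per-image "sharpness score" is a hypothetical feature, the two samples are simulated rather than measured, and the check is a hand-written two-sample Kolmogorov-Smirnov statistic with the usual large-sample threshold, not a method prescribed by this lesson.

```python
import bisect
import math
import random


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap
    between the empirical CDFs of the two samples."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    max_gap = 0.0
    for value in a + b:
        cdf_a = bisect.bisect_right(a, value) / len(a)
        cdf_b = bisect.bisect_right(b, value) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


def ks_threshold(n, m, alpha=0.05):
    """Approximate large-sample critical value for the two-sample KS test."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    return c_alpha * math.sqrt((n + m) / (n * m))


# Simulated, hypothetical sharpness scores (assumption: future images are sharper).
rng = random.Random(0)
train_scores = [rng.gauss(0.4, 0.1) for _ in range(500)]   # stands in for 2000-2023 images
future_scores = [rng.gauss(0.7, 0.1) for _ in range(500)]  # stands in for 2060 images

statistic = ks_statistic(train_scores, future_scores)
threshold = ks_threshold(len(train_scores), len(future_scores))
drift_detected = statistic > threshold

print(f"KS statistic: {statistic:.3f}, threshold: {threshold:.3f}")
print("Covariate shift suspected" if drift_detected else "No shift detected")
```

A check like this looks only at the inputs, so it can run on unlabeled deployment data. In practice it would be repeated per feature (or on learned embeddings), and the significance level would need adjusting for the number of comparisons.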

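Mathematically, covariate shift is commonly defined as the case where the input distribution changes, P_train(X) ≠ P_future(X), while the relationship between inputs and labels, P(Y | X), stays the same.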
