Detecting Data Drift
Explore various methods for detecting data drift in machine learning pipelines, including statistical tests like the Kolmogorov-Smirnov test and algorithmic approaches such as Page-Hinkley and DDM. Understand how to identify shifts in data distributions to maintain model performance and prevent bias.
Data drift can degrade an ML model in deployment. As the underlying data changes, the model's predictions can become skewed or, worse, biased. In this lesson, we cover commonly used theoretical methods for identifying data drift.
Statistical methods
Statistical methods tend to be fast and low-lift. They are simple mathematical formulations that rely on hypothesis tests to detect drift at a chosen significance level.
Kolmogorov-Smirnov
The two-sample Kolmogorov-Smirnov (KS) test is a statistical hypothesis test with the following hypotheses:
$H_0$: The two samples come from the same distribution.
$H_1$: The two samples are drawn from different distributions.
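To make the hypotheses concrete, here is a minimal sketch of a two-sample KS check in plain Python. It assumes the usual definition of the KS statistic (the largest absolute gap between the two empirical CDFs) and the asymptotic Kolmogorov approximation for the p-value. The function names `ks_statistic` and `ks_pvalue`, the synthetic Gaussian samples, and the 0.05 threshold are illustrative choices, not part of the lesson. In practice, a library routine such as `scipy.stats.ks_2samp` would normally be used instead.

```python
import bisect
import math
import random


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    # Both empirical CDFs are step functions that only change at sample
    # points, so checking every observed value is enough to find the maximum.
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)  # fraction of a <= x
        cdf_b = bisect.bisect_right(b, x) / len(b)  # fraction of b <= x
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_pvalue(d, n, m, terms=100):
    """Asymptotic two-sided p-value (an approximation for large samples)."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:
        # The series below converges slowly near zero; the true value
        # is indistinguishable from 1 in this range.
        return 1.0
    total = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * total))


# Synthetic data (assumed for illustration): a reference sample and a
# "production" sample whose mean has shifted by one standard deviation.
rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(500)]
production = [rng.gauss(1.0, 1.0) for _ in range(500)]

d = ks_statistic(reference, production)
p = ks_pvalue(d, len(reference), len(production))

ALPHA = 0.05  # assumed significance level
drift_detected = p < ALPHA  # reject H_0 when the p-value is small
print(f"D = {d:.3f}, p = {p:.3g}, drift detected: {drift_detected}")
```

A small p-value leads to rejecting $H_0$, which is read here as evidence of drift. Because the test is univariate, a pipeline with many features would typically run it per feature.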
For two samples of size