Data leakage in machine learning

Data leakage occurs when a model learns from data that should not be part of the training data set, or from data that would not be available in a real-life scenario. It is most common when the data set already contains the information that you are trying to predict.
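As a rough illustration of a data set that already contains the information being predicted, the sketch below flags any feature that, on its own, determines the target almost perfectly. The function name `flag_leaky_features`, the column names, the toy rows, and the 0.99 threshold are all assumptions made for this example. It is a simple heuristic, not a complete leakage detector.

```python
def flag_leaky_features(rows, target, threshold=0.99):
    """Return features whose values alone predict the target almost perfectly.

    Heuristic sketch: for each feature, build a lookup table from feature
    value to the most frequent target value and measure its accuracy.
    Note that high-cardinality columns such as unique IDs will also be
    flagged, so every hit still needs a manual review.
    """
    features = [name for name in rows[0] if name != target]
    flagged = []
    for name in features:
        counts = {}  # feature value -> {target value: count}
        for row in rows:
            inner = counts.setdefault(row[name], {})
            inner[row[target]] = inner.get(row[target], 0) + 1
        # Rows a per-value majority vote would classify correctly.
        correct = sum(max(inner.values()) for inner in counts.values())
        if correct / len(rows) >= threshold:
            flagged.append(name)
    return flagged


# Hypothetical churn data: "account_closed_flag" is only known after the
# customer has already churned, so it leaks the target.
rows = [
    {"age_band": "young", "plan": "basic", "account_closed_flag": 1, "churned": 1},
    {"age_band": "young", "plan": "basic", "account_closed_flag": 0, "churned": 0},
    {"age_band": "old", "plan": "premium", "account_closed_flag": 0, "churned": 0},
    {"age_band": "old", "plan": "basic", "account_closed_flag": 1, "churned": 1},
    {"age_band": "young", "plan": "premium", "account_closed_flag": 0, "churned": 0},
    {"age_band": "old", "plan": "premium", "account_closed_flag": 1, "churned": 1},
]

print(flag_leaky_features(rows, target="churned"))  # ['account_closed_flag']
```

A flagged feature is not automatically leakage. The useful question is whether its value would really be known at the moment a prediction is made.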

Time series forecasting

Data leakage is common in time series forecasting, where the data points follow a chronological order.

Depending on the nature of the data set, the target variable may have a very similar distribution in both the training and the test sets. However, this may not hold true in real-life scenarios. The model can learn how the probability of each target value changes with the moment in time. Thus, any time-related feature in the data set is a potential source of data leakage.
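One way to see this effect is to check how the share of positive targets changes from one period to the next. The sketch below assumes a binary target stored under a hypothetical `label` key and a hypothetical `month` key. Both names and the toy records are invented for illustration.

```python
from collections import defaultdict


def positive_rate_by_period(records, period_key="month", target_key="label"):
    """Return the fraction of positive targets for each time period."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for record in records:
        totals[record[period_key]] += 1
        positives[record[period_key]] += record[target_key]
    return {period: positives[period] / totals[period] for period in sorted(totals)}


# Toy records: the positive rate rises from month 1 to month 3.
records = (
    [{"month": 1, "label": y} for y in (0, 0, 0, 1)]
    + [{"month": 2, "label": y} for y in (0, 1, 0, 1)]
    + [{"month": 3, "label": y} for y in (1, 1, 1, 0)]
)

print(positive_rate_by_period(records))  # {1: 0.25, 2: 0.5, 3: 0.75}
```

If the rate drifts like this, a model given `month` as a feature can score well by memorizing the period instead of learning from the other features, and that pattern may not carry over to future data.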

Therefore, the first approach to countering data leakage in time series forecasting is to remove all time-related features.
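A minimal sketch of that step, assuming each row is a dictionary and that the time-related columns are known in advance. The names in `TIME_FEATURES` are hypothetical placeholders, and deciding which columns count as time-related depends on the data set.

```python
# Hypothetical names of columns that encode time directly or indirectly.
TIME_FEATURES = {"timestamp", "month", "row_id"}


def drop_time_features(rows, time_features=TIME_FEATURES):
    """Return copies of the rows without the listed time-related features."""
    return [
        {name: value for name, value in row.items() if name not in time_features}
        for row in rows
    ]


raw_rows = [
    {"row_id": 1, "timestamp": "2024-01-05", "month": 1, "price": 10.0, "label": 0},
    {"row_id": 2, "timestamp": "2024-02-11", "month": 2, "price": 12.5, "label": 1},
]

clean_rows = drop_time_features(raw_rows)
print(clean_rows)
# [{'price': 10.0, 'label': 0}, {'price': 12.5, 'label': 1}]
```

Sequential identifiers such as the assumed `row_id` are included because they often follow insertion order and can act as a proxy for time.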

Copyright ©2024 Educative, Inc. All rights reserved