Data leakage in machine learning

Data leakage occurs when a model learns from data that should not be part of the training set, or from information that would not be available at prediction time in a real-life scenario. It is most common when the data set already contains the information you are trying to predict.
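
One rough way to spot a feature that already contains the answer is to check whether each of its values always goes with the same target value. The sketch below is only a heuristic under stated assumptions: the function name `find_suspect_features`, the loan data, and the column names are all hypothetical, and a flagged column still needs a human to decide whether it is a genuine leak or simply a strong legitimate predictor.

```python
from collections import defaultdict

def find_suspect_features(rows, target):
    """Return feature names whose values each map to exactly one target value."""
    # feature name -> feature value -> set of target values seen with it
    seen = defaultdict(lambda: defaultdict(set))
    for row in rows:
        for name, value in row.items():
            if name != target:
                seen[name][value].add(row[target])
    suspects = []
    for name, mapping in seen.items():
        deterministic = all(len(targets) == 1 for targets in mapping.values())
        # Skip columns where every value is unique (e.g. row IDs),
        # since they are trivially deterministic.
        repeated = len(mapping) < len(rows)
        if deterministic and repeated:
            suspects.append(name)
    return sorted(suspects)

# Hypothetical loan data: "days_overdue_final" is only known after the outcome.
rows = [
    {"income": 40, "days_overdue_final": 0, "defaulted": 0},
    {"income": 40, "days_overdue_final": 90, "defaulted": 1},
    {"income": 55, "days_overdue_final": 0, "defaulted": 0},
    {"income": 55, "days_overdue_final": 90, "defaulted": 1},
]
suspects = find_suspect_features(rows, "defaulted")
print(suspects)
```

Here `income` is not flagged because the same income appears with both outcomes, while `days_overdue_final` determines the target exactly and would not be known when the prediction is made.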

Depending on the nature of the data set, the target variable may have a very similar distribution in the training and test sets. That similarity may not hold in real-life scenarios. The model can learn how the probability of each target value changes over time, so any time-related feature in the data set is a potential source of data leakage.

Therefore, the first approach to countering data leakage in time series forecasting is to remove all time-related features.
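
As a minimal sketch of that step, the snippet below drops time-related columns from a list of records before they reach a model. The helper `drop_time_features` and the column names (`timestamp`, `month`, `temperature`, `demand`) are assumptions made for illustration; which columns count as time-related depends on the data set.

```python
def drop_time_features(rows, time_columns):
    """Return copies of the rows without the listed time-related columns."""
    blocked = set(time_columns)
    return [{k: v for k, v in row.items() if k not in blocked} for row in rows]

# Hypothetical demand records with two time-related columns.
rows = [
    {"timestamp": "2021-01-01", "month": 1, "temperature": 3.5, "demand": 120},
    {"timestamp": "2021-01-02", "month": 1, "temperature": 4.1, "demand": 131},
]
cleaned = drop_time_features(rows, ["timestamp", "month"])
print(cleaned)
```

The original records are left unchanged, so the time columns remain available for ordering the data or building a chronological train/test split.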
