Splitting Time Series Data

Learn how to split time series data for training, testing, and validation.

Motivation

When we develop machine learning models, we split our data into train, validation, and test sets to evaluate how well the models generalize to unseen data. Most techniques for dividing the data into these sets have something in common: the split is random. In other words, each data point is assigned to one of the three sets at random. With time series, we cannot do that, because of the sequential nature of the data.

Time series forecasting is based on the principle that the future will be similar to the past. This principle would be broken if we trained our models on randomly selected data points, because we could end up training a model on data recorded after the data we test it on. This situation is a form of data leakage. Put differently, we might be using tomorrow’s temperature to predict yesterday’s temperature, which is impossible in real life: our model would never encounter this situation in production.
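To make the leakage concrete, here is a minimal sketch that compares a random split with a chronological one. The integer time steps, the 80/20 ratio, and all variable names are assumptions made for illustration; they do not come from a real dataset.

```python
import random

# Hypothetical series: 100 integer time steps standing in for dates.
timestamps = list(range(100))

# Random split: shuffle first, then take 80% for training.
rng = random.Random(0)
shuffled = timestamps[:]
rng.shuffle(shuffled)
random_train, random_test = shuffled[:80], shuffled[80:]

# Training points that occur after the earliest test point are "from the future".
leaked = [t for t in random_train if t > min(random_test)]
print("random split, training points later than the earliest test point:", len(leaked))

# Chronological split: the first 80 steps train, the last 20 test.
chrono_train, chrono_test = timestamps[:80], timestamps[80:]
chrono_leaked = [t for t in chrono_train if t > min(chrono_test)]
print("chronological split, training points later than the earliest test point:", len(chrono_leaked))
```

With the random split, some training points come later in time than points in the test set, which is exactly the situation described above. The chronological split has none.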

Sequential split with cut-off points

We can avoid the dangers of data leakage and still apply a rigorous split strategy. The easiest way to split the data is to select a cut-off point: observations before the cut-off go to the train set, and observations after it are held out for testing. The same logic applies if we also want a validation set, in which case we use two cut-off points, as shown in the diagram below.
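The two cut-off approach can be sketched in a few lines of Python. The function name `sequential_split`, the parameter names, and the daily sample data are hypothetical. The sketch also assumes that an observation falling exactly on a cut-off belongs to the later set, a choice the text above leaves open.

```python
from datetime import date, timedelta

def sequential_split(rows, val_cutoff, test_cutoff):
    """Split (timestamp, value) rows into train, validation, and test sets.

    Assumption: a row whose timestamp equals a cut-off goes to the later set.
    """
    if val_cutoff > test_cutoff:
        raise ValueError("val_cutoff must not be later than test_cutoff")
    rows = sorted(rows, key=lambda r: r[0])  # enforce chronological order
    train = [r for r in rows if r[0] < val_cutoff]
    val = [r for r in rows if val_cutoff <= r[0] < test_cutoff]
    test = [r for r in rows if r[0] >= test_cutoff]
    return train, val, test

# Hypothetical daily series: 10 observations starting on 2023-01-01.
start = date(2023, 1, 1)
series = [(start + timedelta(days=i), float(i)) for i in range(10)]

train, val, test = sequential_split(series, date(2023, 1, 7), date(2023, 1, 9))
print(len(train), len(val), len(test))  # prints "6 2 2"
```

Because every training timestamp precedes every validation timestamp, and every validation timestamp precedes every test timestamp, the model is never evaluated on data older than what it was trained on.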
