Quick Overview of Split Validation

An introduction to splitting data into training, validation, and test sets using scikit-learn.

Definition: Split validation

A crucial part of machine learning is partitioning the data into two separate sets using a technique called split validation.

  • The first set is called the training data and is used to build the prediction model.

  • The second set is called the test data and is kept in reserve to assess the accuracy of the model developed from the training data.

  • The training and test data are typically split 70/30 or 80/20, with the training data representing the larger portion. Once the model has been optimized and its accuracy has been checked against the test data, it’s ready to generate predictions using new input data.
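To make the proportions concrete, here is a minimal sketch of an 80/20 split written with only the Python standard library. The helper name split_rows, its parameters, and the 100 placeholder rows are assumptions made for illustration; they are not part of any library.

```python
import random

def split_rows(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of rows and cut it into (train, test) lists."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for repeatable splits
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))  # hypothetical stand-in for 100 data rows
train_rows, test_rows = split_rows(rows, train_fraction=0.8)
print(len(train_rows), len(test_rows))
```

Every row lands in exactly one of the two lists, which is the property that keeps the test data independent of the training data.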

Although the model is run on both the training and test sets, it is built from the training data alone.

The test data is used as input to form predictions and assess the model’s accuracy, but it should never be used to build or tune the model. Since the test data cannot be used to build and optimize the model, data scientists sometimes use a third independent dataset called the validation set.

After building an initial model with the training set, the validation set can be fed to the prediction model and used as feedback to optimize the model’s hyperparameters. The test set is then used to assess the prediction error of the final model.
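One common way to obtain a separate validation set is to call train_test_split twice. The sketch below assumes a 60/20/20 split and uses made-up data; the proportions, the random_state value, and the variable names are illustrative choices rather than anything prescribed above.

```python
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows with one feature and a binary label.
X = [[i] for i in range(100)]
y = [i % 2 for i in range(100)]

# First cut: hold out 20% of all rows as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10
)

# Second cut: 25% of the remaining 80% gives a validation set
# that is 20% of the original data.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=10
)

print(len(X_train), len(X_val), len(X_test))
```

The validation rows can then guide hyperparameter choices, while the test rows stay untouched until the final assessment.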

To maximize data utility, it is possible to reuse the validation and test data as training data. This would involve bundling the used data with the original training data to optimize the model just before it’s put into use.

However, once the original validation or test set has been used for training, it can no longer be used as a validation or test set.
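As a rough sketch of that final step, the code below scores a model on the test set first and only then refits on the combined data. LogisticRegression, the synthetic data, and the variable names are assumptions chosen for illustration; any estimator could take its place.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: the label is 1 when the single feature is 50 or more.
X = [[i] for i in range(100)]
y = [1 if i >= 50 else 0 for i in range(100)]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10
)

# Fit on the training data only, then record the test accuracy.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

# Refit on all rows just before deployment. From here on the old
# test rows are training data and cannot give an unbiased score.
final_model = LogisticRegression(max_iter=1000)
final_model.fit(X_train + X_test, y_train + y_test)
```

The accuracy recorded before the refit is the last unbiased estimate available, so it is worth saving alongside the final model.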

Train and test set

To perform split validation in Python, you can use train_test_split from scikit-learn, which must first be imported from the sklearn.model_selection module.

from sklearn.model_selection import train_test_split

Before calling this function, you first need to define your X and y values.

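import pandas as pd
# Load the dataset into a DataFrame
df = pd.read_csv('~/Downloads/advertising.csv')
# X holds the independent variables (features)
X = df[['Daily Time Spent on Site', 'Age', 'Area Income', 'Daily Internet Usage', 'Ad Topic Line', 'Country']]
# y holds the dependent variable (the target to predict)
y = df['Clicked on Ad']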

You are now ready to create your training and test data using the following parameters: train_size (optional), test_size, random_state (optional), and shuffle (optional).
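A call with these parameters might look like the sketch below. Because advertising.csv is not reproduced here, the example builds a tiny DataFrame with invented values and only two of the feature columns; the 70/30 split and random_state=10 are likewise illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny invented frame standing in for advertising.csv (values are made up).
df = pd.DataFrame({
    'Daily Time Spent on Site': [68.9, 80.2, 69.5, 74.2, 68.4, 60.0, 88.9, 66.0, 74.5, 69.9],
    'Age': [35, 31, 26, 29, 35, 23, 33, 48, 30, 20],
    'Clicked on Ad': [0, 0, 0, 0, 0, 1, 0, 1, 0, 1],
})
X = df[['Daily Time Spent on Site', 'Age']]
y = df['Clicked on Ad']

# test_size=0.3 reserves 30% of the rows for testing,
# random_state fixes the shuffle so the split is repeatable,
# shuffle=True (the default) randomizes row order before splitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=10, shuffle=True
)

print(X_train.shape, X_test.shape)
```

When train_size is omitted, it is set to the complement of test_size, so only one of the two usually needs to be given.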
