Validation
Get introduced to the importance of data splits and the process of cross-validation.
Since regularization fine-tunes a model by introducing an additional penalty in the error function, we need to validate its impact. Several hyperparameters must be set before optimizing the objective function: the model, the loss function, the regularization function, and the scale of regularization. Validation is the process of testing the accuracy of the trained model, which also measures how well these hyperparameters were chosen.
Note: An accurate indicator of generalization is the performance of the trained model on unseen data. This is the data that isn’t used in the training process.
Data splits
Where do we get the unseen data for validation? One way is to hold out a percentage of the available data and use the rest for training. Once the training is complete, validation can be carried out on the held-out subset, known as the hold-out set.
Note: A more popular term for the hold-out set is test set.
How large should the test set be? To assess generalization, we need the test set to be large. But we also need the training set to be large to avoid overfitting. There’s no exact workaround to this trade-off. A rule of thumb, however, is to use an 80/20 split: roughly 80% of the data for training and 20% for testing.
To improve the performance after validation, the hyperparameters can be tuned. The validation and tuning cycle continues until the desired performance is achieved.
Validation set
If, after validating on the test set, the hyperparameters are tuned and the training is carried out again, the test set is no longer unseen: it has influenced the training process, even though it wasn’t used directly as the training set.
If the test set must remain unseen, then how can the hyperparameters be tuned?
A compromise in this situation is to make another split of the data called the validation set and use it for tuning the hyperparameters. This way, the test set can be kept unseen, and the performance of the final model can be reported on it.
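For example, the scale of regularization can be tuned on the validation set while the test set stays untouched until the final report. Below is a minimal sketch of this cycle, assuming closed-form ridge regression on toy data (the fit_ridge and mse helpers, the candidate lambda values, and the data shapes are illustrative assumptions):

```python
import numpy as np

np.random.seed(0)

# Toy train / validation / test splits (the shapes are assumptions).
X_train, y_train = np.random.randn(60, 5), np.random.randn(60, 1)
X_val, y_val = np.random.randn(20, 5), np.random.randn(20, 1)
X_test, y_test = np.random.randn(20, 5), np.random.randn(20, 1)

def fit_ridge(X, y, lam):
    # Closed-form ridge solution: w = (X^T X + lam * I)^(-1) X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def mse(X, y, w):
    # Mean squared error of the linear model w on (X, y).
    return float(np.mean((X @ w - y) ** 2))

# Tune the scale of regularization on the validation set only.
best_lam, best_w = None, None
for lam in (0.01, 0.1, 1.0, 10.0):
    w = fit_ridge(X_train, y_train, lam)
    if best_w is None or mse(X_val, y_val, w) < mse(X_val, y_val, best_w):
        best_lam, best_w = lam, w

# Only the final, tuned model is scored on the unseen test set.
print("best lambda:", best_lam, "test MSE:", mse(X_test, y_test, best_w))
```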
Implementing data splits
It’s handy to understand how a dataset can be randomly split into three parts. In the code below, the variable X contains the feature vectors as rows, and the variable y contains the targets as rows (assuming multi-target modeling).
Using numpy
There are several ways of doing this data split; one way is to randomly permute the indices of the data points and then pick the desired percentages. This is easy to understand and implement. A handy way to get a validation split is to first split the dataset into two parts, train and test, and then split the test set again to obtain a validation set. The code below implements this idea:
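Here is a minimal numpy sketch of this approach (the split_data name, its default test_size of 0.2, the dummy data, and the 60/20/20 percentages are illustrative assumptions):

```python
import numpy as np

# Permute the row indices, then cut them into train and test portions.
def split_data(X, y, test_size=0.2):
    indices = np.random.permutation(len(X))
    cut = int(len(X) * (1 - test_size))
    return X[indices[:cut]], X[indices[cut:]], y[indices[:cut]], y[indices[cut:]]


# Dummy data: 100 examples with 5 features and 2 targets each.
X = np.random.randn(100, 5)
y = np.random.randn(100, 2)

# Split off a combined test chunk first, then split it in half to
# carve out the validation set: 60% train, 20% validation, 20% test.
X_train, X_test, y_train, y_test = split_data(X, y, test_size=0.4)
X_val, X_test, y_val, y_test = split_data(X_test, y_test, test_size=0.5)
```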
Here is the explanation for the code above:
- Lines 4–7: We define the split_data function that splits the input arrays X and y into training and testing sets, where the testing set size is determined by the test_size argument (default ...)
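For comparison, when scikit-learn is available, the same three-way split can be made by applying its train_test_split helper twice (a sketch; the percentages and random_state are arbitrary choices):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Same dummy data layout as above: 100 rows, 5 features, 2 targets.
X = np.random.randn(100, 5)
y = np.random.randn(100, 2)

# 60/40 split first, then halve the 40% chunk into validation and test.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=0)
```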