Cross Validation
Explore how to split datasets into training, validation, and test sets to evaluate model performance effectively. Understand cross validation methods such as k-fold and Leave-One-Out to reduce overfitting and make better use of data, ensuring more reliable machine learning results.
Train, test, and validation datasets
We divide the dataset at hand into a training and a test dataset:
- We train the model on the training dataset and evaluate its performance.
- We evaluate the model’s performance on the test dataset (on which the model is not trained) and report the performance of the model.
Scikit-learn provides train_test_split, which gives us the training and test datasets. These code snippets have been taken from the scikit-learn documentation itself.
- Line 6 loads the iris dataset and saves the input columns in X and the output column in y.
- Lines 8 and 9 print the shape of the dataset.
- Line 11 splits the dataset into the training and the test datasets. test_size specifies the proportion of instances to be kept in the test dataset. In the current case, 40% of the rows are kept in the test dataset.
- Then we print the shape of the newly formed datasets.
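The steps listed above can be sketched as follows. This is a minimal sketch modeled on the example in the scikit-learn documentation, assuming scikit-learn is installed. It is laid out so that the line references above should line up with it, and random_state=0 is an assumption added here so the split is reproducible.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Load the iris data: X holds the four input columns, y the class labels.
# return_X_y=True returns the (data, target) pair directly.
X, y = datasets.load_iris(return_X_y=True)
# Shapes of the full dataset: 150 rows and 4 features.
print(X.shape)  # (150, 4)
print(y.shape)  # (150,)
# Hold out 40% of the rows for testing; random_state fixes the shuffle (assumption).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
# Shapes after the split: 90 training rows and 60 test rows.
print(X_train.shape, y_train.shape)  # (90, 4) (90,)
print(X_test.shape, y_test.shape)  # (60, 4) (60,)
```

With 150 rows and test_size=0.4, the split leaves 90 rows for training and 60 for testing.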
Validation dataset
When evaluating different settings (hyperparameters) for models, such as the regularization strength (alpha) that must be manually set for a Ridge regression, there is still a risk of overfitting ...
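One common way to use a validation dataset is to split the data twice and compare hyperparameter settings on the validation rows only. The sketch below is illustrative rather than the lesson's own code: the synthetic data from make_regression, the 60/20/20 split, and the candidate_alphas grid are all assumptions chosen for the example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic regression data, used only for illustration (assumption).
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# First split: hold out 20% of the rows as the final test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Second split: 25% of the remaining 160 rows become the validation set,
# giving 120 training, 40 validation, and 40 test rows.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

# Hypothetical grid of values for Ridge's regularization strength.
candidate_alphas = [0.01, 0.1, 1.0, 10.0, 100.0]

best_alpha, best_val_score = None, float("-inf")
for alpha in candidate_alphas:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    val_score = model.score(X_val, y_val)  # R^2 on the validation set
    if val_score > best_val_score:
        best_alpha, best_val_score = alpha, val_score

# Refit with the chosen alpha and use the test set only once, at the end.
final_model = Ridge(alpha=best_alpha).fit(X_train, y_train)
test_score = final_model.score(X_test, y_test)
print(best_alpha, round(best_val_score, 3), round(test_score, 3))
```

Because the test rows play no part in choosing alpha, the final score is a fairer estimate of performance on unseen data. The cost is that fewer rows are left for training.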