Cross Validation

Explore how to split a dataset into training, validation, and test sets to evaluate model performance reliably. Understand cross-validation methods such as k-fold and leave-one-out, which help detect overfitting and make better use of limited data, giving more reliable machine learning results.

We'll cover the following...

Train, test, and validation datasets
- Validation dataset
  - Drawback
  - Solution
What is cross-validation?

Python 3.5

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm

X, y = datasets.load_iris(return_X_y=True)
print("Original Shape of input and output columns")
print(X.shape)
print(y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
print("Shape of the training dataset's input and output columns")
print(X_train.shape)
print(y_train.shape)
print("Shape of the test dataset's input and output columns")
print(X_test.shape)
print(y_test.shape)

The datasets.load_iris(return_X_y=True) call loads the iris dataset and saves the input columns in X and the output column in y.

The two print calls that follow show the shapes of X and y.

The train_test_split call splits the dataset into training and test datasets. test_size specifies the fraction of instances to keep in the test dataset. Here test_size=0.4, so 40% of the rows (60 of 150) go to the test dataset, and random_state=0 makes the split reproducible.
The remaining print calls show the shapes of the newly formed datasets.
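
The split above becomes useful once a model is fitted on the training rows and scored on the held-out test rows. The following is a minimal sketch of that step, not necessarily the lesson's own next example: the choice of a linear-kernel SVC with C=1 is an assumption made here for illustration, picked because the snippet above already imports svm.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

# Same data and split as in the example above.
X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

# Assumption for illustration: a linear-kernel SVC with C=1.
# The model sees only the training rows while fitting.
clf = svm.SVC(kernel="linear", C=1).fit(X_train, y_train)

# score() returns mean accuracy on rows the model has never seen.
test_accuracy = clf.score(X_test, y_test)

print(X_train.shape, X_test.shape)
print(test_accuracy)
```

Because the test rows play no part in fitting, the printed accuracy is an estimate of how the model behaves on new data, which is the purpose of holding the test dataset out.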

Validation dataset

When evaluating different hyperparameter settings for a model, such as the $\alpha$ (regularization strength) value that must be set manually for a Ridge regression, there is still a risk of overfitting ...
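
One common way to limit this risk is to tune hyperparameters on a separate validation dataset and touch the test dataset only once, at the end. The sketch below illustrates the idea for the $\alpha$ of a Ridge regression. The diabetes dataset, the candidate $\alpha$ values, and the split ratios are all assumptions chosen for illustration, not values taken from the lesson.

```python
from sklearn import datasets
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Assumption: a small built-in regression dataset, since Ridge is a regressor.
X, y = datasets.load_diabetes(return_X_y=True)

# Hold out the test set first; it is not used while tuning.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
# Split the remainder into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

# Hypothetical candidate values for alpha.
candidate_alphas = [0.01, 0.1, 1.0, 10.0]

# Fit on the training set, compare candidates on the validation set (R^2).
val_scores = {}
for alpha in candidate_alphas:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    val_scores[alpha] = model.score(X_val, y_val)

best_alpha = max(val_scores, key=val_scores.get)

# Refit with the chosen alpha on training + validation data,
# then evaluate once on the untouched test set.
final_model = Ridge(alpha=best_alpha).fit(X_rest, y_rest)
test_r2 = final_model.score(X_test, y_test)

print(best_alpha, test_r2)
```

The cost of this design is that the rows set aside for validation are no longer available for training, which matters most on small datasets.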
