How to perform cross-validation using scikit-learn

Key takeaways:
Cross-validation evaluates a machine learning model’s ability to generalize to unseen data.
Scikit-learn offers two main functions for cross-validation: cross_val_score and cross_validate.
cross_val_score uses a single metric to evaluate the model across multiple data splits.
cross_validate allows using multiple metrics for model evaluation.
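Both functions help assess the model’s performance and consistency across different data splits.
Higher cross-validation scores indicate better generalization; scikit-learn negates error metrics (for example, neg_mean_absolute_error) so that higher is always better.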

Cross-validation is a machine learning technique used to evaluate how well a model generalizes. It estimates the model’s performance on unseen data by repeatedly training on one part of the dataset and testing on the rest. Scikit-learn, also known as sklearn, is an open-source Python library for building and evaluating machine learning models.

In this Answer, we will learn how the sklearn Python library performs cross-validation on machine learning models and the benefits of doing so. We’ll analyze the functions that perform cross-validation on datasets.

The sklearn library offers several approaches to performing cross-validation on machine learning models. The two functions we’ll be discussing are cross_val_score and cross_validate.

The `cross_val_score` function

The cross_val_score function cross-validates an estimator on a dataset and returns one score per split. An estimator is the object that represents the machine learning model being trained. The dataset is the collection of data on which the model is trained and tested.

The cross_val_score function accepts several arguments. The five used in this Answer are described below:

Estimator instance: An instance of the estimator (the model) to be trained and evaluated.
Dataset features matrix: A 2D matrix with one row per data point and one column per feature.
Dataset labels: The target values the model is trying to predict.
Iterator: The splitting strategy, passed as cv. If it is an integer, it sets the number of folds; the model is trained and tested once per fold, each time on a different split.
scoring: The metric used to score each split. By default, the score method of the estimator is used. To use a different metric, pass its name as the scoring parameter.

The example below shows how cross_val_score uses a single metric, r2, in its cross-validation process. The r2 metric (the coefficient of determination) measures how much of the variance in the labels the linear regression model explains on the held-out data.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

# Sample features and label data
X = np.array([[1000, 1556, 6566], [2066, 2445, 7665], [3450, 3325, 8365], [4130, 4465, 9524], [5023, 5465, 9645]])
y = np.array([2230, 3560, 6405, 7560, 9302])
# Instance of linear regression
lri = LinearRegression()
# Cross-validation of the estimator with 2 folds, scored with r2
scores = cross_val_score(lri, X, y, cv=2, scoring='r2')
print(f"test_r2: {scores.mean():.2f} with standard deviation {scores.std():.2f}")

Explanation

Lines 1–3: Import numpy, cross_val_score, and LinearRegression.
Lines 6–7: Define the feature matrix and the label vector. Linear regression assumes a roughly linear relationship between the features and the labels.
Line 9: Define lri as the instance of the estimator, LinearRegression.
Line 11: Calculate the scores with cross_val_score using two folds and the r2 metric.
Line 12: Print the mean and standard deviation of the scores.
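The cv argument is not limited to an integer. As a minimal sketch, the example below passes a KFold splitter object instead, which allows shuffling with a fixed seed. The synthetic dataset from make_regression and the choice of five folds are assumptions made for illustration; they are not part of the example above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (an assumption for illustration only)
X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=0)

# An explicit splitter passed as cv: 5 shuffled folds with a fixed seed
splitter = KFold(n_splits=5, shuffle=True, random_state=0)

# cross_val_score returns a NumPy array holding one score per fold
scores = cross_val_score(LinearRegression(), X, y, cv=splitter, scoring="r2")

print(scores.shape)
print(f"mean r2: {scores.mean():.2f}, std: {scores.std():.2f}")
```

Looking at the per-fold array, and not only its mean, shows how much the score varies from split to split.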

The `cross_validate` function

The cross_validate function of the sklearn library lets us specify multiple metrics when evaluating the model. While the scoring parameter in cross_val_score takes a single metric name as a string, in cross_validate it can also take a list of strings naming several metrics.

The cross_validate function takes the same core parameters as cross_val_score. Here is a demonstration of how to use multiple metrics to test the model.

import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LinearRegression

# Sample features and label data
X = np.array([[1000, 1556, 6566], [2066, 2445, 7665], [3450, 3325, 8365], [4130, 4465, 9524], [5023, 5465, 9645]])
y = np.array([2230, 3560, 6405, 7560, 9302])
# Instance of linear regression
lri = LinearRegression()
scoring = ['r2', 'neg_mean_absolute_error', 'neg_mean_absolute_percentage_error']
# Cross-validation of the estimator with 2 folds and several metrics
scores = cross_validate(lri, X, y, cv=2, scoring=scoring)
print(f"test_r2: {scores['test_r2'].mean():.2f} with standard deviation {scores['test_r2'].std():.2f}")
print(f"neg_mean_absolute_error: {scores['test_neg_mean_absolute_error'].mean():.2f} with standard deviation {scores['test_neg_mean_absolute_error'].std():.2f}")
print(f"neg_mean_absolute_percentage_error: {scores['test_neg_mean_absolute_percentage_error'].mean():.2f} with standard deviation {scores['test_neg_mean_absolute_percentage_error'].std():.2f}")

Explanation

Lines 1–3: Import numpy, cross_validate, and LinearRegression.
Lines 6–7: Define the feature matrix and the label vector. Linear regression assumes a roughly linear relationship between the features and the labels.
Line 9: Define lri as the instance of the estimator, LinearRegression.
Line 10: Define the scoring list to hold the names of the metrics r2, neg_mean_absolute_error, and neg_mean_absolute_percentage_error.
Line 12: Calculate the scores with cross_validate using two folds and the metrics in the scoring list.
Lines 13–15: Print the mean and standard deviation of each metric. The keys that hold the per-fold values are test_r2, test_neg_mean_absolute_error, and test_neg_mean_absolute_percentage_error.
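Besides the test_ keys, the dictionary returned by cross_validate also records timing information, and it can include training scores on request. The sketch below illustrates this on a small synthetic dataset; the data, the three folds, and the use of return_train_score=True are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Small synthetic dataset (assumed for illustration): y is a noisy linear function of X
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=60)

results = cross_validate(
    LinearRegression(),
    X,
    y,
    cv=3,
    scoring=["r2", "neg_mean_absolute_error"],
    return_train_score=True,  # also report scores on the training folds
)

# Keys: fit_time, score_time, plus test_<metric> and train_<metric> for each metric
print(sorted(results.keys()))

# "neg_" scorers return negated errors so that higher is better;
# flip the sign to read the mean absolute error itself
mae_per_fold = -results["test_neg_mean_absolute_error"]
print(f"MAE per fold: {mae_per_fold}")
```

Comparing train_r2 with test_r2 is one simple way to spot overfitting: a large gap suggests the model fits the training folds much better than the held-out folds.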

Conclusion

To sum up, two functions perform basic cross-validation on a dataset. The cross_val_score function evaluates the model against a single metric. The cross_validate function, on the other hand, can evaluate the model against multiple metrics passed as a list. Choosing between them therefore comes down to how many metrics we want to report. The results of these functions help us evaluate the generalization ability of the model being trained. The higher and more consistent the scores across splits, the more likely the model is to perform well on unseen data.

Frequently asked questions

How does cross-validation work in scikit-learn?

In scikit-learn, cross-validation splits the dataset into multiple folds, trains the model on some folds, and tests it on the remaining folds. This process repeats, and the results are averaged to assess model performance.
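To make the fold idea concrete, here is a minimal sketch that prints the row indices KFold assigns to training and testing in each round. The tiny six-sample array and the choice of three folds are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Six samples with two features each (toy data, assumed for illustration)
X = np.arange(12).reshape(6, 2)

# Three folds without shuffling: each fold is a contiguous block of rows
kf = KFold(n_splits=3)
folds = list(kf.split(X))

for i, (train_idx, test_idx) in enumerate(folds):
    print(f"fold {i}: train={train_idx}, test={test_idx}")
# fold 0: train=[2 3 4 5], test=[0 1]
# fold 1: train=[0 1 4 5], test=[2 3]
# fold 2: train=[0 1 2 3], test=[4 5]
```

Every sample appears in exactly one test fold, so each data point is used for testing once and for training in the other rounds.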

What are the steps of cross-validation?

Split the data into folds, holding one fold out as the validation set and using the rest as the training set.
Train the model on the training set.
Test the model on the validation set.
Repeat this process so that each fold serves as the validation set once.
Average the results across all folds.
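The steps above can be sketched as a manual loop. This is an illustrative reimplementation under stated assumptions (synthetic data, five unshuffled folds, the r2 metric), not scikit-learn's internal code; the final line runs cross_val_score on the same splits for comparison.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (assumed for illustration): y is a noisy linear function of X
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.2, size=50)

model = LinearRegression()
kf = KFold(n_splits=5)

manual_scores = []
for train_idx, val_idx in kf.split(X):            # Step 1: split into training and validation sets
    fold_model = clone(model)                     # fresh, unfitted copy for each fold
    fold_model.fit(X[train_idx], y[train_idx])    # Step 2: train on the training set
    preds = fold_model.predict(X[val_idx])        # Step 3: test on the validation set
    manual_scores.append(r2_score(y[val_idx], preds))
manual_scores = np.array(manual_scores)           # Step 4: the loop repeats this for every fold

print(f"manual mean r2: {manual_scores.mean():.3f}")  # Step 5: average across folds

# The same folds evaluated by the library function
library_scores = cross_val_score(model, X, y, cv=kf, scoring="r2")
print(f"library mean r2: {library_scores.mean():.3f}")
```

Passing the same KFold object to both approaches keeps the splits identical, so the two sets of per-fold scores can be compared directly.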

Which cross-validation is best?

K-Fold cross-validation is widely used for most cases due to its balance between bias and variance. However, for smaller datasets, Leave-One-Out (LOO) cross-validation might be more appropriate.
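As a rough illustration of the trade-off, the sketch below runs both strategies on the same small synthetic dataset. The data and the five-fold setting are assumptions; the error metric is used because r2 is not well defined on a test fold that contains a single sample, which is what Leave-One-Out produces.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

# Small synthetic dataset (assumed for illustration)
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))
y = X @ np.array([3.0, -1.0]) + rng.normal(scale=0.1, size=20)

kfold = KFold(n_splits=5)
loo = LeaveOneOut()

# K-Fold fits the model 5 times; Leave-One-Out fits it once per sample
print(kfold.get_n_splits(X), loo.get_n_splits(X))

# Negate the "neg_" scores to read them as mean absolute errors
kfold_mae = -cross_val_score(
    LinearRegression(), X, y, cv=kfold, scoring="neg_mean_absolute_error"
)
loo_mae = -cross_val_score(
    LinearRegression(), X, y, cv=loo, scoring="neg_mean_absolute_error"
)

print(f"K-Fold MAE: {kfold_mae.mean():.3f} over {len(kfold_mae)} fits")
print(f"LOO MAE:    {loo_mae.mean():.3f} over {len(loo_mae)} fits")
```

Leave-One-Out uses almost all the data for training in every round, but the number of model fits grows with the number of samples, so it becomes expensive on larger datasets.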
