In scikit-learn, cross-validation splits the dataset into multiple folds, trains the model on all but one fold, and tests it on the held-out fold. This process repeats until each fold has served as the test set, and the results are averaged to assess model performance.
How to perform cross-validation using scikit-learn
Key takeaways:
Cross-validation evaluates a machine learning model’s ability to generalize to unseen data.
Scikit-learn offers two main functions for cross-validation: cross_val_score and cross_validate.
cross_val_score uses a single metric to evaluate the model across multiple data splits.
cross_validate allows using multiple metrics for model evaluation.
Both functions help assess the model’s performance and consistency across different data splits.
Higher cross-validation scores indicate better model generalization.
Cross-validation is a machine learning technique for evaluating how well a model generalizes. It helps assess the model’s ability to perform on unseen data. Scikit-learn, also known as sklearn, is an open-source Python library for building and evaluating machine learning models.
In this Answer, we will learn how the sklearn Python library performs cross-validation on machine learning models and the benefits of doing so. We’ll analyze the functions that perform cross-validation on datasets.
The sklearn library offers several approaches to performing cross-validation on machine learning models. The functions we’ll discuss are cross_val_score and cross_validate.
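Before looking at these functions, it may help to see roughly what cross-validation does step by step. The following is a minimal sketch rather than scikit-learn's internal implementation: it assumes a made-up, noise-free synthetic dataset and uses the KFold splitter to train and score a fresh LinearRegression model on each fold by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic, noise-free linear data (made up purely for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 4.0

# Split the 20 samples into 5 consecutive folds
kfold = KFold(n_splits=5)

fold_scores = []
for train_idx, test_idx in kfold.split(X):
    # Train a fresh model on the training folds only
    model = LinearRegression()
    model.fit(X[train_idx], y[train_idx])
    # Score (R^2 for regressors) on the held-out fold
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

fold_scores = np.array(fold_scores)
print(f"mean score: {fold_scores.mean():.2f}, std: {fold_scores.std():.2f}")
```

Each sample lands in the test fold exactly once, and the mean and standard deviation of the per-fold scores summarize how well and how consistently the model generalizes.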
The cross_val_score function
The cross_val_score function performs cross-validation for a given estimator on a given dataset. An estimator is an object that represents the machine learning model being trained. The dataset is the collection of data on which the model is trained and tested.
The cross_val_score function accepts several arguments. The five used most often are as follows:
Estimator instance (estimator): An instance of the model being trained.
Dataset features matrix (X): A 2D array with one row per data point and one column per feature.
Dataset labels (y): The labels the model is trying to predict.
Iterator (cv): If an integer, it sets the number of folds, so the model is trained and tested that many times, each time on a different split.
scoring: The metric used to evaluate the model on each test fold. The score method of the estimator is used by default. To change the metric, pass its name as the scoring parameter.
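The arguments above can be sketched as follows. This is an illustrative example under stated assumptions: the dataset is synthetic and noise-free, and the KFold splitter with shuffle and random_state is one possible non-integer value for cv, chosen here only to show that cv is not limited to integers.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic, noise-free data (an assumption made for this sketch)
rng = np.random.default_rng(42)
X = rng.normal(size=(30, 2))
y = X @ np.array([3.0, -1.0]) + 2.0

model = LinearRegression()

# cv as an integer: 3 folds, scored with the estimator's own score method
default_scores = cross_val_score(model, X, y, cv=3)

# The same run with the metric named explicitly
r2_scores = cross_val_score(model, X, y, cv=3, scoring="r2")

# cv as a splitter object: 5 shuffled folds, scored with a different metric
splitter = KFold(n_splits=5, shuffle=True, random_state=0)
mae_scores = cross_val_score(
    model, X, y, cv=splitter, scoring="neg_mean_absolute_error"
)

print(default_scores.shape, r2_scores.shape, mae_scores.shape)
```

For a regressor such as LinearRegression, the default score method returns R², so the first two calls should agree. The function returns one score per fold, which is why the array length matches the number of folds.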
The example below shows how cross_val_score uses a single metric, r2, in its cross-validation process. The r2 metric (the coefficient of determination) measures how much of the variation in the labels the linear regression model explains on each held-out fold.
    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.linear_model import LinearRegression

    # Sample features and label data
    X = np.array([[1000, 1556, 6566], [2066, 2445, 7665], [3450, 3325, 8365], [4130, 4465, 9524], [5023, 5465, 9645]])
    y = np.array([2230, 3560, 6405, 7560, 9302])

    # Instance of linear regression
    lri = LinearRegression()

    # Cross-validation with the estimator using 2 folds
    scores = cross_val_score(lri, X, y, cv=2, scoring='r2')

    print(f"test_r2: {scores.mean():.2f} with standard deviation {scores.std():.2f}")

Explanation
Lines 1–3: Import numpy, and import cross_val_score and LinearRegression from sklearn.
Lines 6–7: Define the datasets X and y. Linear regression works best when the labels depend approximately linearly on the features.
Line 10: Define lri as the instance of the estimator, LinearRegression.
Line 13: Calculate the scores with cross_val_score using two folds and the r2 metric.
Line 15: Print the mean and standard deviation of the scores.
The cross_validate function
The cross_validate function of the sklearn library lets us specify multiple metrics when evaluating the model. While the scoring parameter in cross_val_score is a single metric name given as a string, in cross_validate it can also be a list of strings naming several metrics. The function returns a dictionary in which each metric’s per-fold test scores are stored under the key test_ followed by the metric name.
The cross_validate function accepts the same core parameters as cross_val_score. Here is a demonstration of how to use multiple metrics to test the model.
    import numpy as np
    from sklearn.model_selection import cross_validate
    from sklearn.linear_model import LinearRegression

    # Sample features and label data
    X = np.array([[1000, 1556, 6566], [2066, 2445, 7665], [3450, 3325, 8365], [4130, 4465, 9524], [5023, 5465, 9645]])
    y = np.array([2230, 3560, 6405, 7560, 9302])

    # Instance of linear regression
    lri = LinearRegression()

    # Metrics to compute on each fold
    scoring = ['r2', 'neg_mean_absolute_error', 'neg_mean_absolute_percentage_error']

    # Cross-validation with the estimator using 2 folds
    scores = cross_validate(lri, X, y, cv=2, scoring=scoring)

    print(f"test_r2: {scores['test_r2'].mean():.2f} with standard deviation {scores['test_r2'].std():.2f}")
    print(f"neg_mean_absolute_error: {scores['test_neg_mean_absolute_error'].mean():.2f} with standard deviation {scores['test_neg_mean_absolute_error'].std():.2f}")
    print(f"neg_mean_absolute_percentage_error: {scores['test_neg_mean_absolute_percentage_error'].mean():.2f} with standard deviation {scores['test_neg_mean_absolute_percentage_error'].std():.2f}")

Explanation
Lines 1–3: Import numpy, and import cross_validate and LinearRegression from sklearn.
Lines 6–7: Define the datasets X and y. Linear regression works best when the labels depend approximately linearly on the features.
Line 10: Define lri as the instance of the estimator, LinearRegression.
Line 13: Define the scoring list to hold the names of the metrics r2, neg_mean_absolute_error, and neg_mean_absolute_percentage_error.
Line 16: Calculate the scores with cross_validate using two folds and the metrics in the scoring list.
Lines 18–20: Print the mean and standard deviation of the values in the scores dictionary. The keys under which the values are stored are test_r2, test_neg_mean_absolute_error, and test_neg_mean_absolute_percentage_error.
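To make the structure of the returned dictionary more concrete, here is a small sketch on a made-up synthetic dataset with some added noise. The return_train_score option is not used in the example above; it is included here only to show that training-fold scores can be requested as well.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic data with some noise (made up for illustration)
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = X @ np.array([2.0, 0.0, -1.0]) + rng.normal(scale=0.5, size=40)

results = cross_validate(
    LinearRegression(),
    X,
    y,
    cv=4,
    scoring=["r2", "neg_mean_absolute_error"],
    return_train_score=True,  # also report scores on the training folds
)

# Timing entries plus one test_ and one train_ entry per metric
print(sorted(results))

# Error metrics are negated so that higher is better; flip the sign
# to recover the usual mean absolute error per fold
mae_per_fold = -results["test_neg_mean_absolute_error"]
print(mae_per_fold)
```

Every value in the dictionary is an array with one entry per fold, including the fit_time and score_time entries that record how long fitting and scoring took.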
Conclusion
To sum up, two functions perform basic cross-validation on a dataset. The cross_val_score function evaluates the model against a single metric. On the other hand, the cross_validate function accepts multiple metrics in the form of a list. Choosing between them comes down to how many metrics we want to evaluate the model with. The results of these functions help us evaluate the generalization ability of the model being trained. Scikit-learn scorers follow a higher-is-better convention, which is why error metrics carry the neg_ prefix: the higher the score, the more likely the model is to perform well on unseen data.
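As a quick check of how the two functions relate, the sketch below runs both with the same single metric on a made-up synthetic dataset. It relies on cross_validate storing its results under the key test_score when scoring is a single string rather than a list.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, cross_validate

# Synthetic data (made up for illustration)
rng = np.random.default_rng(7)
X = rng.normal(size=(30, 2))
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.3, size=30)

model = LinearRegression()

# One metric through cross_val_score: returns an array of per-fold scores
scores_a = cross_val_score(model, X, y, cv=3, scoring="r2")

# The same metric through cross_validate: returns a dictionary
scores_b = cross_validate(model, X, y, cv=3, scoring="r2")["test_score"]

print(np.allclose(scores_a, scores_b))
```

With an integer cv and a regressor, the folds are formed without shuffling, so both calls see identical splits and should produce matching scores.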