How to implement cross_val_predict in sklearn
Scikit-learn is a popular open-source machine learning library for Python. It provides tools for data preprocessing, modeling, and evaluation. An essential part of building a robust machine learning model is evaluating its performance effectively. To obtain a prediction for every data point from a model that was not trained on that point (known as cross-validated predictions), scikit-learn provides the cross_val_predict function. In this Answer, we'll explore cross_val_predict and its step-by-step implementation.
Understanding cross_val_predict
cross_val_predict is a function that generates a cross-validated prediction for each data point in our dataset. It splits the data into several folds; for each fold, it trains the model on the remaining folds and then makes predictions on the held-out fold. The process repeats once per fold, so the number of repetitions depends on the number of folds we set with cv.
Rather than returning a score, cross_val_predict returns the prediction for each data point, which gives a closer look at the model's behavior and weaknesses. When the number of folds is set to 5, for example, the dataset is divided into five parts, and in each of five rounds a different part serves as the testing data while the remaining four serve as the training data.
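The splitting described above can be sketched by hand. The following is a minimal illustration, not scikit-learn's internal code: it assumes synthetic data from make_regression (not part of the original example) and relies on the documented behavior that an integer cv with a regressor corresponds to an unshuffled KFold.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic data: an assumption that keeps the sketch self-contained
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

model = Ridge()
kfold = KFold(n_splits=5)  # for a regressor, cv=5 corresponds to an unshuffled KFold

# Manual version: each sample is predicted by a model that never saw it
manual_predictions = np.empty_like(y)
for train_idx, test_idx in kfold.split(X):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(X[train_idx], y[train_idx])
    manual_predictions[test_idx] = fold_model.predict(X[test_idx])

# Library version of the same procedure
library_predictions = cross_val_predict(model, X, y, cv=5)

matches = np.allclose(manual_predictions, library_predictions)
print(matches)
```

Each position in the output array is filled exactly once, by the model of the round in which that sample was held out.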
Syntax
The syntax to use cross_val_predict is:
cross_val_predict(estimator, X, y, groups, cv, n_jobs, verbose, fit_params, pre_dispatch, method)
estimator: The estimator object to use. It must implement fit and the method named by the method parameter (predict by default).
X: The features data array to fit.
y: The target array for prediction and training. Default = None.
groups: An array of group identifiers for the samples, used while dividing the dataset into training and testing sets. It is only used in combination with a group-based technique (e.g., GroupKFold).
cv: The cross-validation splitting strategy. An integer value sets the number of folds, None uses the default 5-fold cross-validation, and a splitter object or an iterable of train-test splits is also accepted.
n_jobs: The number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means to use all processors.
verbose: The verbosity level, i.e., how much progress information is printed. Default = 0.
fit_params: A dictionary of parameters to be passed to the estimator's fit method. Newer scikit-learn releases replace this parameter with params.
pre_dispatch: By default, its value is 2*n_jobs. It controls the number of jobs dispatched during parallel execution. By decreasing this quantity, we can prevent excessive memory usage caused by dispatching more tasks than the available CPUs can handle.
method: The estimator method to call. It can be predict, predict_proba, predict_log_proba, or decision_function. By default, its value is predict.
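To make two of the less obvious parameters concrete, here is a hedged sketch of method and groups. The classifier, the synthetic data, and the group layout (12 groups of 10 samples) are all assumptions chosen for illustration and do not come from the original example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_predict

# Synthetic binary classification data (assumption)
X, y = make_classification(n_samples=120, n_features=6, random_state=0)

# Hypothetical grouping: 12 groups with 10 samples each
groups = np.repeat(np.arange(12), 10)

clf = LogisticRegression(max_iter=1000)

# method="predict_proba" returns class probabilities instead of labels
proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")

# groups only takes effect with a group-aware splitter such as GroupKFold,
# which keeps all samples of a group in the same fold
grouped_labels = cross_val_predict(
    clf, X, y, groups=groups, cv=GroupKFold(n_splits=4)
)

print(proba.shape)
print(grouped_labels.shape)
```

With method="predict_proba", the result has one column per class, so each row sums to 1.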
Steps to implement cross_val_predict
Now that we have a clear understanding of cross_val_predict, let's walk through the steps to implement it:
1. Import the necessary libraries
Before we can use cross_val_predict, we need to import the required libraries:
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import Ridge
import numpy as np
import pandas as pd
We have imported cross_val_predict from sklearn's model_selection module. We will be using Ridge regression in this example.
2. Load and prepare the data
Next, we load the data to which we want to apply our machine learning model. For that, we fetch the California housing dataset from scikit-learn.
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import Ridge
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
housing_sk_data = fetch_california_housing()
housing_df = pd.DataFrame(housing_sk_data["data"], columns=housing_sk_data["feature_names"])
housing_df["target"] = housing_sk_data["target"]
x = housing_df.drop("target", axis=1)
y = housing_df["target"]
After importing the data, we prepare our feature matrix (x) and target vector (y).
3. Create an estimator
We will instantiate the machine learning model we want to use. As mentioned earlier, we'll use a Ridge regression model:
model = Ridge()
4. Generate cross-validated predictions
Now, we can use the cross_val_predict function to generate cross-validated predictions:
cross_val_predictions = cross_val_predict(model, x, y , cv = 5)
We set the number of cross-validation folds (cv) to 5, which means that the model is trained and tested 5 times, each time holding out a different fifth of the dataset for testing.
The main difference between predict and cross_val_predict is that predict uses a model that was trained once, on a single training set, whereas cross_val_predict trains and tests the model across the whole dataset in multiple rounds (depending on cv), so every prediction comes from a model that did not see that data point during training.
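The contrast between predict and cross_val_predict can be made concrete with a small sketch. It deliberately swaps in two things that are not in the original example: the bundled diabetes dataset (so nothing is downloaded) and a DecisionTreeRegressor, whose tendency to memorize training data makes the gap easy to see.

```python
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeRegressor

# Bundled dataset and tree model: assumptions chosen for illustration
X, y = load_diabetes(return_X_y=True)
model = DecisionTreeRegressor(random_state=0)

# predict: the model is fitted once and asked about data it has already seen
in_sample = model.fit(X, y).predict(X)

# cross_val_predict: each prediction comes from a clone of the model
# that was fitted without the sample being predicted
out_of_fold = cross_val_predict(model, X, y, cv=5)

in_sample_mse = mean_squared_error(y, in_sample)
out_of_fold_mse = mean_squared_error(y, out_of_fold)

print(f"in-sample MSE:   {in_sample_mse:.1f}")
print(f"out-of-fold MSE: {out_of_fold_mse:.1f}")
```

The in-sample error looks far better than the out-of-fold error here, because the tree has effectively memorized the training targets. The out-of-fold figure is the more honest one.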
5. Analyze the predictions
Now that we have generated cross-validated predictions with cross_val_predict, we can analyze them to understand the model's performance better. For instance, we can identify data points where the model performs well or poorly.
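One possible way to carry out this analysis is sketched below. It uses the bundled diabetes dataset rather than the California housing data (an assumption, to avoid a download), and the choice of residuals, MAE, and R^2 as diagnostics is illustrative rather than prescribed by the original text.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import cross_val_predict

# Bundled dataset used as a stand-in (assumption)
X, y = load_diabetes(return_X_y=True)
predictions = cross_val_predict(Ridge(), X, y, cv=5)

# Residual = true value minus out-of-fold prediction
residuals = y - predictions

# Summary metrics computed on the pooled out-of-fold predictions
mae = mean_absolute_error(y, predictions)
r2 = r2_score(y, predictions)
print(f"MAE: {mae:.2f}, R^2: {r2:.3f}")

# Indices of the five samples with the largest absolute error
worst_idx = np.argsort(np.abs(residuals))[::-1][:5]
for i in worst_idx:
    print(i, y[i], round(predictions[i], 1))
```

Looking at the rows behind worst_idx is a natural next step for spotting outliers or regions of the feature space that the model handles poorly.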
Complete code
The complete code is shown below:
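# Import the cross-validation helper, the model, and the data tools
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import Ridge
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
# Load the California housing data into a DataFrame
housing_sk_data = fetch_california_housing()
housing_df = pd.DataFrame(housing_sk_data["data"], columns=housing_sk_data["feature_names"])
housing_df["target"] = housing_sk_data["target"]
# Separate the feature matrix (x) from the target vector (y)
x = housing_df.drop("target", axis=1)
y = housing_df["target"]
# Create the Ridge regression estimator
model = Ridge()
# Generate one out-of-fold prediction per data point using 5 folds
cross_val_predictions = cross_val_predict(model, x, y, cv=5)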
print(cross_val_predictions)
Benefits of cross_val_predict
cross_val_predict offers several advantages:
Insight into model performance: By obtaining predictions for each data point, we can gain a deeper understanding of where the model performs well and where it struggles.
Data efficiency: Each data point is used for testing exactly once and for training in all the other rounds, which maximizes the dataset's use.
Effective evaluation: Because every prediction is made on data the model did not see during training, metrics computed from these predictions depend less on one particular train-test split than a single split does.
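As a hedged illustration of the evaluation point, the sketch below compares per-fold scores from cross_val_score with a single score computed on the pooled predictions from cross_val_predict. The bundled diabetes dataset and the R^2 metric are assumptions made for this example. The two summaries generally differ, so it is worth being explicit about which one is reported.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_predict, cross_val_score

# Bundled dataset used as a stand-in (assumption)
X, y = load_diabetes(return_X_y=True)
model = Ridge()

# One R^2 score per fold
fold_scores = cross_val_score(model, X, y, cv=5, scoring="r2")

# One R^2 score over all out-of-fold predictions pooled together
pooled_r2 = r2_score(y, cross_val_predict(model, X, y, cv=5))

print("per-fold R^2:", np.round(fold_scores, 3))
print("mean of folds:", round(float(fold_scores.mean()), 3))
print("pooled R^2:", round(float(pooled_r2), 3))
```

The per-fold scores also show how much performance varies between splits, which a single pooled number hides.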
Conclusion
The cross_val_predict function provided by sklearn is a powerful tool for evaluating machine learning models by providing cross-validated predictions. By following the steps explained in this Answer, we can implement cross_val_predict. This enables us to gain insights into the model's behavior across different data subsets so that we may improve it.