How to implement xgb.cv() in Python
XGBoost (eXtreme Gradient Boosting) is a well-known and powerful machine-learning library commonly used for classification and regression tasks.
The xgb.cv() function
Cross-validation helps evaluate machine learning models by testing the model's performance on unseen data while avoiding overfitting to a single train/test split.
The xgb.cv() function runs k-fold cross-validation on a given dataset to effectively estimate model performance and modify hyperparameters. By averaging the performance across multiple folds, it reduces the impact of data randomness.
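To make the fold-averaging idea concrete, here is a minimal conceptual sketch in pure NumPy (the per-fold "scores" are made up for illustration; no model is actually trained):

```python
import numpy as np

# Split 9 sample indices into 3 folds and average a per-fold "score".
indices = np.arange(9)
scores = []
for fold in np.split(indices, 3):
    # In real cross-validation, the model is trained on the other folds
    # and evaluated on this one; here the fold mean stands in for a score.
    scores.append(fold.mean())

# The averaged score is less sensitive to any single split.
print(np.mean(scores))
```

This is exactly what xgb.cv() automates: it performs the splitting, training, and per-fold evaluation for us and reports the averaged metric.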
Syntax
Here is the basic syntax of the function xgb.cv():
xgb.cv(params, dtrain, num_boost_round=10, nfold=3, stratified=False, metrics=(),
       obj=None, feval=None, maximize=False, early_stopping_rounds=None,
       fpreproc=None, as_pandas=True, verbose_eval=None, show_stdv=True,
       seed=0, callbacks=None, shuffle=True)
- params is a required parameter representing a dictionary of XGBoost hyperparameters.
- dtrain is a required parameter representing the DMatrix training data.
- num_boost_round is an optional parameter representing the number of boosting rounds (iterations).
- nfold is an optional parameter representing the number of folds for cross-validation.
- early_stopping_rounds is an optional parameter; if specified, training stops early when the performance does not improve for this many rounds.
- seed is an optional parameter representing a random seed for reproducibility.
- metrics is an optional parameter representing a tuple or list of evaluation metrics to use during cross-validation.
- stratified is an optional parameter that tells whether to perform stratified sampling for cross-validation.
- obj is an optional parameter representing a custom objective function to be optimized during training.
- feval is an optional parameter representing a custom evaluation function used to calculate additional evaluation metrics during training.
- maximize is an optional parameter that tells whether to maximize the evaluation metric.
- fpreproc is an optional parameter representing a function that preprocesses the DMatrix data before training.
- as_pandas is an optional parameter that tells whether to return the cross-validation results as a pandas DataFrame.
- verbose_eval is an optional parameter that controls the verbosity of the evaluation results.
- show_stdv is an optional parameter that tells whether to display the standard deviation of evaluation results during cross-validation.
- callbacks is an optional parameter representing custom callback functions that can be used to customize the training process.
- shuffle is an optional parameter that tells whether to shuffle the data before splitting it into folds for cross-validation.
Note: Make sure you have the XGBoost library installed on your system before running the code below.
Code
Let's demonstrate the use of xgb.cv() with the following code sample:
import xgboost as xgb
import numpy as np

# Creating a smaller synthetic dataset
np.random.seed(42)
X = np.random.rand(50, 3)
y = np.random.randint(0, 2, 50)

# Converting the data to DMatrix
data = xgb.DMatrix(X, label=y)

# Hyperparameters
params = {
    'objective': 'binary:logistic',
    'max_depth': 3,
    'learning_rate': 0.1,
}

# Performing cross-validation
cv_results = xgb.cv(params, data, num_boost_round=10, nfold=3,
                    metrics='logloss', seed=42)

# Printing the results
print(cv_results)
Code explanation
Line 1–2: Firstly, we import the necessary xgb and np modules.
Line 5–7: Now, we create a smaller synthetic dataset with 50 samples and 3 features for our convenience using the random.rand() and random.randint() functions. The variable y is binary, having values 0 or 1.
Line 10: Now, we use xgb.DMatrix() to convert the numpy arrays X and y into a DMatrix named data.
Line 13–17: We create a dictionary named params containing the hyperparameters for our XGBoost model. We set the objective to binary:logistic, the maximum depth of each tree to 3, and the learning rate to 0.1.
Line 20: Here, we call the xgb.cv() function with the specified hyperparameters and data, along with other parameters: num_boost_round=10 boosting rounds, nfold=3 cross-validation folds, and the evaluation metric set to logloss.
Line 24: Finally, we print cv_results, a DataFrame containing the cross-validation results, to the console.
Output
Upon execution, the code will display a table containing cross-validation results for each boosting round and evaluate the model’s performance using log loss.
The output looks something like this:
   train-logloss-mean  train-logloss-std  test-logloss-mean  test-logloss-std
0            0.636732           0.002798           0.672943          0.010068
1            0.602735           0.004768           0.647836          0.015543
2            0.576188           0.005845           0.628812          0.018202
3            0.553682           0.006804           0.613275          0.021482
4            0.535220           0.006787           0.600717          0.023461
5            0.518873           0.006657           0.590129          0.024623
6            0.504942           0.006400           0.580991          0.025339
7            0.492948           0.006396           0.573358          0.025906
8            0.482799           0.006090           0.567531          0.026454
9            0.473649           0.006044           0.563117          0.026462
We can see that the table has four columns that show the log loss values for the training and test sets at each boosting round. The model's performance and variance are estimated using the mean and standard deviation over several cross-validation folds.
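Because as_pandas=True by default, the results come back as a pandas DataFrame, so the best number of boosting rounds can be read off directly. Here is a minimal sketch using a stand-in DataFrame shaped like the table above (only the first three rounds, values copied for illustration):

```python
import pandas as pd

# Stand-in for cv_results, mimicking the first rows of the output above.
cv_results = pd.DataFrame({
    'train-logloss-mean': [0.636732, 0.602735, 0.576188],
    'train-logloss-std':  [0.002798, 0.004768, 0.005845],
    'test-logloss-mean':  [0.672943, 0.647836, 0.628812],
    'test-logloss-std':   [0.010068, 0.015543, 0.018202],
})

# The row with the lowest mean test logloss suggests how many boosting
# rounds to use (rows are 0-indexed rounds, so add 1 for the count).
best_round = cv_results['test-logloss-mean'].idxmin() + 1
best_score = cv_results['test-logloss-mean'].min()
print(best_round, best_score)
```

In our full output the test logloss is still decreasing at round 10, which hints that more boosting rounds might improve the model further.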
Conclusion
Overall, the xgb.cv() function is a useful tool for evaluating the performance of XGBoost models through cross-validation. It provides significant insight into the model's performance and guidance for choosing the number of boosting rounds and folds. With its many evaluation metrics and options, it helps us develop robust and accurate machine-learning models with XGBoost.