How to implement xgb.cv() in Python
XGBoost (eXtreme Gradient Boosting) is a well-known and powerful machine-learning library commonly used for classification and regression tasks.
The xgb.cv() function
Cross-validation helps evaluate machine learning models by testing the model's performance on unseen data while avoiding overfitting to a single train/test split.
The xgb.cv() function runs k-fold cross-validation on a given dataset to effectively estimate model performance and modify hyperparameters. By averaging the performance across multiple folds, it reduces the impact of data randomness.
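To make the fold-averaging idea concrete, here is a minimal conceptual sketch in pure NumPy (the per-fold "scores" are made up for illustration; no model is actually trained):

```python
import numpy as np

# Split 9 sample indices into 3 folds and average a per-fold "score".
indices = np.arange(9)
scores = []
for fold in np.split(indices, 3):
    # In real cross-validation, the model is trained on the other folds
    # and evaluated on this one; here the fold mean stands in for a score.
    scores.append(fold.mean())

# The averaged score is less sensitive to any single split.
print(np.mean(scores))
```

This is exactly what xgb.cv() automates: it performs the splitting, training, and per-fold evaluation for us and reports the averaged metric.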
Syntax
Here is the basic syntax of the function xgb.cv():
xgb.cv(params, dtrain, num_boost_round=10, nfold=3, stratified=False, metrics=(),
       obj=None, feval=None, maximize=False, early_stopping_rounds=None,
       fpreproc=None, as_pandas=True, verbose_eval=None, show_stdv=True,
       seed=0, callbacks=None, shuffle=True)
- params is a required parameter representing a dictionary of XGBoost hyperparameters.
- dtrain is a required parameter representing the DMatrix training data.
- num_boost_round is an optional parameter representing the number of boosting rounds (iterations).
- nfold is an optional parameter representing the number of folds for cross-validation.
- early_stopping_rounds is an optional parameter; if specified, training stops early when the performance does not improve for this many rounds.
- seed is an optional parameter representing a random seed for reproducibility.
- metrics is an optional parameter representing a tuple or list of evaluation metrics to use during cross-validation.
- stratified is an optional parameter that tells whether to perform stratified sampling for cross-validation.
- obj is an optional parameter representing a custom objective function to be optimized during training.
- feval is an optional parameter representing a custom evaluation function used to calculate additional evaluation metrics during training.
- maximize is an optional parameter that tells whether to maximize the evaluation metric.
- fpreproc is an optional parameter representing a function that preprocesses the DMatrix data before training.
- as_pandas is an optional parameter that tells whether to return the cross-validation results as a pandas DataFrame.
- verbose_eval is an optional parameter that controls the verbosity of the evaluation results.
- show_stdv is an optional parameter that tells whether to display the standard deviation of evaluation results during cross-validation.
- callbacks is an optional parameter representing custom callback functions that can be used to customize the training process.
- shuffle is an optional parameter that tells whether to shuffle the data before splitting it into folds for cross-validation.
Note: Make sure you have the XGBoost library installed on your system before running the code below.
Code
Let's demonstrate the use of xgb.cv() with the following code sample:
import xgboost as xgb
import numpy as np

# Creating a smaller synthetic dataset
np.random.seed(42)
X = np.random.rand(50, 3)
y = np.random.randint(0, 2, 50)

# Converting the data to DMatrix
data = xgb.DMatrix(X, label=y)

# Hyperparameters
params = {
    'objective': 'binary:logistic',
    'max_depth': 3,
    'learning_rate': 0.1,
}

# Performing cross-validation
cv_results = xgb.cv(params, data, num_boost_round=10, nfold=3,
                    metrics='logloss', seed=42)

# Printing the results
print(cv_results)
Code explanation
Line 1–2: Firstly, we import the necessary xgb and np modules.
Line 5–7: Now, we create a smaller synthetic dataset with 50 samples and 3 features for our convenience using the random.rand() and random.randint() functions. The variable y is binary, having values 0 or 1.
Line 10: Now, we use xgb.DMatrix() to convert the numpy arrays X and y into a DMatrix named data.
Line 13–17: We create a dictionary named params containing the hyperparameters for our XGBoost model. We set the objective to binary:logistic, the maximum depth of each tree to 3, and the learning rate to 0.1.
Line 20: Here, we call the xgb.cv() function with the specified hyperparameters and data, along with other parameters: num_boost_round=10 boosting rounds, nfold=3 cross-validation folds, and the evaluation metric set to logloss.
Line 24: Finally, we print cv_results, a DataFrame containing the cross-validation results, to the console.
Output
Upon execution, the code will display a table containing cross-validation results for each boosting round and evaluate the model’s performance using log loss.
The output looks something like this:
   train-logloss-mean  train-logloss-std  test-logloss-mean  test-logloss-std
0            0.636732           0.002798           0.672943          0.010068
1            0.602735           0.004768           0.647836          0.015543
2            0.576188           0.005845           0.628812          0.018202
3            0.553682           0.006804           0.613275          0.021482
4            0.535220           0.006787           0.600717          0.023461
5            0.518873           0.006657           0.590129          0.024623
6            0.504942           0.006400           0.580991          0.025339
7            0.492948           0.006396           0.573358          0.025906
8            0.482799           0.006090           0.567531          0.026454
9            0.473649           0.006044           0.563117          0.026462
We can see that the table has four columns that show the log loss values for the training and test sets at each boosting round. The model's performance and variance are estimated using the mean and standard deviation over several cross-validation folds.
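Because as_pandas=True by default, the results come back as a pandas DataFrame, so the best number of boosting rounds can be read off directly. Here is a minimal sketch using a stand-in DataFrame shaped like the table above (only the first three rounds, values copied for illustration):

```python
import pandas as pd

# Stand-in for cv_results, mimicking the first rows of the output above.
cv_results = pd.DataFrame({
    'train-logloss-mean': [0.636732, 0.602735, 0.576188],
    'train-logloss-std':  [0.002798, 0.004768, 0.005845],
    'test-logloss-mean':  [0.672943, 0.647836, 0.628812],
    'test-logloss-std':   [0.010068, 0.015543, 0.018202],
})

# The row with the lowest mean test logloss suggests how many boosting
# rounds to use (rows are 0-indexed rounds, so add 1 for the count).
best_round = cv_results['test-logloss-mean'].idxmin() + 1
best_score = cv_results['test-logloss-mean'].min()
print(best_round, best_score)
```

In our full output the test logloss is still decreasing at round 10, which hints that more boosting rounds might improve the model further.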
Conclusion
Overall, the xgb.cv() function is a useful tool for evaluating the performance of XGBoost models through cross-validation. It provides significant insight into the model's performance and guidance for choosing the number of boosting rounds and folds. With its many evaluation metrics and options, it helps us develop robust and accurate machine-learning models with XGBoost.