How to implement xgb.DMatrix() in Python
XGBoost (eXtreme Gradient Boosting) is a popular open-source machine-learning library known for its speed and strong predictive performance, particularly on structured (tabular) data.
The xgb.DMatrix() function
The xgboost.DMatrix() function creates a specialized data structure called DMatrix (short for Data Matrix). This data structure is optimized for memory efficiency and faster computation, making it ideal for large-scale datasets.
Once created, a DMatrix can be passed directly to XGBoost's training routines for tasks such as classification and regression. It can also be used for cross-validation and hyperparameter tuning of models.
Syntax
The syntax of the xgb.DMatrix() function is given below:
xgb.DMatrix(data, label=None, weight=None, base_margin=None, missing=None,
            silent=False, feature_names=None, feature_types=None, nthread=None,
            group=None, qid=None, label_lower_bound=None, label_upper_bound=None,
            feature_weights=None, enable_categorical=False,
            data_split_mode=DataSplitMode.ROW)
- data is a required parameter representing the input data for the DMatrix.
- label is an optional parameter that holds the target labels for the training data.
- weight is an optional parameter representing the weight for each instance.
- base_margin is an optional parameter that specifies the initial prediction score for the model.
- missing is an optional parameter representing the value treated as missing in the data. By default, it is set to None.
- silent is an optional parameter that controls whether to print messages during DMatrix creation. It is set to False by default.
- feature_names is an optional parameter representing a list of feature names, which will be used to name the columns of the DMatrix.
- feature_types is an optional parameter representing a list of strings that specify the types of features. It can be 'int', 'float', 'i', 'q', 'u', or 's'.
- nthread is an optional parameter representing the number of threads for converting data. If not specified, the maximum number of available threads will be used.
- group is an optional parameter representing the group or query ID for ranking tasks.
- qid is an optional parameter representing a query ID for ranking tasks, similar to the group parameter.
- label_lower_bound is an optional parameter representing the lower bound of the label values.
- label_upper_bound is an optional parameter representing the upper bound of the label values.
- feature_weights is an optional parameter representing a weight for each feature.
- enable_categorical is an optional parameter. If set to True, categorical features are treated as such during training and prediction.
- data_split_mode is an optional parameter specifying how data splits are performed when using different data containers. The default is DataSplitMode.ROW.
Note: Make sure you have the XGBoost library installed. It can be installed with pip install xgboost.
Code
Let's illustrate the use of xgb.DMatrix() with a basic code example using the diabetes dataset:
```python
import xgboost as xgb
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# Loading the diabetes dataset
data = load_diabetes()
X, y = data.data, data.target

# Splitting the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating a DMatrix for training and testing data
D_train = xgb.DMatrix(X_train, label=y_train)
D_test = xgb.DMatrix(X_test, label=y_test)

# Printing basic information about the DMatrix
print("Number of training samples in DMatrix:", D_train.num_row())
print("Number of features in DMatrix:", D_train.num_col())
```
Code explanation
- Line 1–3: Firstly, we import the necessary modules: xgb, pd, and load_diabetes from the sklearn.datasets module to load the dataset.
- Line 4: Next, we import the train_test_split function from the sklearn.model_selection module to split the dataset into training and test sets.
- Line 7: Now, we fetch and store the diabetes dataset in the data variable.
- Line 8: We separate the features X and target labels y from the loaded dataset in this line.
- Line 11: Here, we split the data into training and test sets using train_test_split. It takes the features X and target labels y as input and splits them. The test set size is 0.2, which makes up 20% of the whole dataset, and the random state is 42 for reproducibility.
- Line 14–15: In these lines, we create two instances of xgb.DMatrix(). Each takes the feature data and the target labels as arguments, for the training and testing data separately.
- Line 18–19: Finally, we print the number of samples and features in the DMatrix using the num_row() and num_col() methods of the DMatrix object.
Output
Upon execution, the code will show the number of samples and features in the DMatrix created from the diabetes dataset.
The output looks like this:
Number of training samples in DMatrix: 353
Number of features in DMatrix: 10
Conclusion
Overall, the xgboost.DMatrix() function is a vital part of XGBoost: by storing data in a memory-efficient, computation-friendly format, it improves performance and simplifies training on large-scale datasets. This makes it essential for applying XGBoost effectively in real-world machine-learning applications.