How to implement xgb.DMatrix() in Python
XGBoost (eXtreme Gradient Boosting) is a popular open-source machine-learning library known for its speed and strong predictive performance, particularly on structured (tabular) data.
The xgb.DMatrix() function
The xgboost.DMatrix() function creates a specialized data structure called DMatrix (short for Data Matrix). This data structure is optimized for memory efficiency and faster computation, making it ideal for large-scale datasets.
Once created, a DMatrix can be passed directly to XGBoost's training routines for tasks such as classification and regression. It can also be used for cross-validation and hyperparameter tuning of models.
Syntax
The syntax of the xgb.DMatrix() function is given below:
xgb.DMatrix(data, label=None, weight=None, base_margin=None, missing=None,
            silent=False, feature_names=None, feature_types=None, nthread=None,
            group=None, qid=None, label_lower_bound=None, label_upper_bound=None,
            feature_weights=None, enable_categorical=False,
            data_split_mode=DataSplitMode.ROW)
- data is a required parameter representing the input data for the DMatrix.
- label is an optional parameter that holds the target labels for the training data.
- weight is an optional parameter representing the weight for each instance.
- base_margin is an optional parameter that specifies the initial prediction score for the model.
- missing is an optional parameter representing the value treated as missing in the data. By default, it is set to None.
- silent is an optional parameter that controls whether to print messages during DMatrix creation. It is set to False by default.
- feature_names is an optional parameter representing a list of feature names, which will be used to name the columns of the DMatrix.
- feature_types is an optional parameter representing a list of strings that specify the types of features. It can be 'int', 'float', 'i', 'q', 'u', or 's'.
- nthread is an optional parameter representing the number of threads for converting data. If not specified, the maximum number of available threads will be used.
- group is an optional parameter representing the group or query ID for ranking tasks.
- qid is an optional parameter representing a query ID for ranking tasks, similar to the group parameter.
- label_lower_bound is an optional parameter representing the lower bound of the label values.
- label_upper_bound is an optional parameter representing the upper bound of the label values.
- feature_weights is an optional parameter representing a weight for each feature.
- enable_categorical is an optional parameter. If set to True, categorical features are treated as such during training and prediction.
- data_split_mode is an optional parameter specifying how data splits are performed when using different data containers. The default is DataSplitMode.ROW.
Note: Make sure you have the XGBoost library installed. It can be installed with pip install xgboost.
Code
Let's illustrate the use of xgb.DMatrix() with a basic code example using the diabetes dataset:
```python
import xgboost as xgb
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# Loading the diabetes dataset
data = load_diabetes()
X, y = data.data, data.target

# Splitting the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating a DMatrix for training and testing data
D_train = xgb.DMatrix(X_train, label=y_train)
D_test = xgb.DMatrix(X_test, label=y_test)

# Printing basic information about the DMatrix
print("Number of training samples in DMatrix:", D_train.num_row())
print("Number of features in DMatrix:", D_train.num_col())
```
Code explanation
- Line 1–3: Firstly, we import the necessary modules: xgb, pd, and load_diabetes from the sklearn.datasets module to load the dataset.
- Line 4: Next, we import the train_test_split function from the sklearn.model_selection module to split the dataset into training and test sets.
- Line 7: Now, we fetch and store the diabetes dataset in the data variable.
- Line 8: We separate the features X and target labels y from the loaded dataset in this line.
- Line 11: Here, we split the data into training and test sets using train_test_split. It takes the features X and target labels y as input and splits them. The test set size is 0.2, which makes up 20% of the whole dataset, and the random state is 42 for reproducibility.
- Line 14–15: In these lines, we create two instances of xgb.DMatrix(). Each takes the feature data and the target labels as arguments, for the training and testing data separately.
- Line 18–19: Finally, we print the number of samples and features in the DMatrix using the num_row() and num_col() methods of the DMatrix object.
Output
Upon execution, the code will show the number of samples and features in the DMatrix created from the diabetes dataset.
The output looks like this:
Number of training samples in DMatrix: 353
Number of features in DMatrix: 10
Conclusion
Overall, the xgboost.DMatrix() function is a vital part of XGBoost: by storing data in a memory-efficient, computation-friendly format, it improves performance and simplifies training on large-scale datasets. This makes it essential for applying XGBoost effectively in real-world machine-learning applications.