XGBoost is a powerful machine learning algorithm based on gradient boosting, designed to improve prediction accuracy. It works by building an ensemble of decision trees, where each new tree learns from the mistakes of the previous ones to improve overall performance. XGBoost is especially known for its speed and efficiency in handling large datasets.
Classification using XGBoost in Python
Key takeaways:
XGBoost is a high-performance algorithm optimized for speed and efficiency, making it suitable for large datasets and tasks requiring fast predictions, such as real-time applications.
XGBoost includes L1 and L2 regularization methods, which help reduce overfitting and improve the model’s ability to generalize well to new data.
Unlike many algorithms, XGBoost can handle missing values without needing special preprocessing, which simplifies data preparation.
XGBoost’s XGBClassifier is tailored for classification and offers many tuning parameters, such as max_depth, learning_rate, and n_estimators, to help users improve performance based on specific needs. With precise control over model complexity and feature selection, XGBoost can avoid both over-complicated and overly simplistic models.
XGBoost is commonly used across various machine learning problems, including regression and ranking tasks, and is well-integrated with Python libraries, making it versatile for different applications.
XGBoost (eXtreme Gradient Boosting) is a powerful and widely used machine learning algorithm for supervised learning tasks like classification, regression, and ranking. It is built on the gradient boosting architecture and has grown in popularity thanks to its high accuracy and scalability.
XGBoost is built to handle large-scale datasets and works seamlessly with other Python machine-learning libraries.
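For example, because XGBClassifier implements the scikit-learn estimator API, it can be dropped into a standard scikit-learn workflow. Here is a minimal sketch combining it with a Pipeline and GridSearchCV; the parameter grid values are illustrative assumptions, not recommendations:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import xgboost as xgb

# Load a small benchmark dataset
X, y = load_iris(return_X_y=True)

# XGBClassifier follows the scikit-learn estimator API, so it drops
# straight into a Pipeline like any other estimator
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", xgb.XGBClassifier()),
])

# Illustrative search grid; these values are assumptions, not recommendations
param_grid = {
    "clf__max_depth": [3, 5],
    "clf__learning_rate": [0.1, 0.3],
}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)
print("Best parameters:", search.best_params_)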
Why use XGBoost?
We mainly use XGBoost because it offers many essential features that make it ideal for classification tasks. Some of the features are given below:
High performance: As mentioned above, XGBoost is optimized for speed and efficiency, making it appropriate for large datasets and real-time applications.
Regularization methods: L1 (Lasso) and L2 (Ridge) regularization terms are included in XGBoost to avoid overfitting and increase generalization (see the sketch after this list).
Handles missing data: Moreover, XGBoost can handle missing data automatically, minimizing the need for preprocessing and imputation.
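To illustrate the last two points, here is a minimal sketch that passes L1/L2 regularization strengths and trains directly on data containing a NaN. The reg_alpha and reg_lambda values are illustrative assumptions, not tuned settings:

import numpy as np
import xgboost as xgb

# Tiny toy dataset with a missing value; XGBoost routes NaNs through
# a learned default direction at each split, so no imputation is needed
X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 1.0], [4.0, 0.5]])
y = np.array([0, 0, 1, 1])

# reg_alpha is the L1 term, reg_lambda the L2 term; the values here
# are illustrative assumptions, not tuned settings
model = xgb.XGBClassifier(reg_alpha=0.1, reg_lambda=1.0)
model.fit(X, y)
print(model.predict(X))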
Classification in XGBoost
Classification is one of the most frequent XGBoost applications. Based on the input features, it predicts a discrete class label. Classification is accomplished with the XGBClassifier class, which is built specifically for classification tasks.
Syntax of XGBClassifier
The XGBClassifier class in XGBoost provides several hyperparameters that may be adjusted to improve performance.
Here is the basic syntax for generating an XGBoost classifier:
model = xgb.XGBClassifier(
    objective='multi:softprob',
    num_class=num_classes,
    max_depth=max_depth,
    learning_rate=learning_rate,
    subsample=subsample,
    colsample_bytree=colsample,
    n_estimators=num_estimators
)
objective='multi:softprob' is an optional parameter specifying the objective function used for multi-class classification, which returns a probability score for each class. The default value for objective is 'binary:logistic' for binary classification.
num_class=num_classes is a required parameter for multi-class classification tasks and represents the number of classes in the dataset.
max_depth=max_depth is an optional parameter representing the maximum depth of each decision tree.
learning_rate=learning_rate is an optional parameter that applies step-size shrinkage to prevent overfitting. Overfitting occurs when a machine learning model performs well on the training data but poorly on unseen data, indicating it has memorized the training set and lacks generalization.
subsample=subsample is an optional parameter representing the fraction of samples used for each tree.
colsample_bytree=colsample is an optional parameter representing the fraction of features used for each tree.
n_estimators=num_estimators is an optional parameter that determines the number of boosting iterations and controls the overall complexity of the model.
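For instance, a classifier for a three-class problem such as Iris might be configured as follows; the hyperparameter values here are illustrative assumptions, not tuned recommendations:

import xgboost as xgb

# Illustrative hyperparameter values (assumptions, not tuned recommendations)
model = xgb.XGBClassifier(
    objective='multi:softprob',  # probability score per class for multi-class tasks
    num_class=3,                 # e.g., three species in the Iris dataset
    max_depth=4,                 # limit tree depth to control complexity
    learning_rate=0.1,           # shrink each tree's contribution
    subsample=0.8,               # use 80% of the rows for each tree
    colsample_bytree=0.8,        # use 80% of the features for each tree
    n_estimators=100             # number of boosting rounds
)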
Note: Make sure you have the XGBoost library installed; it can typically be installed with pip install xgboost.
Code example
The Iris dataset, which comprises 150 examples of iris flowers described by four features (sepal length, sepal width, petal length, and petal width), is well-known in machine learning. We aim to classify three iris flower species: setosa, versicolor, and virginica.
Let’s demonstrate classification with the XGBoost library on the Iris dataset:
import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Loading the Iris dataset
data = load_iris()
X, y = data.data, data.target

# Splitting the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating an XGBoost classifier
model = xgb.XGBClassifier()

# Training the model on the training data
model.fit(X_train, y_train)

# Making predictions on the test set
predictions = model.predict(X_test)

# Calculating accuracy
accuracy = accuracy_score(y_test, predictions)

print("Accuracy:", accuracy)
print("\nClassification Report:")
print(classification_report(y_test, predictions, target_names=data.target_names))
Explanation
Line 1–2: Firstly, we import the necessary modules and functions: the xgb module and the load_iris function from scikit-learn’s datasets module to load the famous Iris dataset.
Line 3–4: Next, we import the train_test_split function from scikit-learn’s model_selection module to split the dataset into training and test sets, and the accuracy_score and classification_report functions from scikit-learn’s metrics module to evaluate the model’s performance.
Line 7: Now, we load the Iris dataset using load_iris() and store it in the data variable.
Line 8: We separate the features X and target labels y from the loaded dataset in this line.
Line 11: Here, we split the data into training and test sets using train_test_split. It takes the features X and target labels y as input and splits them. The test set size is 0.2, which makes up 20% of the whole dataset, and the random state is 42 to provide consistency.
Line 14: We create an XGBoost classifier using the XGBClassifier class with default hyperparameters.
Line 17: We train the XGBoost classifier on the training data X_train, y_train using the fit method.
Line 20: Next, we predict target labels on the test set X_test using our trained model and the predict method.
Line 23: Moving on, we calculate the model’s accuracy by comparing the predicted target labels predictions with the true target labels from the test set y_test.
Line 25–27: Finally, we print the model’s accuracy on the test set and the classification report, which contains precision, recall, F1-score, and support for each class in the Iris dataset. Instead of numerical indices, the target names are passed to show the class labels, i.e., the species names.
Output
Upon execution, the code will show the model’s accuracy on the test set and the detailed classification report with precision, recall, F1-score, and support for each class.
The output shows that the model achieved an accuracy of 100%, meaning it correctly classified all samples. The precision, recall, and F1-score are also perfect, i.e., 1.00 for each class, indicating that the model predicted each class without any mistakes. This result shows that the model performed exceptionally well on this dataset.
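Because the test set holds only 30 samples, a perfect score is plausible on a dataset as small and well-separated as Iris. For a more robust estimate, a quick check with scikit-learn’s cross_val_score is one option; the 5-fold setting below is an assumption, not part of the original example:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
import xgboost as xgb

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation gives a more robust accuracy estimate
# than a single train/test split
scores = cross_val_score(xgb.XGBClassifier(), X, y, cv=5)
print("Cross-validated accuracy:", scores.mean())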
Conclusion
To conclude, XGBoost is a powerful library for machine learning tasks, especially classification. It offers high performance and regularization strategies that make it suitable for various applications. Using XGBoost’s capabilities, we obtained 100% accuracy in classifying Iris flowers into their respective species. XGBoost’s versatility and efficiency make it a potent tool for various real-world classification problems.
If you’re curious to learn more about how XGBoost is used in machine learning, check out these helpful resources:
Text Classification Using PyTorch: This project provides you with hands-on experience of building a text classifier.
Predict Frog Toxicity with Python and XGBoost: Explore a fascinating challenge to predict the toxicity of frogs based on their luminosity using XGBoost.
Frequently asked questions
What is XGBoost algorithm classification?
Classification with XGBoost means using the XGBClassifier to predict discrete class labels from input features with an ensemble of gradient-boosted decision trees.
Is XGBoost supervised or unsupervised learning?
XGBoost is a supervised learning algorithm: it learns from labeled training data for tasks such as classification, regression, and ranking.
Is XGBoost good for classification?
Yes. Its speed, built-in L1/L2 regularization, and automatic handling of missing values make it a strong choice for classification tasks.
Is XGBoost regression or classification?
Both. XGBoost supports regression, classification, and ranking tasks.
How is XGBoost different from random forest?
A random forest trains its trees independently on random subsets of the data and averages their outputs (bagging), whereas XGBoost builds trees sequentially, with each new tree correcting the errors of the ensemble so far (boosting).