Regression using XGBoost in Python
XGBoost (eXtreme Gradient Boosting) is a well-known and robust machine learning algorithm often used for supervised learning tasks such as classification, regression, and ranking. It is based on the gradient boosting framework and has gained popularity because of its high accuracy and scalability.
Its versatility allows it to handle large datasets and model complex relationships in the data.
Why do we use XGBoost?
We typically choose XGBoost because it offers several features that are useful for regression tasks.
Some of the reasons are as follows:
Speed and efficiency: XGBoost is highly optimized and supports parallel processing, making it much faster than traditional gradient boosting implementations.
Handling non-linear relationships: It can capture complex relationships between input features and target variables.
Feature importance: XGBoost reports per-feature importance scores, which support feature selection and help explain model behavior (see the sketch after this list).
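As an illustration, a trained regressor exposes these scores through its feature_importances_ attribute. The snippet below is a minimal sketch on synthetic data; the random dataset and feature names are assumptions made purely for demonstration.

import numpy as np
import xgboost as xgb

# Synthetic data: three features, only the first two actually drive the target
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)

model = xgb.XGBRegressor(n_estimators=50)
model.fit(X, y)

# Higher scores indicate features the trees relied on more heavily
for name, score in zip(["f0", "f1", "f2"], model.feature_importances_):
    print(name, round(float(score), 3))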
Regression in XGBoost
In XGBoost, regression refers to predicting continuous numerical values. It is widely used to estimate quantities such as housing prices, sales, or stock prices, where the target variable is a continuous output.
Syntax of XGBRegressor
The XGBRegressor in Python is the regression-specific implementation of XGBoost and is used for regression problems where the intent is to predict continuous numerical values.
Here is the basic syntax to create an XGBRegressor model:
import xgboost as xgb

model = xgb.XGBRegressor(
    objective='reg:squarederror',
    max_depth=max_depth,
    learning_rate=learning_rate,
    subsample=subsample,
    colsample_bytree=colsample,
    n_estimators=num_estimators
)
objective: the objective function to use. For regression, it is set to 'reg:squarederror', which uses squared loss; this is also the default for XGBRegressor.
max_depth: an optional parameter that sets the maximum depth of each decision tree.
learning_rate: an optional parameter that controls the step size shrinkage, which helps prevent overfitting. Overfitting occurs when a machine learning model performs well on the training data but poorly on unseen data, indicating it has memorized the training set and lacks generalization.
subsample: an optional parameter representing the fraction of samples used for each tree.
colsample_bytree: an optional parameter representing the fraction of features used for each tree.
n_estimators: an optional parameter that determines the number of boosting rounds and controls the overall complexity of the model.
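For instance, the placeholders above could be filled in with concrete values. The numbers below are illustrative choices for demonstration only, not tuned recommendations.

import xgboost as xgb

# Illustrative hyperparameter values (assumed, not tuned)
model = xgb.XGBRegressor(
    objective='reg:squarederror',
    max_depth=4,            # shallower trees to limit overfitting
    learning_rate=0.1,      # moderate step size shrinkage
    subsample=0.8,          # each tree sees 80% of the rows
    colsample_bytree=0.8,   # each tree sees 80% of the features
    n_estimators=300        # number of boosting rounds
)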
Note: Make sure you have the XGBoost library installed (for example, with pip install xgboost). Learn more about installing XGBoost on your system here.
Code
In our code, we will use the California Housing dataset, which provides information on California's housing districts. The dataset contains input features X and a target variable y representing the median house value for each district.
Let's walk through the regression process on this dataset using the XGBoost framework:
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Loading the California housing dataset
data = fetch_california_housing(as_frame=True)
X, y = data.data, data.target

# Splitting the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating an XGBoost regressor
model = xgb.XGBRegressor()

# Training the model on the training data
model.fit(X_train, y_train)

# Making predictions on the test set
predictions = model.predict(X_test)

# Calculate the mean squared error and R-squared score
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

print("Mean Squared Error:", mse)
print("R-squared Score:", r2)
Code explanation
Line 1–2: Firstly, we import the necessary modules: the xgb module, and fetch_california_housing from scikit-learn's datasets module to load the California housing dataset.
Line 3–4: Next, we import train_test_split from scikit-learn's model_selection module to split the dataset into training and test sets, and mean_squared_error and r2_score from the metrics module to check errors and scores.
Line 7: Now, we fetch the California housing dataset and store it in the data variable.
Line 8: We separate the features X and target labels y from the loaded dataset in this line.
Line 11: Here, we split the data into training and test sets using train_test_split. It takes the features X and target labels y as input and splits them. The test set size is 0.2, which makes up 20% of the whole dataset, and the random state is 42 to provide consistency.
Line 14: We create an instance of the XGBoost regressor using xgb.XGBRegressor() with default hyperparameters.
Line 17: Here, we train the model on the training data using the fit method.
Line 20: Next, we predict target labels on the test set X_test using our trained model and the predict method.
Line 23–24: Moving on, we calculate the mean squared error and R-squared score to evaluate the model's performance. The mean squared error is a measure of the average squared difference between the predicted and actual target values; a lower MSE indicates that the model's predictions are closer to the true values, signifying better performance. The R-squared (R2) score is a statistical measure representing the proportion of the variance in the target variable that is predictable from the input features; it ranges from 0 to 1, where a score of 1 indicates a perfect fit.
Line 26–27: Finally, we print the model's mean squared error and R-squared score on the console.
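To make these two metrics concrete, they can also be computed directly from their definitions. The short sketch below mirrors what mean_squared_error and r2_score calculate, using a small hypothetical set of actual and predicted values chosen only for illustration.

import numpy as np

# Hypothetical actual and predicted values (for illustration only)
y_true = np.array([2.5, 1.8, 3.2, 2.0])
y_pred = np.array([2.4, 2.0, 3.0, 1.9])

# MSE: average squared difference between predictions and actual values
mse = np.mean((y_true - y_pred) ** 2)

# R2: 1 minus the ratio of residual variance to total variance
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MSE:", mse)  # about 0.025
print("R2:", r2)    # about 0.91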
Output
Upon execution, the code will show the mean squared error and R-squared score to evaluate the model's performance.
The output looks something like this:
Mean Squared Error: 0.22458289556216388
R-squared Score: 0.828616180679985
In the above example, the calculated MSE is around 0.22, indicating that the XGBoost regressor's predictions are reasonably close to the actual values.
The R2 score of about 0.829 shows that the XGBoost regressor explains roughly 83% of the variation in the target variable, indicating a good fit.
Let's further improve the performance of the XGBoost model with parameter tuning. For example, setting the max_depth and n_estimators parameters explicitly, as shown below, improves the model's performance on this dataset.
# Creating an XGBoost regressor
model = xgb.XGBRegressor(max_depth=4, n_estimators=500)

# Training the model on the training data
model.fit(X_train, y_train)

# Making predictions on the test set
predictions = model.predict(X_test)

# Calculate the mean squared error and R-squared score
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

print("Mean Squared Error:", mse)
print("R-squared Score:", r2)
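Rather than picking values by hand, the search can also be automated. The sketch below uses scikit-learn's GridSearchCV to try a few combinations of max_depth and n_estimators with cross-validation; the specific grid values are illustrative assumptions, not recommended settings.

from sklearn.model_selection import GridSearchCV

# Illustrative search space (assumed values, not tuned recommendations)
param_grid = {
    "max_depth": [3, 4, 6],
    "n_estimators": [100, 300, 500],
}

search = GridSearchCV(
    estimator=xgb.XGBRegressor(objective='reg:squarederror'),
    param_grid=param_grid,
    scoring="neg_mean_squared_error",  # scikit-learn negates MSE so higher is better
    cv=3,
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best CV MSE:", -search.best_score_)

The best estimator found by the search can then be evaluated on the held-out test set exactly as before.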
Conclusion
In conclusion, XGBoost is a widely used framework for regression problems. Its ability to handle complex datasets, together with its efficient gradient-boosting implementation, makes it well suited to models that predict continuous numerical values accurately. Its ongoing development keeps XGBoost among the leading regression approaches, making it a valuable tool for regression analysis in machine learning.