Regression using XGBoost in Python
XGBoost (eXtreme Gradient Boosting) is a well-known and robust machine learning algorithm often used for supervised learning tasks such as classification, regression, and ranking. It is based on the gradient boosting framework and has gained popularity because of its high accuracy and scalability.
Its versatility allows it to handle large datasets and model complex relationships in the data.
Why do we use XGBoost?
We typically choose XGBoost because it offers several features that are useful for regression tasks.
Some of the reasons are as follows:
Speed and efficiency: XGBoost is highly optimized and supports parallel processing, making it much faster than traditional gradient boosting implementations.
Handling non-linear relationships: It can capture complex relationships between input features and target variables.
Feature importance: XGBoost reports per-feature importance scores, which support feature selection and help explain model behavior (see the sketch after this list).
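As an illustration, a trained regressor exposes these scores through its feature_importances_ attribute. The snippet below is a minimal sketch on synthetic data; the random dataset and feature names are assumptions made purely for demonstration.

import numpy as np
import xgboost as xgb

# Synthetic data: three features, only the first two actually drive the target
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)

model = xgb.XGBRegressor(n_estimators=50)
model.fit(X, y)

# Higher scores indicate features the trees relied on more heavily
for name, score in zip(["f0", "f1", "f2"], model.feature_importances_):
    print(name, round(float(score), 3))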
Regression in XGBoost
In XGBoost, regression refers to predicting continuous numerical values. It is widely used to estimate quantities such as housing prices, sales, or stock prices, where the target variable is a continuous output.
Syntax of XGBRegressor
The XGBRegressor in Python is the regression-specific implementation of XGBoost and is used for regression problems where the intent is to predict continuous numerical values.
Here is the basic syntax to create an XGBRegressor model:
import xgboost as xgb

model = xgb.XGBRegressor(
    objective='reg:squarederror',
    max_depth=max_depth,
    learning_rate=learning_rate,
    subsample=subsample,
    colsample_bytree=colsample,
    n_estimators=num_estimators
)
objective: the objective function to use. For regression, it is set to 'reg:squarederror', which uses squared loss; this is also the default for XGBRegressor.
max_depth: an optional parameter that sets the maximum depth of each decision tree.
learning_rate: an optional parameter that controls the step size shrinkage, which helps prevent overfitting. Overfitting occurs when a machine learning model performs well on the training data but poorly on unseen data, indicating it has memorized the training set and lacks generalization.
subsample: an optional parameter representing the fraction of samples used for each tree.
colsample_bytree: an optional parameter representing the fraction of features used for each tree.
n_estimators: an optional parameter that determines the number of boosting rounds and controls the overall complexity of the model.
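For instance, the placeholders above could be filled in with concrete values. The numbers below are illustrative choices for demonstration only, not tuned recommendations.

import xgboost as xgb

# Illustrative hyperparameter values (assumed, not tuned)
model = xgb.XGBRegressor(
    objective='reg:squarederror',
    max_depth=4,            # shallower trees to limit overfitting
    learning_rate=0.1,      # moderate step size shrinkage
    subsample=0.8,          # each tree sees 80% of the rows
    colsample_bytree=0.8,   # each tree sees 80% of the features
    n_estimators=300        # number of boosting rounds
)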
Note: Make sure you have the XGBoost library installed (for example, with pip install xgboost). Learn more about installing XGBoost on your system here.
Code
In our code, we will use the California Housing dataset, which provides information on California's housing districts. The dataset contains input features X and a target variable y representing the median house value for each district.
Let's walk through the regression process on this dataset using the XGBoost framework:
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Loading the California housing dataset
data = fetch_california_housing(as_frame=True)
X, y = data.data, data.target

# Splitting the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating an XGBoost regressor
model = xgb.XGBRegressor()

# Training the model on the training data
model.fit(X_train, y_train)

# Making predictions on the test set
predictions = model.predict(X_test)

# Calculate the mean squared error and R-squared score
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

print("Mean Squared Error:", mse)
print("R-squared Score:", r2)
Code explanation
Line 1–2: Firstly, we import the necessary modules: the xgb module, and fetch_california_housing from scikit-learn's datasets module to load the California housing dataset.
Line 3–4: Next, we import train_test_split from scikit-learn's model_selection module to split the dataset into training and test sets, and mean_squared_error and r2_score from the metrics module to check errors and scores.
Line 7: Now, we fetch the California housing dataset and store it in the data variable.
Line 8: We separate the features X and target labels y from the loaded dataset in this line.
Line 11: Here, we split the data into training and test sets using train_test_split. It takes the features X and target labels y as input and splits them. The test set size is 0.2, which makes up 20% of the whole dataset, and the random state is 42 to provide consistency.
Line 14: We create an instance of the XGBoost regressor using xgb.XGBRegressor() with default hyperparameters.
Line 17: Here, we train the model on the training data using the fit method.
Line 20: Next, we predict target labels on the test set X_test using our trained model and the predict method.
Line 23–24: Moving on, we calculate the mean squared error and R-squared score to evaluate the model's performance. The mean squared error is a measure of the average squared difference between the predicted and actual target values; a lower MSE indicates that the model's predictions are closer to the true values, signifying better performance. The R-squared (R2) score is a statistical measure representing the proportion of the variance in the target variable that is predictable from the input features; it ranges from 0 to 1, where a score of 1 indicates a perfect fit.
Line 26–27: Finally, we print the model's mean squared error and R-squared score on the console.
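To make these two metrics concrete, they can also be computed directly from their definitions. The short sketch below mirrors what mean_squared_error and r2_score calculate, using a small hypothetical set of actual and predicted values chosen only for illustration.

import numpy as np

# Hypothetical actual and predicted values (for illustration only)
y_true = np.array([2.5, 1.8, 3.2, 2.0])
y_pred = np.array([2.4, 2.0, 3.0, 1.9])

# MSE: average squared difference between predictions and actual values
mse = np.mean((y_true - y_pred) ** 2)

# R2: 1 minus the ratio of residual variance to total variance
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MSE:", mse)  # about 0.025
print("R2:", r2)    # about 0.91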
Output
Upon execution, the code will show the mean squared error and R-squared score to evaluate the model's performance.
The output looks something like this:
Mean Squared Error: 0.22458289556216388
R-squared Score: 0.828616180679985
In the above example, the calculated MSE is around 0.22, indicating that the XGBoost regressor's predictions are reasonably close to the actual values.
The R2 score of about 0.829 shows that the XGBoost regressor explains roughly 83% of the variation in the target variable, indicating a good fit.
Let's further improve the performance of the XGBoost model with parameter tuning. For example, setting the max_depth and n_estimators parameters explicitly, as shown below, improves the model's performance on this dataset.
# Creating an XGBoost regressor
model = xgb.XGBRegressor(max_depth=4, n_estimators=500)

# Training the model on the training data
model.fit(X_train, y_train)

# Making predictions on the test set
predictions = model.predict(X_test)

# Calculate the mean squared error and R-squared score
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

print("Mean Squared Error:", mse)
print("R-squared Score:", r2)
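Rather than picking values by hand, the search can also be automated. The sketch below uses scikit-learn's GridSearchCV to try a few combinations of max_depth and n_estimators with cross-validation; the specific grid values are illustrative assumptions, not recommended settings.

from sklearn.model_selection import GridSearchCV

# Illustrative search space (assumed values, not tuned recommendations)
param_grid = {
    "max_depth": [3, 4, 6],
    "n_estimators": [100, 300, 500],
}

search = GridSearchCV(
    estimator=xgb.XGBRegressor(objective='reg:squarederror'),
    param_grid=param_grid,
    scoring="neg_mean_squared_error",  # scikit-learn negates MSE so higher is better
    cv=3,
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best CV MSE:", -search.best_score_)

The best estimator found by the search can then be evaluated on the held-out test set exactly as before.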
Conclusion
In conclusion, XGBoost is a widely used framework for regression problems. Its ability to handle complex datasets, together with its efficient gradient-boosting implementation, makes it well suited to models that predict continuous numerical values accurately. Its ongoing development keeps XGBoost among the leading regression approaches, making it a valuable tool for regression analysis in machine learning.