Ensemble methods in Python: Boosting

Ensemble learning is a machine learning technique that combines the predictions of multiple models to create a more accurate and robust overall prediction.

Boosting is an ensemble learning technique that aims to improve predictive performance by combining the strengths of multiple weak learners (also called base models). A weak learner is a model that performs slightly better than random chance but lacks high accuracy on its own. Unlike bagging (bootstrap aggregating), which builds independent models in parallel on different bootstrap samples of the dataset and combines their predictions to reduce overfitting, boosting builds its models sequentially. Each subsequent model focuses on correcting the errors of the previous ones, leading to a more accurate and robust overall model.
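To make the sequential idea concrete, here is a minimal sketch of boosting for regression with squared error, where each shallow tree is fit to the residuals left by the ensemble so far. The synthetic data and the names `learning_rate`, `n_rounds`, and `mse_per_round` are illustrative assumptions, and this is a simplified picture rather than the exact procedure a library implements.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data (illustrative only, not the article's dataset)
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1  # how much of each tree's correction is applied
n_rounds = 50        # number of weak learners trained in sequence

# Start from a constant prediction: the mean of the targets
prediction = np.full_like(y, y.mean())
trees = []
mse_per_round = []

for _ in range(n_rounds):
    residuals = y - prediction  # errors made by the current ensemble
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)      # the weak learner targets those errors
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)
    mse_per_round.append(np.mean((y - prediction) ** 2))

print("Training MSE after round 1: {:.4f}".format(mse_per_round[0]))
print("Training MSE after round {}: {:.4f}".format(n_rounds, mse_per_round[-1]))
```

Each round only has to model what the earlier rounds got wrong, which is why many individually weak trees can add up to a strong model.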

Boosting algorithm

How to implement boosting using Python

Follow the steps below to implement the boosting algorithm in Python:

1. Import the libraries

The first step is to import the required libraries, as shown in the code below:

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error

2. Load the dataset

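The next step is to load the dataset. We will use the Boston house-prices dataset provided by the sklearn library, which consists of 506 rows and 13 feature columns. Note that load_boston was deprecated in scikit-learn 1.0 and removed in version 1.2, so this code requires an older release. The train_test_split function then divides the dataset into training and testing data, holding out 20% of the rows for testing.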

boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.2, random_state=42)
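Newer scikit-learn releases (1.2 and later) no longer include load_boston. As a hedged sketch, the same loading and splitting step can be tried with load_diabetes, a small regression dataset bundled with scikit-learn. The choice of stand-in dataset is an assumption made here for runnability, not part of the original walkthrough.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# Assumption: load_diabetes stands in for the Boston data, which newer
# scikit-learn releases no longer ship
diabetes = load_diabetes()

# Same split settings as in the walkthrough: 20% test data, fixed seed
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, test_size=0.2, random_state=42
)

print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)
```

The rest of the steps work unchanged on these arrays, although the resulting error values will differ because the target is a different quantity.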

3. Implement boosting

We will now create an instance of GradientBoostingRegressor and fit it to the training data. The n_estimators parameter sets the number of boosting stages, that is, the number of trees added sequentially, and random_state ensures reproducibility. Adjusting hyperparameters like n_estimators, max_depth, and learning_rate allows fine-tuning the model’s performance.

gradient_boosting_model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gradient_boosting_model.fit(X_train, y_train)
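One way to see how n_estimators and learning_rate interact is to track the test error after every boosting stage with staged_predict. The sketch below uses synthetic data from make_regression so that it runs on any recent scikit-learn version; the dataset, the candidate learning rates, and the `results` dictionary are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (illustrative stand-in for a real dataset)
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

results = {}
for learning_rate in (0.05, 0.1, 0.5):
    model = GradientBoostingRegressor(
        n_estimators=100, learning_rate=learning_rate, max_depth=3, random_state=42
    )
    model.fit(X_train, y_train)

    # staged_predict yields the ensemble's prediction after each stage
    staged_mse = [
        mean_squared_error(y_test, y_stage)
        for y_stage in model.staged_predict(X_test)
    ]
    best_stage = int(np.argmin(staged_mse)) + 1  # stages are counted from 1
    results[learning_rate] = (best_stage, staged_mse)
    print("learning_rate={}: lowest test MSE {:.2f} at stage {}".format(
        learning_rate, min(staged_mse), best_stage))
```

A smaller learning rate generally needs more stages to reach its lowest error, so the two parameters are usually tuned together.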

4. Predict and evaluate

Now, we will make predictions on the test set and calculate the mean squared error (MSE).

y_pred = gradient_boosting_model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error: {:.2f}".format(mse))
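MSE is expressed in squared units of the target, so it can be hard to interpret on its own. As a small sketch, the root mean squared error (RMSE) brings the value back to the target's units, and the R² score gives a unit-free measure of fit. The arrays `y_true` and `y_hat` below are hypothetical values used only to show the calls.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical target values and predictions, for illustration only
y_true = np.array([24.0, 21.6, 34.7, 33.4, 36.2])
y_hat = np.array([25.1, 20.9, 33.0, 35.0, 34.8])

mse = mean_squared_error(y_true, y_hat)  # average squared error
rmse = np.sqrt(mse)                      # same units as the target
r2 = r2_score(y_true, y_hat)             # 1.0 would be a perfect fit

print("MSE:  {:.2f}".format(mse))
print("RMSE: {:.2f}".format(rmse))
print("R^2:  {:.2f}".format(r2))
```

In the walkthrough, the same calls would take y_test and y_pred as arguments.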

Example

The following code shows how we can implement the gradient boosting regressor in Python:

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error
# Load and split the data
boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.2, random_state=42)
# Implement gradient boosting regressor
gradient_boosting_model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gradient_boosting_model.fit(X_train, y_train)
# Predict and evaluate
y_pred = gradient_boosting_model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error: {:.2f}".format(mse))

Explanation

  • Lines 1–4: These lines import the required libraries.

  • Line 6: This line loads the Boston dataset from sklearn and stores it in the boston variable.

  • Line 7: This line splits the dataset into training and test sets.

  • Lines 9–10: Here, we create a GradientBoostingRegressor with 100 base models and fit it on the training data.

  • Line 12: The trained model is used to make predictions on the test data.

  • Line 13: The code calculates the mean_squared_error of the model’s predictions by comparing them to the true target values in the test set.

  • Line 14: This line prints the mean squared error between the actual and predicted housing prices, formatted to two decimal places, providing a measure of the model’s performance.

Copyright ©2024 Educative, Inc. All rights reserved