Ensemble methods in Python: Bagging
Ensemble methods in machine learning combine multiple models to improve overall performance. This approach is particularly effective when individual models have limitations or biases. One prominent ensemble technique is bagging (short for bootstrap aggregating).
Bagging aims to reduce overfitting and variance by training several copies of a base model on bootstrapped subsets of the training data (sampled with replacement) and aggregating their predictions, typically by majority vote for classification or averaging for regression.
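To build intuition before using scikit-learn's built-in classes, here is a minimal from-scratch sketch of that idea (the bagging_predict helper below is hypothetical, written only for illustration): each model trains on a bootstrap sample of the data, and the ensemble predicts by majority vote.

import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def bagging_predict(base_estimator, X_train, y_train, X_test, n_models=10, seed=0):
    # Train n_models clones of base_estimator on bootstrap samples
    # and return the majority-vote prediction for each row of X_test.
    rng = np.random.default_rng(seed)
    per_model_preds = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X_train), size=len(X_train))  # sample rows with replacement
        model = clone(base_estimator).fit(X_train[idx], y_train[idx])
        per_model_preds.append(model.predict(X_test))
    votes = np.array(per_model_preds)  # shape: (n_models, n_test_points)
    # Majority vote down each column, i.e., across the models
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

# Usage, e.g.: bagging_predict(DecisionTreeClassifier(), X_train, y_train, X_test)

In practice, you would use BaggingClassifier, which implements this loop (plus more options) for you, as the steps below show.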
How to implement bagging using Python
Follow the steps below to implement the bagging algorithm in Python:
1. Import the libraries
The first step is to import the required libraries, as shown in the code below:
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
2. Load the dataset
The next step is to load the dataset. We’ll use the breast cancer dataset provided by the sklearn library. This dataset consists of 30 features. The target variable is the diagnosis, where 0 represents malignant and 1 represents benign tumors. The train_test_split function divides the dataset into training and testing data.
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, test_size=0.2, random_state=10)
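If you want to sanity-check the data before modeling, a quick optional inspection confirms the shapes and the label encoding:

print(cancer.data.shape)            # (569, 30): 569 samples, 30 features
print(cancer.target_names)          # ['malignant' 'benign'], encoded as 0 and 1
print(X_train.shape, X_test.shape)  # roughly an 80/20 split of the rows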
3. Define the base model
The next step is to choose the base model. We’ll use the random forest classifier for this example. The n_estimators parameter dictates the number of trees in the forest, and random_state ensures reproducibility. Adjusting hyperparameters like n_estimators, max_depth, and max_features allows fine-tuning the model's performance; a tuning sketch follows the code below.
base_model = RandomForestClassifier(n_estimators=10, max_depth=3, max_features='sqrt', random_state=42) # You can adjust hyperparameters
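As one way to explore those hyperparameters, a small grid search can pick the best combination by cross-validation. This is a sketch; the grid values below are arbitrary choices, not recommendations:

from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [3, 5, None],
    'max_features': ['sqrt', 'log2'],
}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)  # the best combination found on the training folds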
4. Implement bagging
We will now create an instance of the BaggingClassifier and fit it on the training data. The first argument specifies the underlying model to be used, while n_estimators determines the number of base models in the ensemble. The random_state parameter ensures reproducibility by seeding the random number generation.
bagging_model = BaggingClassifier(base_model, n_estimators=50, random_state=20)
bagging_model.fit(X_train, y_train)
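BaggingClassifier also exposes parameters that control the bootstrap itself. For example, max_samples sets the share of training rows drawn for each base model, and oob_score=True evaluates every model on the rows it never saw during fitting (its out-of-bag samples). A minimal sketch, reusing base_model from above:

bagging_oob = BaggingClassifier(
    base_model,
    n_estimators=50,
    max_samples=0.8,   # each base model trains on a bootstrap sample of 80% of the rows
    oob_score=True,    # score each model on its out-of-bag rows
    random_state=20,
)
bagging_oob.fit(X_train, y_train)
print("OOB score: {:.2f}".format(bagging_oob.oob_score_))

Note that we pass the base model positionally: recent scikit-learn releases name this parameter estimator (older releases used base_estimator), so the positional form works across versions.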
5. Predict and evaluate
Now, we will make predictions on the test set and calculate the accuracy.
y_pred = bagging_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy * 100))
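To check whether bagging actually helps on this dataset, you can also compare the base model alone against the ensemble under the same cross-validation. This is a sketch; the exact numbers will vary with the split:

from sklearn.model_selection import cross_val_score

base_scores = cross_val_score(base_model, X_train, y_train, cv=5)
bag_scores = cross_val_score(bagging_model, X_train, y_train, cv=5)
print("Base model CV accuracy: {:.3f}".format(base_scores.mean()))
print("Bagged model CV accuracy: {:.3f}".format(bag_scores.mean()))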
Example
The following code shows how we can implement the bagging ensemble classifier in Python:
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score

# Load and split the data
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, test_size=0.2, random_state=10)

# Use RandomForestClassifier with max_features='sqrt' for randomness
base_model = RandomForestClassifier(n_estimators=10, max_depth=3, max_features='sqrt', random_state=42)

# Implement bagging with the random forest base model
bagging_model = BaggingClassifier(base_model, n_estimators=50, random_state=20)
bagging_model.fit(X_train, y_train)

# Predict and evaluate
y_pred = bagging_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy * 100))
Explanation:
Lines 1–4: These lines import the required libraries.
Line 7: This line loads the dataset from sklearn and stores it in the cancer variable.
Line 8: This line splits the dataset into training and testing sets.
Line 11: We define the RandomForestClassifier as the base model for bagging.
Lines 14–15: Here, we create a BaggingClassifier with 50 base models and fit the bagging model on the training data. The BaggingClassifier handles the bootstrap sampling internally when fitting the model.
Line 18: The trained model is used to make predictions on the test data.
Lines 19–20: The code calculates the accuracy of the model’s predictions by comparing them to the true labels in the test set. The accuracy is printed as a percentage.
Unlock your potential: Ensemble learning series, all in one place!
To continue your exploration of ensemble learning, check out our series of Answers below:
What is ensemble learning?
Understand the concept of combining multiple models to improve predictions.

Ensemble methods in Python: Averaging
Learn how averaging methods can boost model accuracy and stability.

Ensemble methods in Python: Bagging
Discover the power of bagging in reducing variance and enhancing prediction performance.

Ensemble methods in Python: Boosting
Dive into boosting techniques that improve weak models by focusing on mistakes.

Ensemble methods in Python: Stacking
Understand how stacking combines multiple models to make better predictions.

Ensemble methods in Python: Max voting
Explore the max voting method to combine classifier predictions and increase accuracy.