Ensemble Methods

Explore two powerful ensemble techniques—bagging and boosting—that improve model performance by combining the predictions of many base learners.

When a single model isn’t quite enough, ensemble methods step in. Bagging reduces variance by combining models trained on random subsets of data, while boosting focuses on learning from mistakes to improve accuracy. In this lesson, we’ll get hands-on with both strategies and understand how they power some of the best models in machine learning. Let’s begin.

Implement a random forest classifier

You’re asked to implement an ensemble classifier from scratch using bagging principles. Your task is to build a simplified Random Forest that aggregates predictions from multiple decision trees trained on bootstrapped samples of the training set. Use an object-oriented approach to implement the classifier, train it on a dataset, and make predictions.

This question is frequently asked in system design and ML interview rounds to evaluate object-oriented design skills and understanding of ensemble strategies.

Sample answer

Ensemble methods combine multiple machine learning models to improve prediction performance beyond what could be achieved by any single model. These techniques are powerful tools that can significantly enhance the accuracy, stability, and robustness of predictive models.

Ensemble methods work by training multiple models and combining their predictions. The key insight behind ensemble learning is that different models capture different aspects of the data, and by combining them, we can create a more comprehensive and accurate predictive system. Ensemble methods generally fall into two categories: bagging (parallel ensemble) and boosting (sequential ensemble).
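As a quick point of reference, scikit-learn ships off-the-shelf implementations of both families, which makes the parallel-versus-sequential distinction easy to see in code. The snippet below is a minimal sketch; the dataset size, estimator counts, and variable names are illustrative choices, not part of the task above.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score

# Toy dataset (values chosen only for illustration)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Bagging: independent trees fit in parallel on bootstrap samples, then combined by voting
bagging = BaggingClassifier(n_estimators=25, random_state=0)

# Boosting: trees fit sequentially, each focusing on the mistakes of its predecessors
boosting = AdaBoostClassifier(n_estimators=25, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")

Swapping one estimator for the other changes only how the base trees are trained and combined; the fit/predict interface stays the same.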

Bagging (bootstrap aggregating) trains multiple versions of a model on different bootstrap samples of the data and combines their predictions through voting (for classification) or averaging (for regression). Random forest is one of the most popular bagging techniques; the full algorithm also samples a random subset of features at each split, a step our simplified implementation below omits.
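Here is a minimal NumPy-only sketch of the two mechanics just described, bootstrap sampling and vote aggregation; the sample count and the ±1 prediction matrix are made-up values used purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
n_samples = 8

# Bootstrap sample: draw n_samples indices *with replacement*; on average only
# about 63% of the original rows appear, and some of them appear more than once
bootstrap_indices = rng.choice(n_samples, size=n_samples, replace=True)
print("bootstrap indices:", bootstrap_indices)

# Aggregation: suppose three already-trained models produced +/-1 predictions for 4 points
predictions = np.array([
    [ 1, -1,  1,  1],   # model 1
    [ 1,  1, -1,  1],   # model 2
    [-1, -1,  1,  1],   # model 3
])

# Majority vote = sign of the column-wise sum of votes
print("ensemble vote:", np.sign(predictions.sum(axis=0)))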

Breakdown of the implementation process:

  1. Bootstrap sampling: Each tree is trained on a randomly sampled subset of the data with replacement (bootstrap sample). This introduces diversity among the trees by exposing them to different training examples.

  2. Decision stumps as weak learners: Our implementation uses simple one-level decision trees (stumps) that make decisions based on a single feature and threshold. While full random forests typically use deeper trees, stumps demonstrate the concept while being computationally efficient.

  3. Finding the best split: Each stump examines every feature and possible threshold to find the split that minimizes classification error. This is a simplified stand-in for the impurity measures (like Gini impurity) used in production implementations; a short sketch of the Gini computation follows this list.

  4. Majority voting: The final prediction combines the predictions from all trees through majority voting. This ensemble approach reduces overfitting and improves generalization compared to a single decision tree.
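As promised in item 3, here is a short sketch of how Gini impurity scores a candidate split. The helper name gini_impurity and the example label arrays are illustrative, not part of the required implementation.

import numpy as np

def gini_impurity(labels):
    """Gini impurity of a set of class labels: 1 minus the sum of squared class probabilities."""
    _, counts = np.unique(labels, return_counts=True)
    probabilities = counts / counts.sum()
    return 1.0 - np.sum(probabilities ** 2)

# A pure node has impurity 0; a perfectly mixed binary node has impurity 0.5
print(gini_impurity(np.array([1, 1, 1, 1])))    # 0.0
print(gini_impurity(np.array([1, -1, 1, -1])))  # 0.5

# A candidate split is scored by the size-weighted impurity of its two children
left, right = np.array([1, 1, -1]), np.array([-1, -1, -1, 1])
n_total = len(left) + len(right)
weighted = (len(left) / n_total) * gini_impurity(left) + (len(right) / n_total) * gini_impurity(right)
print(weighted)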

Let’s see a sample implementation below:

import numpy as np


class DecisionStump:
    """
    A simple decision tree stump (one-level decision tree) for binary classification.
    """
    def __init__(self):
        self.feature_index = None
        self.threshold = None
        self.left_label = None
        self.right_label = None

    def fit(self, X, y, sample_weights=None):
        # Initialize uniform sample weights if none are provided
        if sample_weights is None:
            sample_weights = np.ones(len(y)) / len(y)

        n_samples, n_features = X.shape
        best_error = float('inf')

        # For each feature
        for feature_idx in range(n_features):
            # Candidate thresholds are the unique values of this feature
            feature_values = X[:, feature_idx]
            thresholds = np.unique(feature_values)

            # Skip features with a single unique value (no useful split)
            if len(thresholds) < 2:
                continue

            # For each possible threshold value
            for threshold in thresholds:
                left_idx = feature_values <= threshold

                # Try both label assignments for the two sides of the split
                for left_label, right_label in [(-1, 1), (1, -1)]:
                    y_pred = np.full(n_samples, right_label)
                    y_pred[left_idx] = left_label

                    # Calculate the weighted classification error
                    error = np.sum(sample_weights[y_pred != y])

                    # If we found a better split, update the stump
                    if error < best_error:
                        best_error = error
                        self.feature_index = feature_idx
                        self.threshold = threshold
                        self.left_label = left_label
                        self.right_label = right_label

        return best_error

    def predict(self, X):
        y_pred = np.full(X.shape[0], self.right_label)
        left_idx = X[:, self.feature_index] <= self.threshold
        y_pred[left_idx] = self.left_label
        return y_pred


class RandomForestClassifier:
    """
    A simplified Random Forest classifier using decision stumps and bagging.
    """
    def __init__(self, n_trees=10):
        self.n_trees = n_trees
        self.trees = []

    def fit(self, X, y):
        n_samples = X.shape[0]
        self.trees = []

        # Train n_trees decision stumps, each on its own bootstrap sample
        for _ in range(self.n_trees):
            bootstrap_indices = np.random.choice(n_samples, n_samples, replace=True)
            X_bootstrap = X[bootstrap_indices]
            y_bootstrap = y[bootstrap_indices]

            tree = DecisionStump()
            tree.fit(X_bootstrap, y_bootstrap)
            self.trees.append(tree)

    def predict(self, X):
        # Get predictions from all trees
        tree_predictions = np.array([tree.predict(X) for tree in self.trees])

        # Majority voting via the sign of the summed votes (vectorized; no Counter needed)
        votes = np.sign(np.sum(tree_predictions, axis=0))

        # Break ties (vote sum of exactly 0) in favor of the positive class
        votes[votes == 0] = 1
        return votes


# Demonstration
def random_forest_demo():
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    import matplotlib.pyplot as plt

    # Generate a non-linearly separable dataset
    X, y = make_classification(
        n_samples=100,
        n_features=2,
        n_informative=2,
        n_redundant=0,
        n_clusters_per_class=2,
        random_state=42
    )

    # Convert labels to -1 / +1
    y = np.where(y == 0, -1, 1)

    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Create and train the random forest classifier
    rf = RandomForestClassifier(n_trees=20)
    rf.fit(X_train, y_train)

    # Make predictions and calculate accuracy
    y_pred = rf.predict(X_test)
    accuracy = np.mean(y_pred == y_test)
    print(f"Random Forest Accuracy: {accuracy:.4f}")

    # Create a mesh grid for visualizing the decision boundary
    h = 0.02  # step size in the mesh
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

    # Predict on the mesh grid and plot the decision regions
    Z = rf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.3)

    # Overlay the test points on the decision boundary
    plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, edgecolors='k')
    plt.title(f"Random Forest Decision Boundary (Accuracy: {accuracy:.4f})")
    plt.savefig('output/graph.png', dpi=300)


if __name__ == "__main__":
    random_forest_demo()
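
As an optional sanity check that is not part of the sample answer above, you could run scikit-learn's production random forest on the same dataset recipe and split. The SkRF alias avoids a name clash with the class we just defined, and max_depth=1 is set only to mirror our depth-1 stumps.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier as SkRF  # alias avoids clashing with our class
from sklearn.model_selection import train_test_split

# Recreate the demo's dataset recipe
X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=2, random_state=42)
y = np.where(y == 0, -1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Depth-1 trees mirror our stumps; drop max_depth to let the trees grow fully
sk_rf = SkRF(n_estimators=20, max_depth=1, random_state=42)
sk_rf.fit(X_train, y_train)
print(f"scikit-learn accuracy: {sk_rf.score(X_test, y_test):.4f}")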

...