Ensemble Methods
Explore two powerful ensemble techniques—bagging and boosting—which improve model performance by combining multiple weak learners.
When a single model isn’t quite enough, ensemble methods step in. Bagging reduces variance by combining models trained on random subsets of data, while boosting focuses on learning from mistakes to improve accuracy. In this lesson, we’ll get hands-on with both strategies and understand how they power some of the best models in machine learning. Let’s begin.
Implement a random forest classifier
You’re asked to implement an ensemble classifier from scratch using bagging principles. Your task is to build a simplified Random Forest that aggregates predictions from multiple decision trees trained on bootstrapped samples of the training set. Use an object-oriented approach to implement the classifier, train it on a dataset, and make predictions.
This question is frequently asked in system design and ML interview rounds to evaluate object-oriented design skills and understanding of ensemble strategies.
Sample answer
Ensemble methods combine multiple machine learning models to improve prediction performance beyond what could be achieved by any single model. These techniques are powerful tools that can significantly enhance the accuracy, stability, and robustness of predictive models.
Ensemble methods work by training multiple models and combining their predictions. The key insight behind ensemble learning is that different models capture different aspects of the data, and by combining them, we can create a more comprehensive and accurate predictive system. Ensemble methods generally fall into two categories: bagging (parallel ensemble) and boosting (sequential ensemble).
Bagging (bootstrap aggregating) creates multiple versions of a model trained on different bootstrap samples of the data and combines their predictions through voting or averaging. Random forest is one of the most popular bagging techniques.
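To make the idea concrete before building anything from scratch, here is a minimal sketch, assuming scikit-learn is installed, that bags shallow decision trees with the library's general-purpose BaggingClassifier; the dataset and parameter values are arbitrary and chosen only for illustration.

```python
# Illustrative only: bagging shallow decision trees with scikit-learn,
# which is conceptually a stripped-down random forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 25 depth-3 trees, each fit on a bootstrap sample (bootstrap=True is the default)
bagged = BaggingClassifier(DecisionTreeClassifier(max_depth=3), n_estimators=25, random_state=0)
bagged.fit(X_train, y_train)
print(f"Bagged trees test accuracy: {bagged.score(X_test, y_test):.3f}")
```

Random forest adds one further ingredient on top of plain bagging: each split also considers only a random subset of features, which decorrelates the trees even more.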
Breakdown of the implementation process:
Bootstrap sampling: Each tree is trained on a sample of the data drawn with replacement (a bootstrap sample), the same size as the original training set. This introduces diversity among the trees by exposing them to different training examples (see the short standalone sketch after this list).
Decision stumps as weak learners: Our implementation uses simple one-level decision trees (stumps) that make decisions based on a single feature and threshold. While full random forests typically grow deeper trees, stumps demonstrate the concept and keep the computation cheap.
Finding the best split: Each stump examines every feature and every candidate threshold and keeps the split that minimizes the (weighted) classification error. This is a simpler criterion than the impurity measures (such as Gini impurity) used in production implementations.
Majority voting: The final prediction combines the predictions from all trees through majority voting. This ensemble approach reduces overfitting and improves generalization compared to a single decision tree.
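Before the full implementation, these mechanics can be sketched in isolation. The snippet below is illustrative only; the arrays and labels are made up to show the shape of the computation, not taken from the task's dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([1, -1, 1, 1, -1, 1])

# Bootstrap sampling: draw len(y) indices with replacement
bootstrap_indices = rng.choice(len(y), size=len(y), replace=True)
print("Bootstrap sample of labels:", y[bootstrap_indices])

# Majority voting: sum the per-tree votes for three test points and take the sign
tree_predictions = np.array([[ 1,  1, -1],   # tree 1
                             [ 1, -1, -1],   # tree 2
                             [-1,  1, -1]])  # tree 3
print("Majority vote:", np.sign(tree_predictions.sum(axis=0)))  # [ 1  1 -1]

# Gini impurity, the split criterion production trees typically minimize
def gini_impurity(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print("Gini of [1, 1, -1, -1]:", gini_impurity(np.array([1, 1, -1, -1])))  # 0.5
```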
Let’s see a sample implementation below:
```python
import numpy as np


class DecisionStump:
    """A simple decision tree stump (one-level decision tree) for binary classification."""

    def __init__(self):
        self.feature_index = None
        self.threshold = None
        self.left_label = None
        self.right_label = None

    def fit(self, X, y, sample_weights=None):
        # Initialize default sample weights if none provided
        if sample_weights is None:
            sample_weights = np.ones(len(y)) / len(y)

        n_samples, n_features = X.shape
        best_error = float('inf')

        # For each feature
        for feature_idx in range(n_features):
            # Get unique values for this feature
            feature_values = X[:, feature_idx]
            thresholds = np.unique(feature_values)

            # Skip if only one unique value (no meaningful split)
            if len(thresholds) < 2:
                continue

            # For each possible threshold value
            for threshold in thresholds:
                left_idx = X[:, feature_idx] <= threshold

                # Try both label assignments for the two sides of the split
                for left_label, right_label in [(-1, 1), (1, -1)]:
                    y_pred = np.ones(n_samples) * right_label
                    y_pred[left_idx] = left_label

                    # Calculate weighted error
                    error = np.sum(sample_weights[y_pred != y])

                    # If we found a better split, update our model
                    if error < best_error:
                        best_error = error
                        self.feature_index = feature_idx
                        self.threshold = threshold
                        self.left_label = left_label
                        self.right_label = right_label

        return best_error

    def predict(self, X):
        n_samples = X.shape[0]
        y_pred = np.ones(n_samples) * self.right_label
        left_idx = X[:, self.feature_index] <= self.threshold
        y_pred[left_idx] = self.left_label
        return y_pred


class RandomForestClassifier:
    """A simplified Random Forest classifier using decision stumps and bagging."""

    def __init__(self, n_trees=10):
        self.n_trees = n_trees
        self.trees = []

    def fit(self, X, y):
        n_samples = X.shape[0]

        # Create n_trees decision stumps on bootstrap samples
        for _ in range(self.n_trees):
            # Create bootstrap sample (n_samples indices drawn with replacement)
            bootstrap_indices = np.random.choice(n_samples, n_samples, replace=True)
            X_bootstrap = X[bootstrap_indices]
            y_bootstrap = y[bootstrap_indices]

            # Create and train a decision stump on the bootstrap sample
            tree = DecisionStump()
            tree.fit(X_bootstrap, y_bootstrap)
            self.trees.append(tree)

    def predict(self, X):
        # Get predictions from all trees
        tree_predictions = np.array([tree.predict(X) for tree in self.trees])

        # Majority voting via the sign of the summed votes (ties default to +1)
        votes = np.sum(tree_predictions, axis=0)
        return np.where(votes >= 0, 1, -1)


# Demonstration
def random_forest_demo():
    # Generate a non-linearly separable dataset
    from sklearn.datasets import make_classification
    X, y = make_classification(
        n_samples=100,
        n_features=2,
        n_informative=2,
        n_redundant=0,
        n_clusters_per_class=2,
        random_state=42,
    )

    # Convert to binary labels -1, 1
    y = np.where(y == 0, -1, 1)

    # Split the data into training and testing sets
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Create and train the random forest classifier
    rf = RandomForestClassifier(n_trees=20)
    rf.fit(X_train, y_train)

    # Make predictions and calculate accuracy
    y_pred = rf.predict(X_test)
    accuracy = np.mean(y_pred == y_test)
    print(f"Random Forest Accuracy: {accuracy:.4f}")

    # Visualize results
    import matplotlib.pyplot as plt

    # Create a mesh grid for visualization
    h = 0.02  # step size in the mesh
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

    # Predict on the mesh grid
    Z = rf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # Plot decision boundary and test points
    plt.contourf(xx, yy, Z, alpha=0.3)
    plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, edgecolors='k')
    plt.title(f"Random Forest Decision Boundary (Accuracy: {accuracy:.4f})")

    # Ensure the output directory exists before saving the figure
    import os
    os.makedirs('output', exist_ok=True)
    plt.savefig('output/graph.png', dpi=300)


if __name__ == "__main__":
    random_forest_demo()
```
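As an optional sanity check (not part of the original demo), one could add a few lines inside random_forest_demo(), right after rf.fit(X_train, y_train), to compare the bagged ensemble against a single stump trained on the same split:

```python
# Hypothetical addition inside random_forest_demo(), after rf.fit(X_train, y_train):
# compare one stump trained on the full training split against the bagged ensemble.
stump = DecisionStump()
stump.fit(X_train, y_train)
stump_accuracy = np.mean(stump.predict(X_test) == y_test)
print(f"Single stump accuracy: {stump_accuracy:.4f}")
print(f"Random forest accuracy: {np.mean(rf.predict(X_test) == y_test):.4f}")
```

On most random splits the ensemble should match or beat the lone stump, which is exactly the variance reduction that bagging is meant to deliver.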
...