Ensemble Methods
Explore two powerful ensemble techniques—bagging and boosting—which improve model performance by combining multiple weak learners.
When a single model isn’t quite enough, ensemble methods step in. Bagging reduces variance by combining models trained on random subsets of data, while boosting focuses on learning from mistakes to improve accuracy. In this lesson, we’ll get hands-on with both strategies and understand how they power some of the best models in machine learning. Let’s begin.
Implement a random forest classifier
You’re asked to implement an ensemble classifier from scratch using bagging principles. Your task is to build a simplified Random Forest that aggregates predictions from multiple decision trees trained on bootstrapped samples of the training set. Use an object-oriented approach to implement the classifier, train it on a dataset, and make predictions.
This question is frequently asked in system design and ML interview rounds to evaluate object-oriented design skills and understanding of ensemble strategies.
Sample answer
Ensemble methods combine multiple machine learning models to improve prediction performance beyond what could be achieved by any single model. These techniques are powerful tools that can significantly enhance the accuracy, stability, and robustness of predictive models.
Ensemble methods work by training multiple models and combining their predictions. The key insight behind ensemble learning is that different models capture different aspects of the data, and by combining them, we can create a more comprehensive and accurate predictive system. Ensemble methods generally fall into two categories: bagging (parallel ensemble) and boosting (sequential ensemble).
Bagging (bootstrap aggregating) creates multiple versions of a model trained on different bootstrap samples of the data and combines their predictions through voting or averaging. Random forest is one of the most popular bagging techniques.
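To make the idea concrete before building anything from scratch, here is a minimal sketch, assuming scikit-learn is installed, that bags shallow decision trees with the library's general-purpose BaggingClassifier; the dataset and parameter values are arbitrary and chosen only for illustration.

```python
# Illustrative only: bagging shallow decision trees with scikit-learn,
# which is conceptually a stripped-down random forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 25 depth-3 trees, each fit on a bootstrap sample (bootstrap=True is the default)
bagged = BaggingClassifier(DecisionTreeClassifier(max_depth=3), n_estimators=25, random_state=0)
bagged.fit(X_train, y_train)
print(f"Bagged trees test accuracy: {bagged.score(X_test, y_test):.3f}")
```

Random forest adds one further ingredient on top of plain bagging: each split also considers only a random subset of features, which decorrelates the trees even more.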
Breakdown of the implementation process:
Bootstrap sampling: Each tree is trained on a sample of the data drawn with replacement (a bootstrap sample), the same size as the original training set. This introduces diversity among the trees by exposing them to different training examples (see the short standalone sketch after this list).
Decision stumps as weak learners: Our implementation uses simple one-level decision trees (stumps) that make decisions based on a single feature and threshold. While full random forests typically grow deeper trees, stumps demonstrate the concept and keep the computation cheap.
Finding the best split: Each stump examines every feature and every candidate threshold and keeps the split that minimizes the (weighted) classification error. This is a simpler criterion than the impurity measures (such as Gini impurity) used in production implementations.
Majority voting: The final prediction combines the predictions from all trees through majority voting. This ensemble approach reduces overfitting and improves generalization compared to a single decision tree.
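Before the full implementation, these mechanics can be sketched in isolation. The snippet below is illustrative only; the arrays and labels are made up to show the shape of the computation, not taken from the task's dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([1, -1, 1, 1, -1, 1])

# Bootstrap sampling: draw len(y) indices with replacement
bootstrap_indices = rng.choice(len(y), size=len(y), replace=True)
print("Bootstrap sample of labels:", y[bootstrap_indices])

# Majority voting: sum the per-tree votes for three test points and take the sign
tree_predictions = np.array([[ 1,  1, -1],   # tree 1
                             [ 1, -1, -1],   # tree 2
                             [-1,  1, -1]])  # tree 3
print("Majority vote:", np.sign(tree_predictions.sum(axis=0)))  # [ 1  1 -1]

# Gini impurity, the split criterion production trees typically minimize
def gini_impurity(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print("Gini of [1, 1, -1, -1]:", gini_impurity(np.array([1, 1, -1, -1])))  # 0.5
```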
Let’s see a sample implementation below:
```python
import numpy as np


class DecisionStump:
    """A simple decision tree stump (one-level decision tree) for binary classification."""

    def __init__(self):
        self.feature_index = None
        self.threshold = None
        self.left_label = None
        self.right_label = None

    def fit(self, X, y, sample_weights=None):
        # Initialize default sample weights if none provided
        if sample_weights is None:
            sample_weights = np.ones(len(y)) / len(y)

        n_samples, n_features = X.shape
        best_error = float('inf')

        # For each feature
        for feature_idx in range(n_features):
            # Get unique values for this feature
            feature_values = X[:, feature_idx]
            thresholds = np.unique(feature_values)

            # Skip if only one unique value (no meaningful split)
            if len(thresholds) < 2:
                continue

            # For each possible threshold value
            for threshold in thresholds:
                left_idx = X[:, feature_idx] <= threshold

                # Try both label assignments for the two sides of the split
                for left_label, right_label in [(-1, 1), (1, -1)]:
                    y_pred = np.ones(n_samples) * right_label
                    y_pred[left_idx] = left_label

                    # Calculate weighted error
                    error = np.sum(sample_weights[y_pred != y])

                    # If we found a better split, update our model
                    if error < best_error:
                        best_error = error
                        self.feature_index = feature_idx
                        self.threshold = threshold
                        self.left_label = left_label
                        self.right_label = right_label

        return best_error

    def predict(self, X):
        n_samples = X.shape[0]
        y_pred = np.ones(n_samples) * self.right_label
        left_idx = X[:, self.feature_index] <= self.threshold
        y_pred[left_idx] = self.left_label
        return y_pred


class RandomForestClassifier:
    """A simplified Random Forest classifier using decision stumps and bagging."""

    def __init__(self, n_trees=10):
        self.n_trees = n_trees
        self.trees = []

    def fit(self, X, y):
        n_samples = X.shape[0]

        # Create n_trees decision stumps on bootstrap samples
        for _ in range(self.n_trees):
            # Create bootstrap sample (n_samples indices drawn with replacement)
            bootstrap_indices = np.random.choice(n_samples, n_samples, replace=True)
            X_bootstrap = X[bootstrap_indices]
            y_bootstrap = y[bootstrap_indices]

            # Create and train a decision stump on the bootstrap sample
            tree = DecisionStump()
            tree.fit(X_bootstrap, y_bootstrap)
            self.trees.append(tree)

    def predict(self, X):
        # Get predictions from all trees
        tree_predictions = np.array([tree.predict(X) for tree in self.trees])

        # Majority voting via the sign of the summed votes (ties default to +1)
        votes = np.sum(tree_predictions, axis=0)
        return np.where(votes >= 0, 1, -1)


# Demonstration
def random_forest_demo():
    # Generate a non-linearly separable dataset
    from sklearn.datasets import make_classification
    X, y = make_classification(
        n_samples=100,
        n_features=2,
        n_informative=2,
        n_redundant=0,
        n_clusters_per_class=2,
        random_state=42,
    )

    # Convert to binary labels -1, 1
    y = np.where(y == 0, -1, 1)

    # Split the data into training and testing sets
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Create and train the random forest classifier
    rf = RandomForestClassifier(n_trees=20)
    rf.fit(X_train, y_train)

    # Make predictions and calculate accuracy
    y_pred = rf.predict(X_test)
    accuracy = np.mean(y_pred == y_test)
    print(f"Random Forest Accuracy: {accuracy:.4f}")

    # Visualize results
    import matplotlib.pyplot as plt

    # Create a mesh grid for visualization
    h = 0.02  # step size in the mesh
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

    # Predict on the mesh grid
    Z = rf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # Plot decision boundary and test points
    plt.contourf(xx, yy, Z, alpha=0.3)
    plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, edgecolors='k')
    plt.title(f"Random Forest Decision Boundary (Accuracy: {accuracy:.4f})")

    # Ensure the output directory exists before saving the figure
    import os
    os.makedirs('output', exist_ok=True)
    plt.savefig('output/graph.png', dpi=300)


if __name__ == "__main__":
    random_forest_demo()
```
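As an optional sanity check (not part of the original demo), one could add a few lines inside random_forest_demo(), right after rf.fit(X_train, y_train), to compare the bagged ensemble against a single stump trained on the same split:

```python
# Hypothetical addition inside random_forest_demo(), after rf.fit(X_train, y_train):
# compare one stump trained on the full training split against the bagged ensemble.
stump = DecisionStump()
stump.fit(X_train, y_train)
stump_accuracy = np.mean(stump.predict(X_test) == y_test)
print(f"Single stump accuracy: {stump_accuracy:.4f}")
print(f"Random forest accuracy: {np.mean(rf.predict(X_test) == y_test):.4f}")
```

On most random splits the ensemble should match or beat the lone stump, which is exactly the variance reduction that bagging is meant to deliver.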
...