
Gradient Boosting: Implementation from Scratch

Explore the step-by-step implementation of gradient boosting for regression. Understand how to initialize predictions, calculate residuals, train weak learners, and iteratively improve the model using gradient descent. This lesson guides you through building a strong ensemble model from scratch, providing insights into the role of learning rate, residuals, and iterations in boosting performance.

Gradient boosting is a machine learning technique that builds a strong predictive model by sequentially adding weak models to the ensemble.

It uses gradient descent optimization to minimize the loss function of the model. (Gradient descent is an optimization algorithm that minimizes a loss function by iteratively updating model parameters in the direction of the steepest decrease, i.e., the negative gradient.) The term “gradient” refers to the negative gradient of the loss function, which guides the learning process toward reducing errors. Gradient boosting is a versatile technique that can be used for both regression and classification tasks.

In this lesson, we’ll focus on regression and demonstrate how to solve a regression problem using two approaches.

Algorithm

Think of gradient boosting as a team of assistants, where each new assistant’s sole job is to correct the mistakes of the entire team that came before it.

  1. Make an initial guess: The process begins with a simple, initial prediction for every data point, typically using the average value of the target variable.

  2. Measure the errors (residuals): Calculate the error (or residual) for every data point by finding the difference between the actual value and the current prediction. This error is the new target that the next model must learn.

  3. Train a new, simple model: A new, weak predictive model (like a small decision tree) is trained specifically to predict the errors identified in the previous step.

  4. Update the overall prediction: The prediction from the new, error-correcting model is added to the previous overall prediction, often scaled by a small learning rate to ensure cautious, steady improvement.

  5. Repeat the cycle: Steps 2 through 4 are repeated for a specified number of iterations, with each new model focusing on correcting the remaining errors of the combined ensemble.

  6. Form the final predictor: The final, powerful prediction is the sum of the initial guess and all the sequential error-correcting models trained during the process (a compact code sketch of this loop follows below).
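Taken together, these six steps translate into one short training loop. The following is a minimal sketch of that loop for regression; the function names gradient_boost and boosted_predict, the choice of a shallow DecisionTreeRegressor as the weak learner, and the hyperparameter values are illustrative assumptions rather than the exact implementation we build later in this lesson.

Python 3.10.4
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_iterations=100, learning_rate=0.1):
    # Step 1: initial guess -- a constant prediction equal to the target mean
    initial = np.mean(y)
    prediction = np.full(len(y), initial)
    trees = []
    for _ in range(n_iterations):
        # Step 2: residuals (errors) of the current combined prediction
        residuals = y - prediction
        # Step 3: fit a weak learner (a small decision tree) to the residuals
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X, residuals)
        # Step 4: update the overall prediction, scaled by the learning rate
        prediction = prediction + learning_rate * tree.predict(X)
        trees.append(tree)
    return initial, trees

def boosted_predict(X_new, initial, trees, learning_rate=0.1):
    # Step 6: the final prediction is the initial guess plus all scaled corrections
    prediction = np.full(len(X_new), initial)
    for tree in trees:
        prediction += learning_rate * tree.predict(X_new)
    return prediction

The learning rate deliberately shrinks each tree's contribution; smaller values generally require more iterations but tend to produce a model that generalizes better.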

Implementation from scratch

We can generate a synthetic regression dataset using the make_regression function of sklearn as a starting point. This lets us implement gradient boosting on the dataset and compare the results with the gradient boosting regressor provided by the scikit-learn library.

Python 3.10.4
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=100, n_features=10, noise=1, random_state=42)
print(X.shape, y.shape)

Then, we split the dataset using the train_test_split function of sklearn.model_selection so that we can train a gradient boosting regressor on the training set and evaluate it on the test set.

Python 3.10.4
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

Initialization: Setting up the predictor

First, we initialize our predictor with a constant $c$, which in most cases is the mean of the target variable. So, we have to create a list or array of the same size as our input array, where each element contains the constant $c$. Here’s the code that initializes the predictor by setting all predicted values to the mean of the input list:

Python 3.10.4
import numpy as np

# initialization of the predictor with the mean of the target
def initial_prediction(y):
    mean = []
    a = np.mean(y)
    for i in range(0, len(y)):
        mean.append(a)
    return mean

y_i = initial_prediction(y_train)
print(len(y_i))
print(y_i[:5])

The following is an explanation of the code above:

  • Lines 4–9: We define a function initial_prediction(y) that takes a list y as input and returns a new list where each element is set to the mean value of the input list, i.e., y.

  • Lines 11–13: We call the initial_prediction() function on the list of target variables and then print its length and first five elements.
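As a side note, the explicit loop in initial_prediction() can be replaced with a single NumPy call. The version below (the name initial_prediction_vectorized is just an illustrative alternative, not part of the lesson's reference code) produces the same result:

Python 3.10.4
import numpy as np

# vectorized equivalent: an array containing len(y) copies of the target mean
def initial_prediction_vectorized(y):
    return np.full(len(y), np.mean(y))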

Calculating negative gradients (residuals)

The second step of the algorithm is to calculate the negative gradients (also known as residuals) of the loss function with respect to the current predictions. For simplicity, consider the squared error loss (although other loss functions can be chosen as well):

$$
Loss = \frac{1}{2} (y_i - \hat{y}_i)^2
$$

Here, $y_i$ ...