
Gradient Boosting Tree

Explore how Gradient Boosting Decision Trees (GBDT) work as a boosting ensemble method. Understand boosting versus bagging, model regularization, and hands-on implementation using Scikit-Learn, enhancing your ability to apply GBDT for complex data patterns.

What is an ensemble method?

Ensemble methods are a family of machine learning techniques that combine multiple learning algorithms to obtain better predictive performance than any single one of them. Generally speaking, there are three types of ensembles: boosting, bagging, and stacking. In this course, we focus on bagging and boosting. In this lesson, we will learn one boosting method, the gradient boosting decision tree (GBDT).

The core principle of bagging is to build many weak estimators; each estimator is trained and makes predictions independently, and the final result is a combination of their predictions. For regression, the predictions are averaged; for classification, the estimators vote. Random Forest is one such method, with an ordinary decision tree as the weak estimator.

By contrast, in boosting, base estimators are built sequentially, and each one tries to reduce the bias of the combined estimator. There is an interdependence between the weak estimators. The motivation is to combine several weak models to produce a powerful ensemble.

From the perspective of learning theory, bagging mainly reduces the variance of the model, while boosting mainly reduces its bias.
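
To make the contrast concrete, here is a minimal sketch that compares a bagging ensemble (RandomForestClassifier) with a boosting ensemble (GradientBoostingClassifier) on the same data. The synthetic dataset and hyperparameter values are illustrative assumptions, not part of the lesson.

# A minimal sketch comparing bagging and boosting in Scikit-Learn.
# The dataset and hyperparameters are illustrative only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=0)

# Bagging: trees are built independently and their predictions are combined by voting.
bagging = RandomForestClassifier(n_estimators=50, random_state=0)
# Boosting: trees are built sequentially, each one correcting the errors of the previous ones.
boosting = GradientBoostingClassifier(n_estimators=50, random_state=0)

for name, model in [("bagging (random forest)", bagging),
                    ("boosting (GBDT)", boosting)]:
    model.fit(train_x, train_y)
    print(name, "test accuracy:", model.score(test_x, test_y))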

Before we get hands-on, let’s look at how GBDT works; it will help us understand the algorithm a little better.

What is GBDT?

Gradient Tree Boosting, or Gradient Boosted Decision Trees (GBDT), is a generalization of boosting to arbitrary differentiable loss functions. It is widely used in many fields because it is an accurate and effective method for both regression and classification tasks. Typically, gradient tree boosting uses decision trees as the weak estimators.

Generic gradient boosting at the $m$-th step fits a decision tree $h_m(x)$ to the residuals. Let $J_m$ be the number of its leaves. The tree partitions the input space into $J_m$ disjoint regions $R_{1m}, \ldots, R_{J_m m}$ and predicts a constant value in each region. The output of $h_m(x)$ for input $x$ is:

$$h_m(x) = \sum_{j=1}^{J_m} b_{jm} \mathbf{1}_{R_{jm}}(x)$$

where $b_{jm}$ is the value predicted in the region $R_{jm}$.

Then the coefficients $b_{jm}$ are multiplied by some value $\gamma_m$, chosen using line search so as to minimize the loss function, and the model is updated as follows:

$$F_m(x) = F_{m-1}(x) + \gamma_m h_m(x), \quad \gamma_m = \underset{\gamma}{\arg\min} \sum_{i=1}^{n} L\bigl(y_i,\, F_{m-1}(x_i) + \gamma\, h_m(x_i)\bigr)$$

Hopefully, these formulas don’t scare you; they are only meant to show what boosting means. When we build a GBDT, we build many decision trees (the $h_m(x)$ above), one by one. Unlike a standalone decision tree, each tree here is fit to the residuals of the previous step rather than to the true values. In other words, each step (tree) reduces the remaining error.
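
If you want to see the update rule in action, the hedged sketch below reproduces the idea by hand for a regression problem with squared-error loss: each DecisionTreeRegressor is fit to the residuals of the current ensemble, and the ensemble prediction is the running sum of the trees’ (shrunken) outputs. The toy data, tree depth, and learning rate are illustrative assumptions.

# A minimal sketch of gradient boosting by hand (squared-error loss).
# Each tree h_m is fit to the residuals y - F_{m-1}(x), then added to the model.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

F = np.zeros_like(y)          # F_0(x): start from a constant (zero) prediction
learning_rate = 0.1           # plays the role of the shrinkage factor nu
trees = []

for m in range(50):
    residuals = y - F                        # for squared loss, the negative gradient
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)                   # h_m(x) is fit to the residuals
    F += learning_rate * tree.predict(X)     # F_m = F_{m-1} + nu * h_m
    trees.append(tree)

print("training MSE after boosting:", np.mean((y - F) ** 2))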

How does regularization work in GBDT?

We mentioned in the lesson on decision trees that one of the disadvantages of tree-based models is that they overfit easily, so regularization is very important for them. If you ignore it, you may get perfect performance on your training data but do very poorly on the test set, because the model’s excellent fitting ability lets it fit the noise as well.
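
A simple way to see this is to compare accuracy on the training set with accuracy on a held-out test set, as in the hedged sketch below. The dataset choice and the deliberately high-capacity settings (many deep trees) are illustrative assumptions.

# A sketch of diagnosing overfitting: train accuracy vs. held-out test accuracy.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_breast_cancer(return_X_y=True)
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=42)

# Many deep trees give the model enough capacity to fit noise in the training data.
gb = GradientBoostingClassifier(n_estimators=500, max_depth=6, random_state=10)
gb.fit(train_x, train_y)

print("train accuracy:", gb.score(train_x, train_y))   # often close to 1.0
print("test accuracy:", gb.score(test_x, test_y))      # typically lower than the train score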

There are many regularization methods for GBDT; some of them are listed below, and a short code sketch after the list shows how they map onto Scikit-Learn parameters:

  • Shrinkage: You may have noticed the $\gamma_m$ in the update of $F_m(x)$. An important form of regularization in gradient boosting is shrinkage, which modifies the update rule to:

$$F_m(x) = F_{m-1}(x) + \nu \cdot \gamma_m h_m(x), \quad 0 < \nu \leq 1$$

  • Penalize tree complexity: You can add a penalty on the tree, such as on its depth, its number of nodes, or the weights of its leaves.
  • The number of instances in leaves: Limit the minimum number of observations in each leaf. Imposing this limit helps to reduce the variance of the predictions at the leaves.
  • Subsampling of features and instances: In every step (tree), select a random subset of the samples and features to build the tree.
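
Here is the sketch mentioned above: a hedged mapping of these regularization ideas onto GradientBoostingClassifier constructor parameters. The specific values are illustrative assumptions, not recommendations.

# A minimal sketch of how the regularization ideas above map onto
# GradientBoostingClassifier parameters (values are illustrative only).
from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,       # shrinkage: the nu in the update rule
    max_depth=3,             # penalize tree complexity by limiting depth
    min_samples_leaf=5,      # minimum number of instances in each leaf
    subsample=0.8,           # use a random 80% of the instances for each tree
    max_features=0.8,        # use a random 80% of the features for each split
    random_state=10,
)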

Modeling on breast cancer dataset

First, we can try GBDT on the breast cancer dataset. We will skip loading the dataset here, since you can try the complete code later; let’s focus on how to create a GBDT.

Create a GBDT classifier from the ensemble module by using GradientBoostingClassifier, and then fit it. This model lets you set many parameters; however, we only set n_estimators=20 in this demo. On such a small dataset, twenty trees seem like enough.

from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(n_estimators=20, random_state=10)
# train_x, train_y are your training data and labels.
gb.fit(train_x, train_y)
Python 3.5
import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split
import sklearn.metrics as metrics
from sklearn.ensemble import GradientBoostingClassifier
import matplotlib.pyplot as plt
import numpy as np
X, y = datasets.load_breast_cancer(return_X_y=True)
print("breast cancer data size is {}".format(X.shape))
print("breast cancer target size is {}".format(y.shape))
train_x, test_x, train_y, test_y = train_test_split(X,
                                                     y,
                                                     test_size=0.2,
                                                     random_state=42)
gb = GradientBoostingClassifier(n_estimators=20, random_state=10)
gb.fit(train_x, train_y)
pred_y = gb.predict(test_x)
cr = metrics.classification_report(test_y, pred_y)
print(cr)
  • The breast cancer dataset is loaded with load_breast_cancer and split into a training set and a test set with train_test_split.

  • A GBDT object is created from GradientBoostingClassifier with n_estimators=20, and then it is trained with fit on the training data.

  • This example uses classification_report to evaluate the performance of the model; you can check its output after running the script.
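
If you want to look a bit deeper into the fitted model, a short follow-up sketch like the one below can list the most influential features via the feature_importances_ attribute. It assumes the gb object has been fit as in the script above; the feature names come from the dataset bundle.

# An optional follow-up: inspect which features the fitted GBDT relies on most.
# Assumes `gb` has been fit as in the script above.
import numpy as np
import sklearn.datasets as datasets

feature_names = datasets.load_breast_cancer().feature_names
importances = gb.feature_importances_

# Print the five most important features, largest first.
for idx in np.argsort(importances)[::-1][:5]:
    print("{}: {:.3f}".format(feature_names[idx], importances[idx]))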

Modeling on nonlinear dataset

You have probably heard a lot about the nonlinear capabilities of tree models. The way trees split is not linear, which means they may perform well on datasets that are not linearly separable. Let’s have a look at a generated dataset, the same one we tried with SVM.

First, let’s create a dataset that is not linearly separable. If you plot the data with Matplotlib, you will notice that it cannot be separated by a straight line. You can see the image below:

X, y = datasets.make_circles(200, noise=0.2, factor=0.1, random_state=12)

Then, let’s create a GBDT model and fit it. In this example, we draw a decision boundary to show how the GBDT separates the data, as the image below shows. The red line is the decision boundary generated by your GBDT, and it separates the two types of data points nicely. We will skip the code here and show you the complete code below.

Python 3.5
import sklearn.datasets as datasets
import sklearn.metrics as metrics
from sklearn.ensemble import GradientBoostingClassifier
import matplotlib.pyplot as plt
import numpy as np
X, y = datasets.make_circles(200, noise=0.2, factor=0.1, random_state=12)
fig, axe = plt.subplots()
axe.scatter(X[:, 0], X[:, 1], c=y)
gb2 = GradientBoostingClassifier(n_estimators=20, random_state=10)
gb2.fit(X, y)
xlim = axe.get_xlim()
ylim = axe.get_ylim()
xlin = np.linspace(xlim[0], xlim[1], 30)
ylin = np.linspace(ylim[0], ylim[1], 30)
Yc, Xc = np.meshgrid(ylin, xlin)
xy = np.vstack([Xc.ravel(), Yc.ravel()]).T
P = gb2.decision_function(xy).reshape(Xc.shape)
axe.contour(Xc, Yc, P, levels=[0], colors='r')
fig.savefig("output/img.png", dpi = 300)
plt.close(fig)
  • First, a nonlinear dataset is created with make_circles; you can see the plot above. The scatter call on axe plots the dataset.

  • Then a GBDT object is created by GradientBoostingClassifier and trained (fit) on the whole dataset.

  • The rest of the script plots the decision boundary, just like the red circle above. First, a mesh grid is created with np.meshgrid, based on the X/Y ranges of your dataset.

A mesh grid is a rectangular grid built from two given one-dimensional arrays, representing Cartesian or matrix indexing of the plane. In this example, the two arrays are xlin and ylin.

  • Then, the GBDT object uses each data point in the mesh grid as input for prediction via decision_function. P is the output of your model for each point of the mesh grid, and it is then used as the basis for the contour plot. For more information, see the Matplotlib contour documentation.
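
If decision_function feels opaque, an equivalent way to draw the boundary is to contour the predicted probability of the positive class at the 0.5 level. The hedged sketch below assumes gb2, Xc, Yc, xy, fig, and axe from the script above; the output filename is hypothetical.

# An alternative sketch for the boundary: contour the positive-class probability
# at 0.5 instead of the decision function at 0.
# Assumes gb2, Xc, Yc, xy, fig, and axe from the script above.
proba = gb2.predict_proba(xy)[:, 1].reshape(Xc.shape)
axe.contour(Xc, Yc, proba, levels=[0.5], colors='b', linestyles='--')
fig.savefig("output/img_proba.png", dpi=300)   # hypothetical output path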

Please launch the widget to open the Jupyter notebook; there you can find more information about GBDT.
