Gradient Boosting Tree
Explore how Gradient Boosting Decision Trees (GBDT) work as a boosting ensemble method. Understand boosting versus bagging, model regularization, and hands-on implementation using Scikit-Learn, enhancing your ability to apply GBDT for complex data patterns.
What is an ensemble method?
Ensemble methods are a family of Machine Learning techniques that combine multiple learning algorithms to obtain better predictive performance than any single method. Generally speaking, there are three types of ensembles: boosting, bagging, and stacking. In this course, we cover bagging and boosting. In this lesson, we will learn one boosting method, the gradient boosting decision tree (GBDT).
The core principle of bagging is to build many weak estimators; each estimator is trained and makes predictions independently. The final result is the combination of their predictions: for regression, the average of the individual results; for classification, a majority vote. Random Forest is one such method, with an ordinary decision tree as the weak estimator.
By contrast, in the boosting method, base estimators are built sequentially, and one tries to reduce the bias of the combined estimator. There is an interdependence between the weak estimators. The motivation is to combine several weak models to produce a powerful ensemble.
From the perspective of learning theory, bagging reduces the variance of the model, while boosting reduces the bias of the model.
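To make the contrast concrete, here is a minimal sketch comparing a bagging ensemble (Random Forest) with a boosting ensemble (GBDT) in scikit-learn. The synthetic dataset and all hyperparameter values are illustrative choices, not part of this lesson's notebook:

```python
# Bagging vs. boosting on a synthetic classification problem.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Bagging: independent trees, predictions combined by voting.
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
# Boosting: trees built sequentially, each one correcting the previous ones.
gb = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

print("random forest accuracy:", rf.score(X_te, y_te))
print("gradient boosting accuracy:", gb.score(X_te, y_te))
```

Both classifiers expose the same fit/predict interface; the difference is entirely in how the trees inside them are built.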
Before we get hands-on, let’s look at GBDT first. It helps us understand the algorithm a little better.
What is GBDT?
Gradient Tree Boosting or Gradient Boosted Decision Trees (GBDT) is a generalization of boosting to arbitrary differentiable loss functions. It’s a widely used method in many fields because it’s an accurate and effective method for both regression and classification tasks. Typically, the gradient tree uses the decision tree as the weak estimator.
Generic gradient boosting at the $m$-th step fits a decision tree $h_m(x)$ to residuals. Let $J_m$ be the number of its leaves. The tree partitions the input space into $J_m$ disjoint regions $R_{1m}, \ldots, R_{J_m m}$ and predicts a constant value in each region. The output of $h_m$ for input $x$ can be written as:

$$h_m(x) = \sum_{j=1}^{J_m} b_{jm} \mathbf{1}_{R_{jm}}(x),$$

where $b_{jm}$ is the value predicted in the region $R_{jm}$.

Then the coefficients $b_{jm}$ are multiplied by some value $\gamma_m$, chosen using line search so as to minimize the loss function $L$, and the model is updated as follows:

$$F_m(x) = F_{m-1}(x) + \gamma_m h_m(x), \qquad \gamma_m = \underset{\gamma}{\arg\min} \sum_{i=1}^{n} L\bigl(y_i,\, F_{m-1}(x_i) + \gamma\, h_m(x_i)\bigr).$$
Hopefully, these formulas don’t scare you; I just want to show you what boosting means. When we build a GBDT, we build many decision trees (the $h_m$ above), one by one. Unlike a standalone decision tree, each tree here fits the residuals of the previous step rather than the true values. In other words, each step (tree) reduces the remaining error.
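The update above can be hand-rolled in a few lines for squared loss, where the negative gradient is simply the residual $y - F(x)$. This is a sketch, not scikit-learn's implementation: the dataset is synthetic, and a fixed step size `gamma` stands in for the line-search coefficient $\gamma_m$:

```python
# Hand-rolled gradient boosting for regression with squared loss.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

F = np.full_like(y, y.mean())   # F_0: constant initial model
gamma = 1.0                     # fixed step size (line search omitted for brevity)
for m in range(20):
    residuals = y - F                      # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    F = F + gamma * tree.predict(X)        # F_m = F_{m-1} + gamma * h_m

print("training MSE after boosting:", np.mean((y - F) ** 2))
```

Each weak tree only needs to capture a little of the remaining error; the sum of all twenty trees fits the sine curve closely.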
How does regularization work in GBDT?
As we mentioned in the lesson about decision trees, one disadvantage of tree-based models is that they overfit easily, so regularization is very important for them. If you ignore it, you may get perfect performance on your training data but do very poorly on the test set, because the model’s excellent fitting ability lets it fit the noise as well.
There are many regularization methods for GBDT, below are some of them:
- Shrinkage: You may notice there is a coefficient $\gamma_m$ when updating $F_m$. An important part of the gradient boosting method is regularization by shrinkage, which modifies the update rule to $F_m(x) = F_{m-1}(x) + \nu \gamma_m h_m(x)$, where the learning rate $0 < \nu \le 1$ scales down the contribution of each tree.
- Penalize complexity of the tree: You can add a penalty on your tree, such as on the depth of the tree, the number of nodes, or the weight of each leaf.
- The number of instances in leaves: Limit the minimum number of observations in leaves. Imposing this limit helps to reduce variance in predictions at leaves.
- Subsampling of features and instances: In every step (tree), select some samples and features at random to build the tree.
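In scikit-learn, each of the regularization ideas above maps to a constructor parameter of GradientBoostingClassifier. The values below are a sketch to show the mapping; they are not tuned for any particular dataset:

```python
# How the regularization knobs above map to GradientBoostingClassifier parameters.
from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(
    learning_rate=0.1,    # shrinkage: the factor that scales each tree's contribution
    max_depth=3,          # penalize tree complexity via a depth limit
    min_samples_leaf=5,   # minimum number of instances in each leaf
    subsample=0.8,        # random subsample of instances used to fit each tree
    max_features=0.8,     # random subsample of features considered at each split
    n_estimators=100,
    random_state=0,
)
```

Lowering `learning_rate` usually requires raising `n_estimators` to compensate, which is a common bias-variance trade-off when tuning GBDT.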
Modeling on breast cancer dataset
First, we can try the GBDT on the breast cancer dataset. We can skip loading the dataset as you can try the complete code later. Let’s focus on how to create a GBDT.
Create a GBDT classifier from the ensemble module by using GradientBoostingClassifier and then fit it. This model lets you set many parameters; however, we only set n_estimators=20 in this demo. For such a small dataset, twenty trees seem like enough.
from sklearn.ensemble import GradientBoostingClassifier
gb = GradientBoostingClassifier(n_estimators=20, random_state=10)
# train_x, train_y are your training data and labels.
gb.fit(train_x, train_y)
- From line 8 to line 12, the breast cancer dataset is loaded and split into two parts.
- A GBDT object is created from GradientBoostingClassifier with n_estimators=20 at line 17. Then this object is fitted at line 18.
- This example uses the classification_report to evaluate the performance of the model. You can check the output at line 21.
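The whole workflow the walkthrough above describes can be sketched end to end as follows; the split ratio and random seeds are illustrative assumptions, not the notebook's exact values:

```python
# Load the breast cancer dataset, split it, fit a 20-tree GBDT, and evaluate it.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
train_x, test_x, train_y, test_y = train_test_split(
    data.data, data.target, test_size=0.3, random_state=10)

gb = GradientBoostingClassifier(n_estimators=20, random_state=10)
gb.fit(train_x, train_y)

# classification_report prints precision, recall, and F1 per class.
pred_y = gb.predict(test_x)
print(classification_report(test_y, pred_y))
```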
Modeling on nonlinear dataset
You have probably heard a lot about the nonlinear capabilities of tree models. Tree splits are not linear, which means trees may perform well on datasets that are not linearly separable. Let’s look at a generated dataset, which we already tried with SVM.
First, let’s create a dataset that is not linearly separable. If you plot the data with Matplotlib, you will see that no straight line can split the two classes. You can see the image below:
X, y = datasets.make_circles(200, noise=0.2, factor=0.1, random_state=12)
Then, let’s create a GBDT model and fit it. In this example, we draw a decision boundary to show how the GBDT separates the data, as the image below shows. The red line is the decision boundary generated by your GBDT, which separates the two types of data points nicely. We will skip the code here and show you the complete code below.
- First, a nonlinear dataset is created using make_circles at line 7. You can see the plot above. The code from line 9 to line 10 is used to plot the dataset.
- Then a GBDT object is created at line 12 by GradientBoostingClassifier and trained (fit).
- From line 13 to line 23, the decision boundary is plotted, just like the red circle above. First, a mesh grid is created at line 20 based on your dataset’s x/y range. A mesh grid is a rectangular grid built from two given one-dimensional arrays, representing Cartesian or matrix indexing; in this example, the two arrays are xlin and ylin.
- Then the GBDT object uses each data point in the mesh grid as input for prediction at line 22. P is the output of your model for each data point of the mesh grid, and P is then used as the basis for the contour at line 23. For more information, you can check the Matplotlib contour documentation.
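The plotting steps above can be sketched as below. This is an assumption about how the notebook is laid out: the variable names (xlin, ylin, P), grid resolution, and output filename are illustrative, though np.meshgrid plus plt.contour is the standard way to draw such a boundary:

```python
# Draw a GBDT decision boundary on the make_circles dataset.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs headlessly
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier

X, y = datasets.make_circles(200, noise=0.2, factor=0.1, random_state=12)
gb = GradientBoostingClassifier(n_estimators=20, random_state=10).fit(X, y)

# Build a mesh grid covering the data's x/y range.
xlin = np.linspace(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, 200)
ylin = np.linspace(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5, 200)
xx, yy = np.meshgrid(xlin, ylin)

# Predict a class for every grid point, then draw the 0/1 transition as the boundary.
P = gb.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.contour(xx, yy, P, levels=[0.5], colors="red")
plt.savefig("gbdt_boundary.png")
```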
Please launch the widget to open the Jupyter notebook. There you can find more information about GBDT.