Introduction: XGBoost

Get introduced to the topics of this chapter: XGBoost and LightGBM, two scalable, end-to-end tree boosting systems.

Overview

After reading this chapter, you will be able to describe the concept of gradient boosting, the fundamental idea underlying the XGBoost package. You will then train XGBoost models on synthetic data, learning about early stopping and several XGBoost hyperparameters along the way. In addition to growing trees as we have previously, by limiting their depth with max_depth, you'll discover a new way of growing trees that XGBoost offers: loss-guided tree growing. After learning about XGBoost, you'll be introduced to a new and powerful way of explaining model predictions, called SHAP (SHapley Additive exPlanations). You will see how SHAP values can provide individualized explanations for model predictions on any dataset, not just the training data, and you will understand the additive property of SHAP values.
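To make these ideas concrete before we work through them in detail, below is a minimal sketch of what training an XGBoost classifier with early stopping, loss-guided tree growing, and SHAP explanations can look like. The synthetic dataset, the hyperparameter values, and the use of the shap package are illustrative assumptions, not the chapter's exact code, and they assume reasonably recent versions of xgboost and shap.

# A minimal sketch, assuming synthetic data from scikit-learn and recent
# versions of the xgboost and shap packages; hyperparameter values are
# illustrative only.
import xgboost as xgb
import shap
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Illustrative synthetic binary classification problem
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Depth-wise growth is limited by max_depth; early stopping watches the
# validation metric and halts boosting once it stops improving.
model = xgb.XGBClassifier(
    n_estimators=1000,
    max_depth=3,
    learning_rate=0.1,
    eval_metric="logloss",
    early_stopping_rounds=10,  # constructor argument in xgboost >= 1.6
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print("Stopped at boosting round:", model.best_iteration)

# Loss-guided growth instead splits whichever leaf most reduces the loss,
# capped by max_leaves; it requires the histogram-based tree method.
lossguide = xgb.XGBClassifier(
    tree_method="hist",
    grow_policy="lossguide",
    max_leaves=16,
    n_estimators=200,
    learning_rate=0.1,
)
lossguide.fit(X_train, y_train)

# SHAP values give additive, per-prediction feature attributions: for each
# row, the SHAP values plus the expected value sum to the model's raw
# (log-odds) output for that row.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_val)
print(shap_values.shape)  # one attribution per feature per validation row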

As we saw in the previous chapter, decision trees and the ensemble models built on them provide powerful methods for creating machine learning models. While random forests have been around for decades, recent work on a different kind of tree ensemble, gradient boosted trees, has produced state-of-the-art models that now dominate predictive modeling with tabular data, that is, data organized into a structured table, like the case study data.

Packages for accurate predictive models

The two main packages used by machine learning data scientists today to create the most accurate predictive models with tabular data are XGBoost and LightGBM. In this chapter, we’ll become familiar with XGBoost using a synthetic dataset, and then apply it to the case study data in the activity.
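As a quick orientation, the sketch below shows that both packages expose a similar scikit-learn-style interface; the data and settings are illustrative assumptions, and it simply assumes both packages are installed.

# A minimal sketch, assuming xgboost and lightgbm are installed; it only
# shows that the two packages offer similar scikit-learn-style estimators.
import xgboost as xgb
import lightgbm as lgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

xgb_clf = xgb.XGBClassifier(n_estimators=100).fit(X, y)
lgb_clf = lgb.LGBMClassifier(n_estimators=100).fit(X, y)

# Both expose predict and predict_proba like other scikit-learn estimators
print(xgb_clf.predict_proba(X[:3]))
print(lgb_clf.predict_proba(X[:3]))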
