Understanding the SHAP technique

Along with cutting-edge modeling techniques such as XGBoost, the practice of explaining model predictions has undergone substantial development in recent years. So far, we’ve learned that logistic regression coefficients and feature importances from random forests can provide insight into the reasons for model predictions. A more powerful technique for explaining model predictions was described in a 2017 paper, A Unified Approach to Interpreting Model Predictions, by Scott Lundberg and Su-In Lee. This technique is known as SHAP (SHapley Additive exPlanations) because it is based on earlier work by mathematician Lloyd Shapley. Shapley developed an area of game theory to understand how coalitions of players contribute to the overall outcome of a game. Recent machine learning research into model explanation has leveraged this concept to consider how groups, or coalitions, of features in a predictive model contribute to the model’s output prediction. By considering the contributions of different groups of features, the SHAP method can isolate the effect of individual features.
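
To make the coalition idea concrete, here is a small, self-contained sketch (an illustration only, not the algorithm used by the shap package) that computes exact Shapley values for a toy three-feature “model.” Each feature’s Shapley value is a weighted average of its marginal contribution over all coalitions of the other features; the feature values and the interaction bonus used here are invented for the example:

from itertools import combinations
from math import factorial

# Toy value function: the "model output" when only the features in
# coalition S are known. The feature values and the interaction bonus
# between features 0 and 1 are made up for illustration.
feature_values = {0: 1.0, 1: 2.0, 2: 0.5}

def coalition_value(S):
    total = sum(feature_values[i] for i in S)
    if 0 in S and 1 in S:
        total += 1.0  # interaction effect shared by features 0 and 1
    return total

n = len(feature_values)
shapley = {}
for i in range(n):
    others = [j for j in range(n) if j != i]
    phi = 0.0
    # Weighted average of feature i's marginal contribution over all
    # coalitions S of the remaining features
    for size in range(n):
        for S in combinations(others, size):
            weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            phi += weight * (coalition_value(set(S) | {i}) - coalition_value(set(S)))
    shapley[i] = phi

print(shapley)  # approximately {0: 1.5, 1: 2.5, 2: 0.5}
# The Shapley values sum to the value of the full coalition minus the empty coalition
print(sum(shapley.values()), coalition_value({0, 1, 2}) - coalition_value(set()))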

Some notable aspects of using SHAP values to explain model predictions include:

  • SHAP values can be used to make individualized explanations of model predictions; in other words, the prediction of a single sample, in terms of the contribution of each feature, can be understood using SHAP. This is in contrast to the feature importance method of explaining random forests that we’ve already seen, which only considers the average importance of a feature across the model training set.

  • SHAP values are calculated relative to a background dataset. By default, this is the training data, although other datasets can be supplied.

  • SHAP values are additive, meaning that for the prediction of an individual sample, the SHAP values can be added up to recover the value of the prediction, for example, a predicted probability.
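
As a concrete illustration of the additive property, consider the small sketch below with made-up numbers (the SHAP values and base value are hypothetical, not computed from a real model). For an XGBoost binary classifier, the raw output that SHAP values typically sum to is a log-odds score, which the sigmoid function converts to a predicted probability:

import numpy as np

# Hypothetical SHAP values for one sample with four features
shap_values_for_sample = np.array([0.25, -0.10, 0.05, 0.30])

# Hypothetical base value: the expected model output over the background dataset
base_value = -1.20

# Additivity: the base value plus the sample's SHAP values recovers the
# raw model output, a log-odds score for a binary classifier
raw_prediction = base_value + shap_values_for_sample.sum()

# Applying the sigmoid function converts the log-odds score to a probability
predicted_probability = 1 / (1 + np.exp(-raw_prediction))
print(raw_prediction, predicted_probability)  # approximately -0.70 and 0.33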

Implementation of the SHAP method

There are different implementations of the SHAP method for various types of models, and here we will focus on SHAP for trees (Lundberg et al., 2019) to gain insight into XGBoost model predictions on our validation set of synthetic data. First, let’s refit xgb_model_3 from the previous section with the optimal value of max_leaves, 20:

%%time
xgb_model_3.set_params(**{'max_leaves': 20})
xgb_model_3.fit(X_train, y_train, eval_set=eval_set, eval_metric='auc',
                verbose=False, early_stopping_rounds=30)
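
After refitting, it can be useful to confirm how many boosting rounds early stopping retained and the best validation AUC. A quick check along the following lines uses results that XGBoost’s scikit-learn interface makes available after fitting with an eval_set and early_stopping_rounds (the exact key name in the evaluation history, such as 'validation_0', depends on how the eval_set was supplied):

# Boosting round that achieved the best validation score
print(xgb_model_3.best_iteration)

# Best value of the evaluation metric (AUC) on the validation set
print(xgb_model_3.best_score)

# Per-round history of the evaluation metric
evals_result = xgb_model_3.evals_result()
print(list(evals_result.keys()))  # for example, ['validation_0']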

Now we’re ready to start calculating SHAP values for the validation dataset. There are 40 features and 1,000 samples here:

X_val.shape

This should output the following:

(1000, 40)

To automatically label the plots we’ll make with the shap package, we’ll put the validation set features in a DataFrame with column names. We’ll use a list comprehension to make generic feature names, for example, “Feature 0”, “Feature 1”, and so on, and create the DataFrame as follows:

feature_names = ['Feature {number}'.format(number=number) for number in range(X_val.shape[1])] 
X_val_df = pd.DataFrame(data=X_val, columns=feature_names) 
X_val_df.head()

The DataFrame head should look like this:

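With the labeled validation features prepared, one way to proceed is to compute the SHAP values using the shap package’s tree explainer. The sketch below is illustrative rather than definitive: it assumes the shap package is installed and uses the validation DataFrame itself as the background dataset:

import shap

# Tree explainer for the fitted XGBoost model, using the validation
# features as the background dataset (an illustrative choice)
explainer = shap.TreeExplainer(xgb_model_3, data=X_val_df)

# SHAP values: one row per validation sample, one column per feature
shap_values = explainer.shap_values(X_val_df)
print(shap_values.shape)  # expected: (1000, 40)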