Understanding the SHAP technique

Along with cutting-edge modeling techniques such as XGBoost, the practice of explaining model predictions has undergone substantial development in recent years. So far, we’ve learned that logistic regression coefficients and feature importances from random forests can provide insight into the reasons for model predictions. A more powerful technique for explaining model predictions was described in a 2017 paper, A Unified Approach to Interpreting Model Predictions, by Scott Lundberg and Su-In Lee. This technique is known as SHAP (SHapley Additive exPlanations) because it is based on earlier work by mathematician Lloyd Shapley. Shapley developed an area of game theory to understand how coalitions of players contribute to the overall outcome of a game. Recent machine learning research into model explanation has leveraged this concept to consider how groups, or coalitions, of features in a predictive model contribute to the model’s output prediction. By considering the contributions of different groups of features, the SHAP method can isolate the effect of individual features.
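
To make the coalition idea concrete, here is a small, self-contained sketch (an illustration only, not the algorithm used by the shap package) that computes exact Shapley values for a toy three-feature “model.” Each feature’s Shapley value is a weighted average of its marginal contribution over all coalitions of the other features; the feature values and the interaction bonus used here are invented for the example:

from itertools import combinations
from math import factorial

# Toy value function: the "model output" when only the features in
# coalition S are known. The feature values and the interaction bonus
# between features 0 and 1 are made up for illustration.
feature_values = {0: 1.0, 1: 2.0, 2: 0.5}

def coalition_value(S):
    total = sum(feature_values[i] for i in S)
    if 0 in S and 1 in S:
        total += 1.0  # interaction effect shared by features 0 and 1
    return total

n = len(feature_values)
shapley = {}
for i in range(n):
    others = [j for j in range(n) if j != i]
    phi = 0.0
    # Weighted average of feature i's marginal contribution over all
    # coalitions S of the remaining features
    for size in range(n):
        for S in combinations(others, size):
            weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            phi += weight * (coalition_value(set(S) | {i}) - coalition_value(set(S)))
    shapley[i] = phi

print(shapley)  # approximately {0: 1.5, 1: 2.5, 2: 0.5}
# The Shapley values sum to the value of the full coalition minus the empty coalition
print(sum(shapley.values()), coalition_value({0, 1, 2}) - coalition_value(set()))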

Some notable aspects of using SHAP values to explain model predictions include:

  • SHAP values can be used to make individualized explanations of model predictions; in other words, the prediction of a single sample, in terms of the contribution of each feature, can be understood using SHAP. This is in contrast to the feature importance method of explaining random forests that we’ve already seen, which only considers the average importance of a feature across the model training set.

  • SHAP values are calculated relative to a background dataset. By default, this is the training data, although other datasets can be supplied.

  • SHAP values are additive, meaning that for the prediction of an individual sample, the SHAP values can be added up to recover the value of the prediction, for example, a predicted probability.
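
As a concrete illustration of the additive property, consider the small sketch below with made-up numbers (the SHAP values and base value are hypothetical, not computed from a real model). For an XGBoost binary classifier, the raw output that SHAP values typically sum to is a log-odds score, which the sigmoid function converts to a predicted probability:

import numpy as np

# Hypothetical SHAP values for one sample with four features
shap_values_for_sample = np.array([0.25, -0.10, 0.05, 0.30])

# Hypothetical base value: the expected model output over the background dataset
base_value = -1.20

# Additivity: the base value plus the sample's SHAP values recovers the
# raw model output, a log-odds score for a binary classifier
raw_prediction = base_value + shap_values_for_sample.sum()

# Applying the sigmoid function converts the log-odds score to a probability
predicted_probability = 1 / (1 + np.exp(-raw_prediction))
print(raw_prediction, predicted_probability)  # approximately -0.70 and 0.33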

Implementation of the SHAP method

There are different implementations of the SHAP method for various types of models, and here we will focus on SHAP for trees (Lundberg et al., 2019) to gain insight into XGBoost model predictions on our validation set of synthetic data. First, let’s refit xgb_model_3 from the previous section with the optimal value of max_leaves, 20:

%%time
xgb_model_3.set_params(**{'max_leaves': 20})
xgb_model_3.fit(X_train, y_train, eval_set=eval_set, eval_metric='auc',
                verbose=False, early_stopping_rounds=30)
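
After refitting, it can be useful to confirm how many boosting rounds early stopping retained and the best validation AUC. A quick check along the following lines uses results that XGBoost’s scikit-learn interface makes available after fitting with an eval_set and early_stopping_rounds (the exact key name in the evaluation history, such as 'validation_0', depends on how the eval_set was supplied):

# Boosting round that achieved the best validation score
print(xgb_model_3.best_iteration)

# Best value of the evaluation metric (AUC) on the validation set
print(xgb_model_3.best_score)

# Per-round history of the evaluation metric
evals_result = xgb_model_3.evals_result()
print(list(evals_result.keys()))  # for example, ['validation_0']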

Now we’re ready to start calculating SHAP values for the validation dataset. There are 40 features and 1,000 samples here:

X_val.shape

This should output the following:

(1000, 40)

To automatically label the plots we’ll make with the shap package, we’ll put the validation set features in a DataFrame with column names. We’ll use a list comprehension to make generic feature names, for example, “Feature 0”, “Feature 1”, and so on, and create the DataFrame as follows:

feature_names = ['Feature {number}'.format(number=number) for number in range(X_val.shape[1])] 
X_val_df = pd.DataFrame(data=X_val, columns=feature_names) 
X_val_df.head()

The DataFrame head should look like this:

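With the labeled validation features prepared, one way to proceed is to compute the SHAP values using the shap package’s tree explainer. The sketch below is illustrative rather than definitive: it assumes the shap package is installed and uses the validation DataFrame itself as the background dataset:

import shap

# Tree explainer for the fitted XGBoost model, using the validation
# features as the background dataset (an illustrative choice)
explainer = shap.TreeExplainer(xgb_model_3, data=X_val_df)

# SHAP values: one row per validation sample, one column per feature
shap_values = explainer.shap_values(X_val_df)
print(shap_values.shape)  # expected: (1000, 40)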