SHAP (SHapley Additive exPlanations) is a game theoretic approach to explain the output of any machine learning model. It connects optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions (see papers for details and citations).
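For intuition, a feature's classic Shapley value is its average marginal contribution to the model output over all orderings of the features. The toy sketch below is purely illustrative (it is not part of the library) and computes exact Shapley values for a hypothetical three-feature payoff function by enumerating subsets:

# Toy illustration (not the SHAP library): exact Shapley values for a
# hypothetical three-player "game" by enumerating all feature subsets.
from itertools import combinations
from math import factorial

features = ["x1", "x2", "x3"]
payoffs = {(): 0.0, ("x1",): 1.0, ("x2",): 2.0, ("x3",): 0.5,
           ("x1", "x2"): 4.0, ("x1", "x3"): 2.0, ("x2", "x3"): 3.0,
           ("x1", "x2", "x3"): 6.0}  # hypothetical model output for each known-feature subset

def shapley(feature):
    others = [f for f in features if f != feature]
    n = len(features)
    total = 0.0
    for k in range(n):
        for subset in combinations(others, k):
            # weight = |S|! (n - |S| - 1)! / n!
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            with_f = payoffs[tuple(sorted(subset + (feature,)))]
            without_f = payoffs[tuple(sorted(subset))]
            total += weight * (with_f - without_f)
    return total

print({f: shapley(f) for f in features})  # credits sum to payoffs[all] - payoffs[empty]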

Install

SHAP can be installed from either PyPI or conda-forge:

pip install shap
or
conda install -c conda-forge shap

Tree ensemble example with TreeExplainer (XGBoost/LightGBM/CatBoost/scikit-learn/pyspark models)

While SHAP can explain the output of any machine learning model, we have developed a high-speed exact algorithm for tree ensemble methods (see our Nature MI paper). Fast C++ implementations are supported for XGBoost, LightGBM, CatBoost, scikit-learn and pyspark tree models:

import xgboost
import shap

# load JS visualization code to notebook
shap.initjs()

# train XGBoost model
X,y = shap.datasets.boston()
model = xgboost.train({"learning_rate": 0.01}, xgboost.DMatrix(X, label=y), 100)

# explain the model's predictions using SHAP
# (same syntax works for LightGBM, CatBoost, scikit-learn and spark models)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# visualize the first prediction's explanation (use matplotlib=True to avoid Javascript)
shap.force_plot(explainer.expected_value, shap_values[0,:], X.iloc[0,:])

(Figure: force plot explaining the first Boston housing prediction, boston_instance.png)

The explanation above shows how each feature contributes to pushing the model output from the base value (the average model output over the training dataset we passed) to the final prediction. Features pushing the prediction higher are shown in red, those pushing it lower are shown in blue (these force plots are introduced in our Nature BME paper).
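Because the explanation is additive, the base value plus a row's SHAP values reproduces the model's raw output for that row. A quick sanity check (reusing the model, explainer, and data from the snippet above; the tolerance is just for illustration) might look like this:

import numpy as np

# base value + sum of SHAP values should equal the model's raw (margin) output
raw_pred = model.predict(xgboost.DMatrix(X.iloc[[0], :]), output_margin=True)[0]
reconstructed = explainer.expected_value + shap_values[0, :].sum()
print(np.isclose(raw_pred, reconstructed, atol=1e-4))  # expected: True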

If we take many explanations such as the one shown above, rotate them 90 degrees, and then stack them horizontally, we can see explanations for an entire dataset (in the notebook this plot is interactive):

# visualize the training set predictions
shap.force_plot(explainer.expected_value, shap_values, X)

(Figure: stacked force plots for the full Boston housing training set, boston_dataset.png)
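When running outside a notebook, the same interactive plot can be written to a standalone HTML file instead of rendered inline; a minimal sketch using shap.save_html:

# save the interactive force plot for the whole dataset to an HTML file
plot = shap.force_plot(explainer.expected_value, shap_values, X)
shap.save_html("force_plot.html", plot)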

To understand how a single feature affects the output of the model, we can plot the SHAP value of that feature vs. the value of the feature for all the examples in a dataset. Since SHAP values represent a feature's responsibility for a change in the model output, the plot below represents the change in predicted house price as RM (the average number of rooms per house in an area) changes. Vertical dispersion at a single value of RM represents interaction effects with other features. To help reveal these interactions, dependence_plot automatically selects another feature for coloring. In this case coloring by RAD (index of accessibility to radial highways) highlights that the average number of rooms per house has less impact on home price for areas with a high RAD value.

# create a dependence plot to show the effect of a single feature across the whole dataset
shap.dependence_plot("RM", shap_values, X)

(Figure: SHAP dependence plot for RM colored by RAD, boston_dependence_plot.png)
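If we would rather choose the coloring feature ourselves instead of relying on the automatic selection, dependence_plot accepts an explicit interaction_index:

# color the RM dependence plot by RAD explicitly
shap.dependence_plot("RM", shap_values, X, interaction_index="RAD")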

To get an overview of which features are most important for a model we can plot the SHAP values of every feature for every sample. The plot below sorts features by the sum of SHAP value magnitudes over all samples, and uses SHAP values to show the distribution of the impacts each feature has on the model output. The color represents the feature value (red high, blue low). This reveals for example that a high LSTAT (% lower status of the population) lowers the predicted home price.
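Such an overview is produced with a summary plot:

# summarize the effects of all the features
shap.summary_plot(shap_values, X)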