SHAP for Categorical Options. Including up SHAP values of categorical… | by Conor O’Sullivan | Jun, 2022

June 21, 2022

1

Including up SHAP values of categorical options which were remodeled with one-hot encodings

Categorical options have to be remodeled earlier than they can be utilized in a mannequin. One-hot encoding is a standard manner to do that: We find yourself with a binary variable for every class. That is superb till it involves understanding the mannequin utilizing SHAP. Each binary variable can have its personal SHAP worth. This makes it obscure the general contribution of the unique categorical function.

A easy method is so as to add the SHAP values for every of the binary variables collectively. This may be interpreted because the SHAP worth for the unique categorical function. We are going to stroll you thru the Python code for doing this. We are going to see that we’re ready to make use of the SHAP aggregation plots. Nonetheless, these are restricted in relation to understanding the character of relationships of the explicit options. So, to finish we present you the way boxplots can be utilized to visualise the SHAP values.

In case you are unfamiliar with SHAP or the python package deal, I recommend studying the article under. We go in depth on interpret SHAP values. We additionally discover a few of the aggregations used on this article.

To show the issue with categorical options, we might be utilizing the mushroom classification dataset. You’ll be able to see a snapshot of this dataset in Determine 1. The goal variable is the mushroom’s class. That’s if the mushroom is toxic (p) or edible (e). You’ll find this dataset in UCI’s MLR.

Determine 1: Mushroom dataset snapshot (supply: creator) (dataset supply: UCI) (licence: CC BY 4.0)

For mannequin options, we now have 22 categorical options. For every function, the classes are represented by a letter. For instance odor has 9 distinctive categories- almond (a), anise (l), creosote (c), fishy (y), foul (f), musty (m), none (n), pungent (p), spicy (s). That is what the mushroom smells like.

We’ll stroll you thru the code used to analyse this dataset and you will discover the total script on GitHub. To begin, we might be utilizing the Python packages under. Now we have some widespread packages for dealing with and visualising knowledge (traces 2–4). We use the OneHotEncoder for reworking the explicit options (line 6). We use xgboost for modelling (line 8). Lastly, we use shap to know how our mannequin works (line 10).