---
output: github_document
bibliography: ./inst/REFERENCES.bib
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%",
tidy = "styler"
)
```
# shapr <img src="man/figures/NR-logo_utvidet_r32g60b136_small.png" align="right" height="50px"/>
<!-- badges: start -->
[![CRAN_Status_Badge](https://www.r-pkg.org/badges/version-last-release/shapr)](https://cran.r-project.org/package=shapr)
[![CRAN_Downloads_Badge](https://cranlogs.r-pkg.org/badges/grand-total/shapr)](https://cran.r-project.org/package=shapr)
[![R build status](https://github.com/NorskRegnesentral/shapr/workflows/R-CMD-check/badge.svg)](https://github.com/NorskRegnesentral/shapr/actions?query=workflow%3AR-CMD-check)
[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/license/mit/)
[![DOI](https://joss.theoj.org/papers/10.21105/joss.02027/status.svg)](https://doi.org/10.21105/joss.02027)
<!-- badges: end -->
## Brief NEWS
### Breaking change (June 2023)
As of version 0.2.3.9000, the development version of shapr (the master branch on GitHub since June 2023) has been substantially restructured. The restructuring introduces a new syntax for explaining models, and with it a range of breaking changes: essentially, a single function (`explain()`) now replaces the previous two-function workflow (`shapr()` followed by `explain()`).
The CRAN version of `shapr` (v0.2.2) still uses the old syntax.
See the [NEWS](https://github.com/NorskRegnesentral/shapr/blob/master/NEWS.md) for details.
The examples below use the new syntax.
[Here](https://github.com/NorskRegnesentral/shapr/blob/cranversion_0.2.2/README.md) is a version of this README with the syntax of the CRAN version (v0.2.2).
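To see the difference at a glance, here is a minimal sketch of the two workflows (the objects `model`, `x_train`, `x_test`/`x_explain` and `p0` are placeholders; see the documentation of the respective versions for the exact argument names):

```{r, eval = FALSE}
# Old syntax (CRAN v0.2.2): a two-step workflow
explainer <- shapr(x_train, model)
explanation <- explain(
  x_test,
  approach = "empirical",
  explainer = explainer,
  prediction_zero = p0
)

# New syntax (development version): a single call to explain()
explanation <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "empirical",
  prediction_zero = p0
)
```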
### Python wrapper
As of version 0.2.3.9100 (master branch on GitHub from June 2023), we provide a Python wrapper (`shaprpy`) which makes it possible to explain Python models with the methodology implemented in `shapr`, directly from Python. The wrapper is available [here](https://github.com/NorskRegnesentral/shapr/tree/master/python). See also details in the [NEWS](https://github.com/NorskRegnesentral/shapr/blob/master/NEWS.md).
## Introduction
The most common machine learning task is to train a model that predicts an unknown outcome (response variable) based on a set of known input variables/features.
When such models are used in real-life applications, it is often crucial to understand why a certain set of features leads to exactly that prediction.
Explaining predictions from complex, or seemingly simple, machine learning models is therefore a practical, ethical, and legal matter. Can I trust the model? Is it biased? Can I explain it to others? We want to explain individual predictions from a complex machine learning model by learning simple, interpretable explanations.
Shapley values constitute the only prediction explanation framework with a solid theoretical foundation (@lundberg2017unified). Unless the true distribution of the features is known and there are fewer than, say, 10-15 features, these Shapley values need to be estimated/approximated.
Popular methods like Shapley Sampling Values (@vstrumbelj2014explaining), SHAP/Kernel SHAP (@lundberg2017unified), and to some extent TreeSHAP (@lundberg2018consistent), assume that the features are independent when approximating the Shapley values for prediction explanation. This may lead to very inaccurate Shapley values, and consequently wrong interpretations of the predictions. @aas2019explaining extends and improves the Kernel SHAP method of @lundberg2017unified to account for the dependence between the features, resulting in significantly more accurate approximations to the Shapley values.
[See the paper for details](https://arxiv.org/abs/1903.10464).
This package implements the methodology of @aas2019explaining.
The following methodology/features are currently implemented:
- Native support of explanation of predictions from models fitted with the following functions
`stats::glm`, `stats::lm`, `ranger::ranger`, `xgboost::xgboost`/`xgboost::xgb.train` and `mgcv::gam`.
- Accounting for feature dependence
    * assuming the features are Gaussian (`approach = 'gaussian'`, @aas2019explaining)
    * with a Gaussian copula (`approach = 'copula'`, @aas2019explaining)
    * using the Mahalanobis distance based empirical (conditional) distribution approach (`approach = 'empirical'`, @aas2019explaining)
    * using conditional inference trees (`approach = 'ctree'`, @redelmeier2020explaining)
    * using the endpoint match method for time series (`approach = 'timeseries'`, @jullum2021efficient)
    * using the joint distribution approach for models with purely categorical data (`approach = 'categorical'`, @redelmeier2020explaining)
    * assuming all features are independent (`approach = 'independence'`, mainly for benchmarking)
- Combining any of the above methods (see the sketch following this list).
- Explain *forecasts* from time series models at different horizons with `explain_forecast()` (R only)
- Batch computation to reduce memory consumption significantly
- Parallelized computation using the [future](https://future.futureverse.org/) framework. (R only)
- Progress bar showing computation progress, using the [`progressr`](https://progressr.futureverse.org/) package. Must be activated by the user.
- Optional use of the AICc criterion of @hurvich1998smoothing when optimizing the bandwidth parameter in the empirical (conditional) approach of @aas2019explaining.
- Functionality for visualizing the explanations. (R only)
- Support for models not supported natively.
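As a sketch of how some of these features fit together, the following (non-evaluated) example enables parallel computation, activates the progress bar, and combines several approaches in one `explain()` call. The combined-approach specification, where element *i* of the vector gives the approach used for coalitions of size *i*, reflects our reading of the current documentation; the objects `model`, `x_explain`, `x_train` and `p0` are placeholders, so consult `?shapr::explain` for the authoritative details.

```{r, eval = FALSE}
library(future)
library(progressr)

# Run the Shapley value computations in parallel over 4 local R sessions
future::plan(multisession, workers = 4)

# Activate progress reporting (must be turned on explicitly by the user)
progressr::handlers(global = TRUE)

# Combine approaches: with 4 features, the vector has length 3,
# one approach per coalition size
explanation <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = c("empirical", "copula", "ctree"),
  prediction_zero = p0
)
```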
<!--
Current methodological restrictions:
- The features must follow a continuous distribution
- Discrete features typically work just fine in practice although the theory breaks down
- Ordered/unordered categorical features are not supported
Future releases will include:
- Computational improvement of the AICc optimization approach,
- Adaptive selection of method to account for the feature dependence.
-->
Note that the prediction outcome must be numeric.
All approaches except `approach = 'categorical'` work for numeric features, but unless the models are very Gaussian-like, we recommend `approach = 'ctree'` or `approach = 'empirical'`, especially if there are discretely distributed features.
When the models contain both numeric and categorical features, we recommend `approach = 'ctree'`.
For models with a smaller number of categorical features (without many levels) and a decent training set, we recommend `approach = 'categorical'`.
For (binary) classification based on time series models, we suggest using `approach = 'timeseries'`.
To explain forecasts of time series models (at different horizons), we recommend using `explain_forecast()` instead of `explain()`.
The former has a more suitable input syntax for explaining those kinds of forecasts.
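As an illustration, here is a minimal, non-evaluated sketch of explaining forecasts from a simple AR(2) model, loosely modeled on the package documentation; the exact arguments of `explain_forecast()` may differ between versions, so check `?shapr::explain_forecast` before relying on it:

```{r, eval = FALSE}
library(shapr)

data <- data.table::as.data.table(airquality)

# Fit an AR(2) model to the temperature series
model_ar <- ar(data$Temp, order.max = 2, aic = FALSE)

# Explain the two-step-ahead forecasts made at time points 152 and 153
explanation_forecast <- explain_forecast(
  model = model_ar,
  y = data[, "Temp"],
  train_idx = 2:151,
  explain_idx = 152:153,
  explain_y_lags = 2,
  horizon = 2,
  approach = "empirical",
  prediction_zero = rep(mean(data$Temp), 2)
)
```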
See the [vignette](https://norskregnesentral.github.io/shapr/articles/understanding_shapr.html) for details and further examples.
Unlike SHAP and TreeSHAP, we decompose probability predictions directly, rather than via log-odds transformations, to ease interpretability.
## Installation
To install the current stable release from CRAN (note that it uses the old explanation syntax), use
```{r, eval = FALSE}
install.packages("shapr")
```
To install the current development version (with the new explanation syntax), use
```{r, eval = FALSE}
remotes::install_github("NorskRegnesentral/shapr")
```
If you would like to install all the packages needed for the models we currently support, use
```{r, eval = FALSE}
remotes::install_github("NorskRegnesentral/shapr", dependencies = TRUE)
```
If you would also like to build and view the vignette locally, use
```{r, eval = FALSE}
remotes::install_github("NorskRegnesentral/shapr", dependencies = TRUE, build_vignettes = TRUE)
vignette("understanding_shapr", "shapr")
```
You can always check out the latest version of the vignette [here](https://norskregnesentral.github.io/shapr/articles/understanding_shapr.html).
## Example
`shapr` supports computation of Shapley values for any predictive model that takes a set of numeric features and produces a numeric outcome.
The following example shows how a simple `xgboost` model is trained using the *airquality* dataset, and how `shapr` explains the individual predictions.
```{r basic_example, warning = FALSE}
library(xgboost)
library(shapr)
data("airquality")
data <- data.table::as.data.table(airquality)
data <- data[complete.cases(data), ]
x_var <- c("Solar.R", "Wind", "Temp", "Month")
y_var <- "Ozone"
ind_x_explain <- 1:6
x_train <- data[-ind_x_explain, ..x_var]
y_train <- data[-ind_x_explain, get(y_var)]
x_explain <- data[ind_x_explain, ..x_var]
# Looking at the dependence between the features
cor(x_train)
# Fitting a basic xgboost model to the training data
model <- xgboost(
data = as.matrix(x_train),
label = y_train,
  nrounds = 20,
verbose = FALSE
)
# Specifying the phi_0, i.e. the expected prediction without any features
p0 <- mean(y_train)
# Computing the actual Shapley values with kernelSHAP accounting for feature dependence using
# the empirical (conditional) distribution approach with bandwidth parameter sigma = 0.1 (default)
explanation <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = "empirical",
prediction_zero = p0
)
# Printing the Shapley values for the test data.
# For more information about the interpretation of the values in the table, see ?shapr::explain.
print(explanation$shapley_values)
# Finally we plot the resulting explanations
plot(explanation)
```
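The default plot is a bar plot of the Shapley values. The development version's `plot()` method also supports other plot types; to the best of our knowledge these include `"waterfall"`, `"scatter"` and `"beeswarm"`, but see `?shapr::plot.shapr` for the options available in your version:

```{r, eval = FALSE}
# A waterfall plot of the same explanations (plot type assumed available)
plot(explanation, plot_type = "waterfall")
```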
See the [vignette](https://norskregnesentral.github.io/shapr/articles/understanding_shapr.html) for further examples.
## Contribution
All feedback and suggestions are very welcome. Details on how to contribute can be found
[here](https://norskregnesentral.github.io/shapr/CONTRIBUTING.html). If you have any questions or comments, feel
free to open an issue [here](https://github.com/NorskRegnesentral/shapr/issues).
Please note that the 'shapr' project is released with a
[Contributor Code of Conduct](https://norskregnesentral.github.io/shapr/CODE_OF_CONDUCT.html).
By contributing to this project, you agree to abide by its terms.
## References