subSAGE is a Shapley-value-based framework for inferring feature importance in high-dimensional data. It builds on SAGE (Shapley Additive Global importancE), adjusted for high-dimensional data. We also demonstrate how to perform paired bootstrapping to estimate confidence intervals. In particular, we investigate subSAGE applied to tree ensemble models. We emphasize the importance of computing subSAGE on independent test data that was not used to train the model.
Preprint is available here.
Given an xgboost model, test data, and a feature of interest, the subSAGE estimate can be computed in R as:
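library(xgboost)     # provides xgb.model.dt.tree()
library(data.table)  # provides as.data.table()
source("~/subSAGE/subSAGE.R")
# Parse the fitted boosted trees into a data.table
t = xgb.model.dt.tree(model = model)
trees = as.data.table(xgboost.trees(xgb_model = model, data = data, recalculate = FALSE))
# subSAGE estimate for `feature`, evaluated on the test data
estimate = subSage_cpp(data, trees, feature, loss = "RMSE")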
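The bootstrapping mentioned above could be sketched roughly as follows. This is a minimal illustration, not the authors' procedure. It assumes that "paired" means resampling whole test rows, so that each observation's features and response stay together, and it uses a simple percentile interval. The names bootstrap_ci, estimator and toy_importance are hypothetical. estimator stands in for a call such as subSage_cpp(data, trees, feature, loss = "RMSE"), and the toy data are simulated only so that the sketch runs on its own.

```r
# Percentile bootstrap over test rows (illustrative sketch only).
# `estimator` is a hypothetical function(data) returning one number;
# `data` is assumed to be a data.frame or matrix of test observations.
bootstrap_ci = function(data, estimator, B = 1000, level = 0.95) {
  n = nrow(data)
  boot = numeric(B)
  for (b in seq_len(B)) {
    idx = sample.int(n, n, replace = TRUE)          # resample whole rows
    boot[b] = estimator(data[idx, , drop = FALSE])  # re-estimate on the resample
  }
  alpha = 1 - level
  list(estimate = estimator(data),
       lower = unname(quantile(boot, alpha / 2)),
       upper = unname(quantile(boot, 1 - alpha / 2)),
       replicates = boot)
}

# Toy illustration on simulated data (not subSAGE itself):
# reduction in RMSE from predicting y with 2 * x instead of the mean of y.
set.seed(1)
toy = data.frame(x = rnorm(200))
toy$y = 2 * toy$x + rnorm(200)
toy_importance = function(d) {
  sqrt(mean((d$y - mean(d$y))^2)) - sqrt(mean((d$y - 2 * d$x)^2))
}
ci = bootstrap_ci(toy, toy_importance, B = 500)
```

In practice the estimator would wrap the subSAGE call on the resampled test rows, with the fitted model and trees held fixed. Since each replicate recomputes the estimate, B trades run time against the stability of the interval.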