To understand how shap.prep.stack.data() creates its groups, I tried to reproduce the grouping with stats::hclust() and cutree(). When passed the SHAP values from a model, stats::hclust() and cutree() give a different number of samples in each group than shap.prep.stack.data() does. However, they give the same number of samples per group as shap.prep.stack.data() when a sequential row ID column is added and included as a clustering variable. Please see the reproducible example below.
library(SHAPforxgboost)
library(xgboost)
library(DALEX)
library(caret)
# use apartments data from DALEX
data("apartments")
head(apartments)
dummy <- dummyVars(" ~ .", data=apartments)
final_df <- data.frame(predict(dummy, newdata=apartments))
head(final_df)
X1 = as.matrix(final_df[,-1])
mod1 = xgboost::xgboost(
  data = X1, label = apartments$m2.price, gamma = 0, eta = 1,
  lambda = 0, nrounds = 1, verbose = FALSE)
shap_values <- shap.values(xgb_model = mod1, X_train = X1)
shap_values$mean_shap_score
shap_values_appts <- shap_values$shap_score
plot_data <- shap.prep.stack.data(shap_contrib = shap_values_appts,
                                  n_groups = 4)
summary(as.factor(plot_data$group))
#1 2 3 4
#606 92 215 87
# calculate clusters with hclust() as is done internally to shap.prep.stack.data
# include the scaling that shap.prep.stack.data performs
h <- hclust(dist(scale(shap_values_appts)), method = "ward.D")
groups <- cutree(h, 4)
summary(as.factor(groups))
#1 2 3 4
#307 336 270 8
# add row ID column to shap values data frame and recalculate
# the number of samples in each group now matches shap.prep.stack.data()
# (the group labels are just shuffled)
shap_values_appts_id <- shap_values_appts
shap_values_appts_id$ID <- seq(1, nrow(shap_values_appts_id))
h2 <- hclust(dist(scale(shap_values_appts_id)), method = "ward.D")
groups2 <- cutree(h2, 4)
summary(as.factor(groups2))
#1 2 3 4
#215 606 87 92
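One possible reading of the result above is that a row ID column takes part in the distance calculation. The sketch below uses synthetic data to show that appending a sequential ID changes the pairwise distances hclust() works from. The `shap_like` data frame and its columns are made up for illustration and are not taken from the apartments model. It only illustrates the effect; it does not confirm what shap.prep.stack.data() does internally.

```r
# Synthetic stand-in for a SHAP matrix (hypothetical data, for illustration only)
set.seed(1)
shap_like <- data.frame(a = rnorm(200), b = rnorm(200), c = rnorm(200))

# Distances computed on the SHAP-like columns only
d_plain <- dist(scale(shap_like))

# Distances computed after appending a sequential row ID
with_id <- shap_like
with_id$ID <- seq_len(nrow(with_id))
d_id <- dist(scale(with_id))

# The scaled ID column adds a non-negative term to every squared distance,
# so rows that are far apart in the data frame are pushed further apart
max(as.matrix(d_id) - as.matrix(d_plain))

# Compare the group sizes from both clusterings
groups_plain <- cutree(hclust(d_plain, method = "ward.D"), 4)
groups_id <- cutree(hclust(d_id, method = "ward.D"), 4)
table(groups_plain)
table(groups_id)
```

If the two tables differ here as they do in the example above, the ID column is influencing the clustering, even though it carries no information about the model.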