You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I don't know if this is the intended behaviour or not, but when setting train_or_test parameter to 'test', data is first split into train/test and then fit on training set, cf lines 302-310:
if self.train_or_test.lower() == 'test':
# keeping the same naming convenetion as to not add complexit later on
self.X_boruta_train, self.X_boruta_test, self.y_train, self.y_test, self.w_train, self.w_test = train_test_split(self.X_boruta,
self.y,
self.sample_weight,
test_size=0.3,
random_state=self.random_state,
stratify=self.stratify)
self.Train_model(self.X_boruta_train, self.y_train, sample_weight = self.w_train)
However, X_boruta_test is not used anywhere else, in fact the whole dataset X is used to derive feature importance, regardless of chosen train_or_test, cf. lines 856 and 873 for importance_measure == 'shap' :
While for SHAP this may not constitute a big difference, according to this post, this does not correspond to what is recommended here for permutation feature importance.
Granted X and X_train are not exactly the same but still share 70% of the samples so I'm wondering if this is the intended behaviour. Could anyone provide some guidance on this?
Thank you for your help.
The text was updated successfully, but these errors were encountered:
Describe the bug
I don't know if this is the intended behaviour or not, but when setting
train_or_test
parameter to 'test', data is first split into train/test and then fit on training set, cf lines 302-310:However,
X_boruta_test
is not used anywhere else, in fact the whole datasetX
is used to derive feature importance, regardless of chosentrain_or_test
, cf. lines 856 and 873 forimportance_measure == 'shap'
:and line 815 for
importance_measure == 'perm'
:While for SHAP this may not constitute a big difference, according to this post, this does not correspond to what is recommended here for permutation feature importance.
Granted X and X_train are not exactly the same but still share 70% of the samples so I'm wondering if this is the intended behaviour. Could anyone provide some guidance on this?
Thank you for your help.
The text was updated successfully, but these errors were encountered: