
[BUG] Feature importance not evaluated on test set when setting train_or_test = 'test' #126

Open
wilmar-thomas opened this issue Feb 5, 2024 · 0 comments
Labels
bug Something isn't working

Comments

@wilmar-thomas

Describe the bug

I don't know whether this is the intended behaviour, but when setting the train_or_test parameter to 'test', the data is first split into train and test sets and the model is then fit on the training set, cf. lines 302-310:

        if self.train_or_test.lower() == 'test':
            # keeping the same naming convenetion as to not add complexit later on
            self.X_boruta_train, self.X_boruta_test, self.y_train, self.y_test, self.w_train, self.w_test = train_test_split(self.X_boruta,
                                                                                                                                self.y,
                                                                                                                                self.sample_weight,
                                                                                                                                test_size=0.3,
                                                                                                                                random_state=self.random_state,
                                                                                                                                stratify=self.stratify)
            self.Train_model(self.X_boruta_train, self.y_train, sample_weight = self.w_train)

However, X_boruta_test is not used anywhere else; in fact, the whole dataset X is used to derive the feature importances, regardless of the chosen train_or_test, cf. lines 856 and 873 for importance_measure == 'shap':

    self.shap_values = np.array(explainer.shap_values(self.X_boruta))
    self.shap_values = explainer.shap_values(self.X_boruta)

and line 815 for importance_measure == 'perm':

    perm_importances_ = permutation_importance(self.model, self.X, self.y, scoring='f1')

While for SHAP this may not make a big difference (according to this post), it does not correspond to what is recommended here for permutation feature importance.

Granted, X and X_train are not exactly the same, but they still share 70% of the samples, so I'm wondering whether this is the intended behaviour. Could anyone provide some guidance on this?
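For illustration, here is a minimal, self-contained sketch of the pattern I would have expected when train_or_test = 'test' (a toy dataset and estimator for illustration only, not BorutaShap's actual code): the model is fit on the training split, and permutation importance is then evaluated on the held-out split.

    # Toy example: fit on the training split, score permutation importance on the
    # held-out split. The dataset and estimator here are placeholders.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

    # Importances are computed on data the model has never seen, so features that
    # only look useful because of overfitting get low scores.
    result = permutation_importance(model, X_test, y_test, scoring='f1', random_state=0)
    print(result.importances_mean)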

Thank you for your help.

@wilmar-thomas wilmar-thomas added the bug Something isn't working label Feb 5, 2024