Study notebooks - Comparison with Boruta #25
Merged
Context
This PR contains notebooks used to compare HSIC Lasso with Boruta.
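As background, Boruta's core idea — compare each real feature's importance against "shadow" copies obtained by independently permuting columns — can be sketched in a few lines. This is a simplified single-iteration illustration on hypothetical synthetic data, not the arfs implementation used in the notebooks (which adds repeated trials, statistical tests, and optional SHAP/permutation importances):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 5))
# Only features 0 and 1 carry signal.
y = X[:, 0] + 2.0 * X[:, 1] + 0.1 * rng.normal(size=n)

# Shadow features: each column permuted independently, destroying any
# relationship with y while preserving marginal distributions.
shadow = rng.permuted(X, axis=0)
X_boruta = np.hstack([X, shadow])

forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X_boruta, y)
imp = forest.feature_importances_

# A real feature is a candidate "hit" if it beats the best shadow.
threshold = imp[X.shape[1]:].max()
selected = np.flatnonzero(imp[: X.shape[1]] > threshold)
print(selected)  # features 0 and 1 should clear the shadow threshold
```

A single iteration like this can admit a spurious extra feature; the full algorithm repeats the procedure many times and applies a binomial test to the hit counts, which is where the robustness (and the slowness noted below) comes from.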
More precisely, in the directory `notebooks/study/` you will find `nonlinear.ipynb` and `ensemble.ipynb`.

`nonlinear.ipynb` continues the example that exposed the superiority of HSIC Lasso over `sklearn.feature_selection.mutual_info_regression`. It shows that Boruta is capable of performing the right selection too. Notice, however, that Boruta is slower and can only handle a one-dimensional target, whereas HSIC Lasso can also be used for multi-dimensional targets. `ensemble.ipynb`
contains the example that exposed the weakness of HSIC Lasso, and the reason why my investigation into feature selection is still ongoing (with the MINE-based approach). It is a classification task with only categorical features. None of `sklearn.feature_selection.f_classif`, `sklearn.feature_selection.mutual_info_classif`, and `hisel.select` gives a good selection. I have observed a few runs where `boruta` gives good selections, but they are not robust, and good runs cannot be distinguished from bad ones without comparing the results against the ground truth.

For the implementation of Boruta, we rely on arfs. This is not the "official" implementation of scikit-learn-contrib/boruta_py, but it is more advanced and better performing. The author of arfs is also the author of the PR to scikit-learn-contrib/boruta_py that proposes these advances to the "official" implementation, i.e. "Implements sample_weight and optional permutation and SHAP importance, categorical features, boxplot" #100. Besides the superiority of arfs's implementation, we decided not to use scikit-learn-contrib/boruta_py because its numpy version is incompatible with the version used by `hisel`.
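For reference, the univariate baselines mentioned above can be scored as follows. The data here is a hypothetical XOR-style categorical task, not the notebook's actual dataset; it is chosen because interaction effects between categorical features are precisely where univariate scores break down:

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.default_rng(0)
n = 1000
# Six categorical features encoded as integer codes in {0, 1, 2, 3}.
X = rng.integers(0, 4, size=(n, 6))
# The target depends on features 0 and 1 jointly (XOR of their parities),
# so neither feature is informative on its own.
y = (X[:, 0] % 2) ^ (X[:, 1] % 2)

f_scores, _ = f_classif(X, y)
mi_scores = mutual_info_classif(X, y, discrete_features=True, random_state=0)

# Every univariate score stays near the noise floor, including those of
# the two features that jointly determine y.
print(np.round(mi_scores, 3))
```

This is the failure mode that motivates multivariate selectors such as Boruta and HSIC Lasso: a selector that looks at one feature at a time cannot see the interaction at all.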
Checklist