Study notebooks - Comparison with Boruta #25

Merged 7 commits on May 2, 2023

@claudio-tw claudio-tw commented May 1, 2023

Context

This PR contains notebooks used to compare HSIC Lasso with Boruta.
More precisely, in the directory notebooks/study/ you will find nonlinear.ipynb and ensemble.ipynb.

nonlinear.ipynb continues the example that showed HSIC Lasso outperforming sklearn.feature_selection.mutual_info_regression. It shows that Boruta is also capable of making the right selection. Notice, however, that Boruta is slower and can only handle a 1D target, whereas HSIC Lasso can also be used for multi-dimensional targets.
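As background on why a kernel-based dependence measure suits nonlinear relationships, here is a minimal sketch of the biased empirical HSIC estimator on toy data. It is not the hisel implementation and does not reproduce the notebook: the data (`y = x**2` on a symmetric grid), the function names, and the kernel bandwidths are all assumptions chosen for illustration.

```python
import math
import random


def gaussian_gram(values, sigma):
    """Gram matrix of the Gaussian kernel exp(-(a - b)^2 / (2 sigma^2))."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2)) for b in values]
            for a in values]


def center(gram):
    """Double-center a symmetric Gram matrix: H K H with H = I - 11^T / n."""
    n = len(gram)
    row_means = [sum(row) / n for row in gram]
    total_mean = sum(row_means) / n
    return [[gram[i][j] - row_means[i] - row_means[j] + total_mean
             for j in range(n)] for i in range(n)]


def hsic(xs, ys, sigma_x=0.5, sigma_y=0.25):
    """Biased empirical HSIC, trace(K H L H) / n^2.

    Some references normalise by (n - 1)^2 instead. The bandwidths are
    arbitrary illustrative defaults, not tuned values.
    """
    n = len(xs)
    kc = center(gaussian_gram(xs, sigma_x))
    lc = center(gaussian_gram(ys, sigma_y))
    return sum(kc[i][j] * lc[i][j] for i in range(n) for j in range(n)) / n ** 2


def pearson(xs, ys):
    """Pearson correlation coefficient, a purely linear dependence measure."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = math.sqrt(sum((a - mx) ** 2 for a in xs))
    sy = math.sqrt(sum((b - my) ** 2 for b in ys))
    return cov / (sx * sy)


# Toy data (assumption): y depends on x only through x^2.
n = 201
xs = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
ys = [x ** 2 for x in xs]

# Shuffling y breaks the pairing with x and serves as a "no dependence" baseline.
ys_shuffled = ys[:]
random.Random(0).shuffle(ys_shuffled)

linear_score = pearson(xs, ys)
hsic_dependent = hsic(xs, ys)
hsic_shuffled = hsic(xs, ys_shuffled)
print(f"pearson={linear_score:.3g} "
      f"hsic(x, x^2)={hsic_dependent:.4f} hsic(x, shuffled)={hsic_shuffled:.4f}")
```

On this toy data the linear correlation vanishes by symmetry, while HSIC on the true pairing is clearly larger than on the shuffled baseline.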

ensemble.ipynb contains the example that exposed the weakness of HSIC Lasso, which is the reason why my investigation into feature selection is still ongoing (with the MINE-based approach). It is a classification task with only categorical features. None of sklearn.feature_selection.f_classif, sklearn.feature_selection.mutual_info_classif, and hisel.select gives good selections. I have observed a few runs where Boruta gives good selections, but they are not robust, and good runs cannot be distinguished from bad runs without comparing the results to the ground truth.
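This description does not say how the data in ensemble.ipynb is generated. As a general illustration only, one standard situation in which per-feature scores fail on categorical data is a label that depends on features jointly but on none of them individually. The sketch below uses a hypothetical XOR label and a plug-in mutual information estimate; the data and function name are assumptions, not taken from the notebook.

```python
import math
from collections import Counter
from itertools import product


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    # p(x, y) / (p(x) p(y)) simplifies to count * n / (count_x * count_y).
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


# Hypothetical balanced dataset: two binary features, label = a XOR b.
rows = list(product([0, 1], repeat=2)) * 25
a_col = [a for a, _ in rows]
b_col = [b for _, b in rows]
labels = [a ^ b for a, b in rows]

mi_a = mutual_information(a_col, labels)      # feature a alone
mi_b = mutual_information(b_col, labels)      # feature b alone
mi_joint = mutual_information(rows, labels)   # the pair (a, b) together
print(mi_a, mi_b, mi_joint)  # 0.0 0.0 1.0
```

Each feature on its own carries zero information about the label, yet the pair determines it completely, so any selector that scores features one at a time would rank both as irrelevant.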

For the implementation of Boruta, we rely on arfs. This is not the "official" implementation in scikit-learn-contrib/boruta_py, but it is more advanced and performs better. The author of arfs is also the author of the PR to scikit-learn-contrib/boruta_py that proposes these advances for the "official" implementation, i.e. "Implements sample_weight and optional permutation and SHAP importance, categorical features, boxplot" (#100). Besides the superiority of arfs's implementation, we decided not to use scikit-learn-contrib/boruta_py because its numpy version is incompatible with the version used by hisel.
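For readers unfamiliar with Boruta, the core idea is to compare each real feature against shuffled "shadow" copies and to keep only the features that repeatedly beat the best shadow. The sketch below is a heavily simplified illustration of that loop, not the arfs or boruta_py API. Absolute Pearson correlation is swapped in for the tree-based (or SHAP/permutation) importance, a fixed hit threshold replaces the statistical test, and the real scores are computed once rather than re-estimated in every iteration. All names and the toy data are assumptions.

```python
import math
import random


def abs_corr(xs, ys):
    """Absolute Pearson correlation, a stand-in for a model-based importance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = math.sqrt(sum((a - mx) ** 2 for a in xs))
    sy = math.sqrt(sum((b - my) ** 2 for b in ys))
    return abs(cov / (sx * sy))


def shadow_selection(columns, target, n_trials=20, min_hits=15, seed=0):
    """Keep features whose score beats the best shadow in at least min_hits trials."""
    rng = random.Random(seed)
    real_scores = [abs_corr(col, target) for col in columns]
    hits = [0] * len(columns)
    for _ in range(n_trials):
        # Shadows: each column shuffled, which destroys any link to the target.
        shadow_scores = []
        for col in columns:
            shadow = col[:]
            rng.shuffle(shadow)
            shadow_scores.append(abs_corr(shadow, target))
        threshold = max(shadow_scores)
        for j, score in enumerate(real_scores):
            if score > threshold:
                hits[j] += 1
    selected = [j for j, h in enumerate(hits) if h >= min_hits]
    return selected, hits


# Toy data (assumption): only column 0 drives the target.
data_rng = random.Random(1)
n_samples = 200
columns = [[data_rng.gauss(0.0, 1.0) for _ in range(n_samples)] for _ in range(4)]
target = [2.0 * x0 + data_rng.gauss(0.0, 1.0) for x0 in columns[0]]

selected, hits = shadow_selection(columns, target)
print("hits per feature:", hits)
print("selected features:", selected)
```

Because the shadows are reshuffled in every trial, the outcome for weakly relevant or irrelevant features can change from run to run.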

Checklist

@claudio-tw claudio-tw added the change:standard Not an emergency or impactful change label May 1, 2023
@claudio-tw claudio-tw requested a review from AlxdrPolyakov as a code owner May 1, 2023 18:00
@AlxdrPolyakov AlxdrPolyakov merged commit 29bb1b8 into trunk May 2, 2023
@AlxdrPolyakov AlxdrPolyakov deleted the arfs-comparison branch May 2, 2023 08:40