Study notebooks - Comparison with Boruta #25
Merged
Context
This PR contains notebooks used to compare HSIC Lasso with Boruta.
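As background, Boruta's core idea — compare each real feature's importance against "shadow" copies obtained by independently permuting columns — can be sketched in a few lines. This is a simplified single-iteration illustration on hypothetical synthetic data, not the arfs implementation used in the notebooks (which adds repeated trials, statistical tests, and optional SHAP/permutation importances):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 5))
# Only features 0 and 1 carry signal.
y = X[:, 0] + 2.0 * X[:, 1] + 0.1 * rng.normal(size=n)

# Shadow features: each column permuted independently, destroying any
# relationship with y while preserving marginal distributions.
shadow = rng.permuted(X, axis=0)
X_boruta = np.hstack([X, shadow])

forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X_boruta, y)
imp = forest.feature_importances_

# A real feature is a candidate "hit" if it beats the best shadow.
threshold = imp[X.shape[1]:].max()
selected = np.flatnonzero(imp[: X.shape[1]] > threshold)
print(selected)  # features 0 and 1 should clear the shadow threshold
```

A single iteration like this can admit a spurious extra feature; the full algorithm repeats the procedure many times and applies a binomial test to the hit counts, which is where the robustness (and the slowness noted below) comes from.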
More precisely, in the directory `notebooks/study/` you will find `nonlinear.ipynb` and `ensemble.ipynb`.

`nonlinear.ipynb` continues the example that exposed the superiority of HSIC Lasso over `sklearn.feature_selection.mutual_info_regression`. It shows that Boruta is capable of performing the right selection too. Notice, however, that Boruta is slower and can only handle a one-dimensional target, whereas HSIC Lasso can also be used for multi-dimensional targets. `ensemble.ipynb`
contains the example that exposed the weakness of HSIC Lasso, and the reason why my investigation into feature selection is still ongoing (with the MINE-based approach). It is a classification task with only categorical features. None of `sklearn.feature_selection.f_classif`, `sklearn.feature_selection.mutual_info_classif`, and `hisel.select` gives a good selection. I have observed a few runs where `boruta` gives good selections, but they are not robust, and good runs cannot be distinguished from bad ones without comparing the results against the ground truth.

For the implementation of Boruta, we rely on arfs. This is not the "official" implementation of scikit-learn-contrib/boruta_py, but it is more advanced and better performing. The author of arfs is also the author of the PR to scikit-learn-contrib/boruta_py that proposes these advances to the "official" implementation, i.e. "Implements sample_weight and optional permutation and SHAP importance, categorical features, boxplot" #100. Besides the superiority of arfs's implementation, we decided not to use scikit-learn-contrib/boruta_py because its numpy version is incompatible with the version used by `hisel`.
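For reference, the univariate baselines mentioned above can be scored as follows. The data here is a hypothetical XOR-style categorical task, not the notebook's actual dataset; it is chosen because interaction effects between categorical features are precisely where univariate scores break down:

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.default_rng(0)
n = 1000
# Six categorical features encoded as integer codes in {0, 1, 2, 3}.
X = rng.integers(0, 4, size=(n, 6))
# The target depends on features 0 and 1 jointly (XOR of their parities),
# so neither feature is informative on its own.
y = (X[:, 0] % 2) ^ (X[:, 1] % 2)

f_scores, _ = f_classif(X, y)
mi_scores = mutual_info_classif(X, y, discrete_features=True, random_state=0)

# Every univariate score stays near the noise floor, including those of
# the two features that jointly determine y.
print(np.round(mi_scores, 3))
```

This is the failure mode that motivates multivariate selectors such as Boruta and HSIC Lasso: a selector that looks at one feature at a time cannot see the interaction at all.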
Checklist