Feature selection is the machine learning task of selecting from a data set the features that are relevant for the prediction of a given target.
The hisel package provides feature selection methods based on the Hilbert-Schmidt Independence Criterion (HSIC).
In particular, it provides an implementation of the HSIC Lasso algorithm of Yamada, M. et al. (2012).
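For orientation, the optimisation problem usually associated with HSIC Lasso can be sketched as below. The notation is an assumption made for this illustration and is not taken from hisel's code: d is the number of features, n the number of samples, K^(k) the Gram matrix of the k-th feature, L the Gram matrix of the target, and lambda the regularisation strength. Features whose coefficient alpha_k is non-zero at the optimum are the selected ones.

```latex
\min_{\alpha \in \mathbb{R}^{d},\ \alpha \ge 0}\;
\frac{1}{2} \Bigl\| \bar{L} - \sum_{k=1}^{d} \alpha_k \bar{K}^{(k)} \Bigr\|_{\mathrm{F}}^{2}
+ \lambda \, \|\alpha\|_{1},
\qquad
\bar{K}^{(k)} = \Gamma K^{(k)} \Gamma,\quad
\bar{L} = \Gamma L \Gamma,\quad
\Gamma = I_n - \tfrac{1}{n} \mathbf{1}_n \mathbf{1}_n^{\top}
```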
HSIC Lasso is an excellent algorithm for feature selection, able to capture nonlinear dependence between features and target.
This makes hisel an accurate tool for your machine learning modelling.
Moreover, hisel implements clever routines that address common causes of poor accuracy in other feature selection methods.
Examples of where hisel outperforms the methods in sklearn.feature_selection are given in the notebooks ensemble-example.ipynb and nonlinear-transform.ipynb.
A crucial step in the HSIC Lasso algorithm is the computation of certain Gram matrices.
hisel implements such computations in a highly vectorised and performant way.
Moreover, hisel allows you to accelerate these computations using a GPU.
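To give a feel for what a vectorised Gram matrix computation looks like, here is a minimal sketch. It is not hisel's implementation: the function names, the choice of a Gaussian kernel and the `xp` array-module parameter are all assumptions made for illustration. With `xp=numpy` the code runs on CPU; since CuPy mirrors much of the NumPy API, passing `cupy` as `xp` would be one way to run the same code on a GPU.

```python
import numpy as np


def gaussian_gram(x, sigma=1.0, xp=np):
    """Gaussian-kernel Gram matrix of the rows of x, without Python loops.

    `xp` is the array module (hypothetical parameter): numpy by default,
    cupy if the computation should run on a GPU.
    """
    x = xp.asarray(x, dtype=float)
    x = x.reshape(x.shape[0], -1)  # treat 1-D input as n samples of 1 feature
    sq_norms = xp.sum(x * x, axis=1)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, evaluated for all pairs at once
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    sq_dists = xp.maximum(sq_dists, 0.0)  # clip tiny negatives from round-off
    return xp.exp(-sq_dists / (2.0 * sigma ** 2))


def centre(gram, xp=np):
    """Double-centre a Gram matrix: H @ K @ H with H = I - (1/n) * ones."""
    n = gram.shape[0]
    h = xp.eye(n) - xp.ones((n, n)) / n
    return h @ gram @ h


def hsic(x, y, sigma=1.0, xp=np):
    """Biased empirical HSIC estimate, trace(Kc @ Lc) / (n - 1)^2."""
    k = centre(gaussian_gram(x, sigma, xp), xp)
    l = centre(gaussian_gram(y, sigma, xp), xp)
    n = k.shape[0]
    # For symmetric matrices, trace(Kc @ Lc) equals the sum of the
    # elementwise product, which avoids an n x n matrix multiplication.
    return float(xp.sum(k * l)) / (n - 1) ** 2
```

The pairwise distances are obtained from one matrix product and broadcasting rather than from a double loop over samples, which is what makes this kind of computation a good fit for a GPU.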
The image below compares the average run time of the computation of Gram matrices via hisel on CPU, via hisel on GPU, and via pyHSICLasso.
The run times were measured on the Gram matrices that HSIC Lasso requires when selecting from a dataset of 300 features, with the number of samples reported on the x-axis.
Getting started with hisel is as straightforward as the following code snippet:
>>> import pandas as pd
>>> import hisel
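>>> df = pd.read_csv('mydata.csv')
>>> xdf = df.iloc[:, :-1]  # features: every column except the last
>>> yser = df.iloc[:, -1]  # target: the last column
>>> hisel.feature_selection.select_features(xdf, yser)  # names of the selected features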
['d2', 'd7', 'c3', 'c10', 'c12', 'c24', 'c22', 'c21', 'c5']
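If you do not have a data file at hand, a synthetic one with the layout the snippet expects (features in all columns but the last, target in the last column) could be generated along these lines. This is only a sketch: the helper name, the column names c0, c1, ... and the formula for the target are made up for illustration.

```python
import csv
import math
import random


def write_synthetic_csv(path, n_samples=500, n_features=10, seed=0):
    """Write a CSV whose last column 'y' depends only on features c0 and c1.

    Hypothetical helper for trying out the snippet; requires n_features >= 2.
    """
    rng = random.Random(seed)
    header = [f"c{i}" for i in range(n_features)] + ["y"]
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for _ in range(n_samples):
            row = [rng.uniform(-1.0, 1.0) for _ in range(n_features)]
            # Nonlinear target: only the first two features are relevant
            target = math.sin(math.pi * row[0]) + row[1] ** 2
            writer.writerow(row + [target])
    return header
```

Calling `write_synthetic_csv('mydata.csv')` would then produce a file that can be read with `pd.read_csv` as in the snippet.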
If you are not interested in further details, you can stop reading here.
If you would like to learn how to tune the hyper-parameters used by hisel, or how to have more advanced control over hisel's selection, please browse the examples in examples/ and in notebooks.
The package hisel is available from PyPI.
You can install it via pip:
pip install hisel
If you want to install the extra support for GPU computations, you can do
pip install hisel[cudaXXX]
where cudaXXX is one of the following:
cuda102 if you have version 10.2 of cuda-toolkit;
cuda110 if you have version 11.0 of cuda-toolkit;
cuda111 if you have version 11.1 of cuda-toolkit;
cuda11x if you have version 11.2 - 11.8 of cuda-toolkit;
cuda12x if you have version 12.x of cuda-toolkit.
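The mapping from a cuda-toolkit version to the name of the extra can also be written down as a small function. The helper below is hypothetical and not part of hisel; it only encodes the list above.

```python
def cuda_extra(version: str) -> str:
    """Map a cuda-toolkit version string such as "11.4" to the extra name."""
    parts = version.strip().split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    if (major, minor) == (10, 2):
        return "cuda102"
    if (major, minor) == (11, 0):
        return "cuda110"
    if (major, minor) == (11, 1):
        return "cuda111"
    if major == 11 and 2 <= minor <= 8:
        return "cuda11x"
    if major == 12:
        return "cuda12x"
    # Versions outside the list above have no matching extra
    raise ValueError(f"no extra listed for cuda-toolkit {version}")
```

For example, `cuda_extra("11.4")` returns `"cuda11x"`, so the corresponding command would be `pip install hisel[cuda11x]`.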
To install from source, check out the repo and navigate to the root directory. Then run
poetry install
For GPU support, you need to have cuda-toolkit installed and you need to know its version. To find it out, you can run
nvidia-smi
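and read the CUDA version from the top right corner of the table that is printed out. Note that nvidia-smi reports the highest CUDA version supported by the installed driver; nvcc --version reports the version of the installed toolkit itself.
Once you know your version of CUDA, run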
poetry install -E cudaXXX
where cudaXXX is one of the following:
cuda102 if you have version 10.2;
cuda110 if you have version 11.0;
cuda111 if you have version 11.1;
cuda11x if you have version 11.2 - 11.8;
cuda12x if you have version 12.x.
These names follow the package naming used in the installation guide of CuPy.