GPU support - Installation extras of CuPy and Implementation of Cuda kernels (#41)

* GPU support - Installation extras of CuPy

* cupy dependencies

* bump hisel version

* Cuda kernels

* Examples and profilers

* Error 137 in tests

* Error 137 in tests

* Files for README
claudio-tw authored Aug 23, 2023
1 parent e36a562 commit 76142ba
Showing 17 changed files with 1,287 additions and 381 deletions.
123 changes: 112 additions & 11 deletions README.md
@@ -1,25 +1,126 @@
# hisel
Feature selection tool based on Hilbert-Schmidt Independence Criterion
# HISEL
## Feature selection tool based on Hilbert-Schmidt Independence Criterion
Feature selection is
the machine learning
task
of selecting from a data set
the features
that are relevant
for the prediction of a given target.
The `hisel` package
provides feature selection methods
based on
Hilbert-Schmidt Independence Criterion.
In particular,
it provides an implementation of the HSIC Lasso algorithm of
[Yamada, M. et al. (2012)](https://arxiv.org/abs/1202.0515).

## Why is `hisel` cool?

#### `hisel` is accurate
HSIC Lasso is an excellent algorithm for feature selection.
This makes `hisel` an accurate tool in your machine learning modelling.
Moreover,
`hisel` implements clever routines
that address common causes of poor accuracy in other feature selection methods.

Examples of where `hisel` outperforms the methods in
[sklearn.feature\_selection](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection)
are given in the notebooks
`ensemble-example.ipynb`
and
`nonlinear-transform.ipynb`.


#### `hisel` is fast
A crucial step in the HSIC Lasso algorithm
is the computation of
certain Gram matrices.
`hisel` implements such computations
in a highly vectorised and performant way.
Moreover,
`hisel` allows you to
accelerate these computations
using a GPU.
The image below shows
the average run time
of the computations
of Gram matrices
via
`hisel` on CPU,
via
`hisel` on GPU,
and
via
[pyHSICLasso](https://pypi.org/project/pyHSICLasso/).

![gramtimes](gramtimes.png)
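
To make the Gram-matrix step concrete, here is a minimal NumPy sketch of a vectorised Gaussian Gram matrix and of the empirical HSIC statistic built from two such matrices. It is an illustration only and not `hisel`'s implementation: the function names `gaussian_gram`, `centre` and `hsic`, the choice of a Gaussian kernel, and the `bandwidth` parameter are all assumptions made for this example.

```python
import numpy as np


def gaussian_gram(x, bandwidth=1.0):
    """Gaussian (RBF) Gram matrix of the rows of `x`, computed without Python loops."""
    x = np.asarray(x, dtype=float)
    x = x.reshape(len(x), -1)  # treat a 1-D input as n samples of one feature
    sq_norms = np.sum(x * x, axis=1)
    # Pairwise squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    np.maximum(sq_dists, 0.0, out=sq_dists)  # clip small negatives from round-off
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))


def centre(gram):
    """Double-centre a Gram matrix, i.e. compute H K H with H = I - (1/n) 1 1^T."""
    row_mean = gram.mean(axis=0, keepdims=True)
    col_mean = gram.mean(axis=1, keepdims=True)
    return gram - row_mean - col_mean + gram.mean()


def hsic(x, y, bandwidth=1.0):
    """Biased empirical HSIC estimate, trace(K H L H) / (n - 1)^2."""
    n = len(x)
    kc = centre(gaussian_gram(x, bandwidth))
    lc = centre(gaussian_gram(y, bandwidth))
    # For symmetric centred matrices, trace(kc @ lc) equals the elementwise sum.
    return float(np.sum(kc * lc)) / (n - 1) ** 2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feature = rng.normal(size=300)
    print("dependent target:  ", hsic(feature, feature ** 2))
    print("independent target:", hsic(feature, rng.normal(size=300)))
```

The cost is dominated by the `n x n` matrix product and the elementwise exponential, which is the kind of dense array work that tends to benefit from a GPU. CuPy exposes a largely NumPy-compatible array API, so a sketch like this one could plausibly be ported by swapping the array module, although how `hisel` actually organises its kernels is not shown here.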


#### `hisel` has a friendly user interface

Getting started with `hisel` is as straightforward as the following code snippet:
```
>>> import pandas as pd
>>> import hisel
>>> df = pd.read_csv('mydata.csv')
>>> xdf = df.iloc[:, :-1]
>>> yser = df.iloc[:, -1]
>>> hisel.feature_selection.select_features(xdf, yser)
['d2', 'd7', 'c3', 'c10', 'c12', 'c24', 'c22', 'c21', 'c5']
```
If you are not interested in more details,
please read no further.
If you would like to
learn more about
how to tune the hyper-parameters used by `hisel`
or
how to gain finer control over `hisel`'s selection,
please browse the examples in
[examples/](https://github.com/transferwise/hisel/tree/trunk/examples)
and in
[notebooks](https://github.com/transferwise/hisel/tree/trunk/notebooks).

This package provides an implementation of the HSIC Lasso of [Yamada, M. et al. (2012)](https://arxiv.org/abs/1202.0515).

Usage is demonstrated in the notebooks and in the scripts available under `examples/`.


## Installation

### Install via `pip`

The package `hisel` is available from `arti`. You can install it via `pip`.
While on the Wise-VPN, in the environment where you intend to use `hisel`, just do
```
pip install hisel --index-url=https://arti.tw.ee/artifactory/api/pypi/pypi-virtual/simple
```

### Install from source

#### Basic installation:
Check out the repo and navigate to its root directory. Then,
```
poetry install
```


#### Installation with GPU support
You need to have the CUDA toolkit installed, and you need to know its version.
To find it, you can run
```
nvidia-smi
```
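and read the CUDA version from the top right corner of the table that is printed out.
Note that `nvidia-smi` reports the highest CUDA version supported by the installed driver; `nvcc --version` reports the version of the installed toolkit itself.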
Once you know your version of `cuda`, do
```
poetry install -E cudaXXX
```
where `cudaXXX` is one of the following:
`cuda102` if you have version 10.2;
`cuda110` if you have version 11.0;
`cuda111` if you have version 11.1;
`cuda11x` if you have version 11.2 - 11.8;
`cuda12x` if you have version 12.x.
This aligns to the [installation guide of CuPy](https://docs.cupy.dev/en/stable/install.html#installing-cupy).
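
The mapping from CUDA version to extra can be written down as a small helper. The sketch below only encodes the list above; the function name `poetry_extra_for_cuda` is hypothetical and is not part of `hisel` or CuPy.

```python
def poetry_extra_for_cuda(version):
    """Map a CUDA version string such as '11.8' to the extra listed above."""
    parts = version.strip().split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    if (major, minor) == (10, 2):
        return "cuda102"
    if (major, minor) == (11, 0):
        return "cuda110"
    if (major, minor) == (11, 1):
        return "cuda111"
    if major == 11 and 2 <= minor <= 8:
        return "cuda11x"
    if major == 12:
        return "cuda12x"
    # Versions outside the list above have no matching extra.
    raise ValueError(f"No extra is listed for CUDA {version}")


if __name__ == "__main__":
    print(f"poetry install -E {poetry_extra_for_cuda('11.8')}")
```

Raising an error for unlisted versions, rather than guessing the nearest extra, keeps a mismatched CuPy build from being installed silently.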


## Why is this cool?

Examples of where `hisel` outperforms the methods in
[sklearn.feature\_selection](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection)
are given in the notebooks
`ensemble-example.ipynb`
and
`nonlinear-transform.ipynb`.
1 change: 1 addition & 0 deletions examples/feature_selection.py
@@ -3,6 +3,7 @@


def main():
    # Minimal example of `hisel` usage
    df = pd.read_csv('mydata.csv')
    xdf = df.iloc[:, :-1]
    yser = df.iloc[:, -1]
35 changes: 35 additions & 0 deletions examples/minimal_with_params.py
@@ -0,0 +1,35 @@
import pandas as pd
import hisel


def main():
    # Minimal example of `hisel` usage with specification of parameters
    df = pd.read_csv('mydata.csv')
    xdf = df.iloc[:, :-1]  # all columns but the last are features
    yser = df.iloc[:, -1]  # the last column is the target
    categorical_search_parameters = hisel.feature_selection.SearchParameters(
        num_permutations=1,
        im_ratio=.03,
        max_iter=2,
        parallel=True,
        random_state=None,
    )
    hsiclasso_parameters = hisel.feature_selection.HSICLassoParameters(
        mi_threshold=.00001,
        hsic_threshold=0.005,
        batch_size=5000,
        minibatch_size=500,
        number_of_epochs=3,
        use_preselection=True,
        device=hisel.kernels.Device.CPU  # if cuda is available you can pass GPU
    )
    results = hisel.feature_selection.select_features(
        xdf, yser, hsiclasso_parameters, categorical_search_parameters)
    print('\n\n##########################################################')
    print(
        f'The following features are relevant for the prediction of {yser.name}:')
    print(f'{results.selected_features}')


if __name__ == '__main__':
    main()
Binary file added gramtimes.png