Merge pull request #104 from EpistasisLab/dev
Dev
nickotto authored Oct 13, 2023
2 parents 07a187a + d7ff57f commit 031ae09
Showing 14 changed files with 479 additions and 80 deletions.
4 changes: 3 additions & 1 deletion .gitignore
@@ -11,4 +11,6 @@ dask-worker-space/
*.egg-info/
.coverage
target/
.venv/
.venv/
build/*
*.egg
15 changes: 15 additions & 0 deletions README.md
@@ -48,6 +48,21 @@ This is to ensure that you get the version that is compatible with your system.
conda install --yes -c conda-forge 'lightgbm>=3.3.3'
```

### Installing Extra Features with pip

TPOT2 can optionally use the Intel extensions for `scikit-learn` (`sklearnex`) to accelerate training. To install TPOT2 with this extra feature, run:

```
pip install tpot2[sklearnex]
```

Please note that while these extensions can accelerate scikit-learn estimators, there are some important considerations:

- These extensions may not be fully developed and tested on Arm-based CPUs, such as M1 Macs. You might encounter compatibility issues or reduced performance on such systems.
- We recommend using Python 3.9 when installing these extra features, as it provides better compatibility and stability.
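Once the extra is installed, the accelerations are enabled by calling `patch_sklearn()` before importing estimators. A minimal sketch, guarded so it also runs where the extension is unavailable (e.g. on Arm-based CPUs):

```python
# Sketch: enabling the sklearnex accelerations after `pip install tpot2[sklearnex]`.
# The import is guarded so the snippet degrades gracefully when the extension
# is not installed; plain scikit-learn is used in that case.
try:
    from sklearnex import patch_sklearn
    patch_sklearn()  # scikit-learn estimators imported after this use the Intel backend
    accelerated = True
except ImportError:
    accelerated = False

print("sklearnex active:", accelerated)
```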


### Developer/Latest Branch Installation


31 changes: 16 additions & 15 deletions Tutorial/3_Genetic_Feature_Set_Selectors.ipynb
@@ -7,12 +7,12 @@
"source": [
"The FeatureSetSelector is a subclass of sklearn.feature_selection.SelectorMixin that simply returns the manually specified columns. The parameter sel_subset specifies the name or index of the column(s) it selects. The transform function then simply indexes and returns the selected columns. You can also optionally name the group with the name parameter, though this is only for note keeping and is not used by the class.\n",
"\n",
"```\n",
"\n",
"sel_subset: list or int\n",
" If X is a dataframe, items in sel_subset list must correspond to column names\n",
" If X is a numpy array, items in sel_subset list must correspond to column indexes\n",
" int: index of a single column\n",
"```\n",
"\n",
"\n"
]
},
@@ -75,51 +75,52 @@
"source": [
"To use the FSS with TPOT2, you can simply pass it in to the configuration dictionary. Note that the FSS is only well defined when used in the leaf nodes of the graph. This is because downstream nodes will receive different transformations of the data such that the original indexes no longer correspond to the same columns in the raw data.\n",
"\n",
"In TPOT2, including the string `\"feature_set_selector\"` in the `leaf_config_dict` parameter will include the FSS in the search space of the pipeline. By default, each FSS node will select a single column. You can also group columns into sets so that each node selects a set of features rather than a single feature.\n",
"In TPOT2, including the string \"feature_set_selector\" in the leaf_config_dict parameter will include the FSS in the search space of the pipeline. By default, each FSS node will select a single column. You can also group columns into sets so that each node selects a set of features rather than a single feature.\n",
"\n",
"\n",
"\n",
"``` \n",
"subsets : str or list, default=None\n",
"    Sets the subsets that the FeatureSetSelector will select from if set as an option in one of the configuration dictionaries.\n",
" - str : If a string, it is assumed to be a path to a csv file with the subsets. \n",
" The first column is assumed to be the name of the subset and the remaining columns are the features in the subset.\n",
" - list or np.ndarray : If a list or np.ndarray, it is assumed to be a list of subsets.\n",
" - None : If None, each column will be treated as a subset. One column will be selected per subset.\n",
" If subsets is None, each column will be treated as a subset. One column will be selected per subset.\n",
"```\n",
"\n",
"\n",
"Let's say you want three groups of features, with three columns each. The following examples are equivalent:\n",
"\n",
"### str\n",
"\n",
"`sel_subsets=simple_fss.csv`\n",
"sel_subsets=simple_fss.csv\n",
"\n",
"\n",
"\\# simple_fss.csv\n",
"```\n",
"\n",
"group_one, 1,2,3\n",
"\n",
"group_two, 4,5,6\n",
"\n",
"group_three, 7,8,9\n",
"```\n",
"\n",
"\n",
"### dict\n",
"\n",
"```\n",
"\n",
"sel_subsets = { \"group_one\" : [1,2,3],\n",
" \"group_two\" : [4,5,6],\n",
" \"group_three\" : [7,8,9],\n",
" }\n",
"```\n",
"\n",
"\n",
"### list\n",
"\n",
"```\n",
"\n",
"sel_subsets = [[1,2,3],[4,5,6],[7,8,9]]\n",
" \n",
"```\n",
"\n",
"\n",
"(As the FSS is just another transformer, you could also pass it in with the standard configuration dictionary format (described in tutorial 2), in which case you would have to define your own function that returns a hyperparameter dictionary, similar to the `params_LogisticRegression` function below.)\n",
"\n",
"(As the FSS is just another transformer, you could also pass it in with the standard configuration dictionary format (described in tutorial 2), in which case you would have to define your own function that returns a hyperparameter dictionary, similar to the params_LogisticRegression function below.)\n",
"\n",
"\n",
"(In the future, FSS will be treated as a special case node with its own mutation/crossover functions to make it more efficient when there are large numbers of features.)"
@@ -1132,7 +1133,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
"version": "3.10.11"
},
"orig_nbformat": 4,
"vscode": {
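The three equivalent subset formats from the tutorial can be sketched in plain Python. This is an illustration of the indexing behaviour only, not the actual TPOT2 `FeatureSetSelector` class:

```python
# The three equivalent ways of defining three groups of three columns each.

# dict form: group name -> column indexes
subsets_dict = {
    "group_one": [1, 2, 3],
    "group_two": [4, 5, 6],
    "group_three": [7, 8, 9],
}

# list form: the same groups without names
subsets_list = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# csv form: one "name, col, col, col" row per group, as in simple_fss.csv
csv_text = "group_one, 1,2,3\ngroup_two, 4,5,6\ngroup_three, 7,8,9"
subsets_from_csv = {
    row.split(",")[0].strip(): [int(c) for c in row.split(",")[1:]]
    for row in csv_text.splitlines()
}

def select_subset(X, columns):
    """Return only the listed column indexes of a row-major table X,
    mimicking what a feature-set selector's transform does."""
    return [[row[c] for c in columns] for row in X]

X = [list(range(10)) for _ in range(2)]  # toy data: 2 rows, columns 0..9
print(select_subset(X, subsets_dict["group_two"]))  # → [[4, 5, 6], [4, 5, 6]]
```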
1 change: 1 addition & 0 deletions setup.py
@@ -52,6 +52,7 @@ def calculate_version():
    extras_require={
        'skrebate': ['skrebate>=0.3.4'],
        'mdr': ['scikit-mdr>=0.4.4'],
        'sklearnex': ['scikit-learn-intelex>=2023.2.1'],
    },
    classifiers=[
        'Intended Audience :: Science/Research',
6 changes: 6 additions & 0 deletions tpot2/config/__init__.py
@@ -7,6 +7,12 @@
from .autoqtl_builtins import make_FeatureEncodingFrequencySelector_config_dictionary, make_genetic_encoders_config_dictionary
from .hyperparametersuggestor import *

try:
    from .classifiers_sklearnex import make_sklearnex_classifier_config_dictionary
    from .regressors_sklearnex import make_sklearnex_regressor_config_dictionary
except ModuleNotFoundError:  # if optional packages are not installed
    pass

try:
    from .mdr_configs import make_skrebate_config_dictionary, make_MDR_config_dictionary, make_ContinuousMDR_config_dictionary
except:  # if optional packages are not installed
73 changes: 73 additions & 0 deletions tpot2/config/classifiers_sklearnex.py
@@ -0,0 +1,73 @@
from sklearnex.ensemble import RandomForestClassifier
from sklearnex.neighbors import KNeighborsClassifier
from sklearnex.svm import SVC
from sklearnex.svm import NuSVC
from sklearnex.linear_model import LogisticRegression


def params_RandomForestClassifier(trial, name=None):
    return {
        'n_estimators': 100,
        'bootstrap': trial.suggest_categorical(name=f'bootstrap_{name}', choices=[True, False]),
        'min_samples_split': trial.suggest_int(f'min_samples_split_{name}', 2, 20),
        'min_samples_leaf': trial.suggest_int(f'min_samples_leaf_{name}', 1, 20),
        'n_jobs': 1,
    }


def params_KNeighborsClassifier(trial, name=None, n_samples=10):
    # cap the neighbor count at the number of samples (at most 100)
    n_neighbors_max = min(n_samples, 100)
    return {
        'n_neighbors': trial.suggest_int(f'n_neighbors_{name}', 1, n_neighbors_max, log=True),
        'weights': trial.suggest_categorical(f'weights_{name}', ['uniform', 'distance']),
    }


def params_LogisticRegression(trial, name=None):
    params = {}
    params['dual'] = False
    params['penalty'] = 'l2'
    params['solver'] = trial.suggest_categorical(name=f'solver_{name}', choices=['liblinear', 'sag', 'saga'])
    if params['solver'] == 'liblinear':
        params['penalty'] = trial.suggest_categorical(name=f'penalty_{name}', choices=['l1', 'l2'])
        if params['penalty'] == 'l2':
            params['dual'] = trial.suggest_categorical(name=f'dual_{name}', choices=[True, False])
        else:
            params['penalty'] = 'l1'
    return {
        'solver': params['solver'],
        'penalty': params['penalty'],
        'dual': params['dual'],
        'C': trial.suggest_float(f'C_{name}', 1e-4, 1e4, log=True),
        'max_iter': 1000,
    }


def params_SVC(trial, name=None):
    return {
        'kernel': trial.suggest_categorical(name=f'kernel_{name}', choices=['poly', 'rbf', 'linear', 'sigmoid']),
        'C': trial.suggest_float(f'C_{name}', 1e-4, 25, log=True),
        'degree': trial.suggest_int(f'degree_{name}', 1, 4),
        'class_weight': trial.suggest_categorical(name=f'class_weight_{name}', choices=[None, 'balanced']),
        'max_iter': 3000,
        'tol': 0.005,
        'probability': True,
    }


def params_NuSVC(trial, name=None):
    return {
        'nu': trial.suggest_float(f'nu_{name}', 0.05, 1.0),
        'kernel': trial.suggest_categorical(name=f'kernel_{name}', choices=['poly', 'rbf', 'linear', 'sigmoid']),
        'C': trial.suggest_float(f'C_{name}', 1e-4, 25, log=True),
        'degree': trial.suggest_int(f'degree_{name}', 1, 4),
        'class_weight': trial.suggest_categorical(name=f'class_weight_{name}', choices=[None, 'balanced']),
        'max_iter': 3000,
        'tol': 0.005,
        'probability': True,
    }


def make_sklearnex_classifier_config_dictionary(n_samples=10, n_classes=None):
    return {
        RandomForestClassifier: params_RandomForestClassifier,
        KNeighborsClassifier: params_KNeighborsClassifier,
        LogisticRegression: params_LogisticRegression,
        SVC: params_SVC,
        NuSVC: params_NuSVC,
    }
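The `params_*` functions above follow the Optuna `trial.suggest_*` call convention: each call draws one hyperparameter. A minimal sketch of that convention, using a hypothetical `StubTrial` in place of a real Optuna trial and re-stating `params_RandomForestClassifier` so the snippet is self-contained:

```python
import random

class StubTrial:
    """Hypothetical stand-in for an Optuna trial object."""
    def suggest_int(self, name, low, high, log=False):
        return random.randint(low, high)
    def suggest_float(self, name, low, high, log=False):
        return random.uniform(low, high)  # log scaling not emulated here
    def suggest_categorical(self, name=None, choices=None):
        return random.choice(choices)

def params_RandomForestClassifier(trial, name=None):
    # same shape as the function in the file above
    return {
        'n_estimators': 100,
        'bootstrap': trial.suggest_categorical(name=f'bootstrap_{name}', choices=[True, False]),
        'min_samples_split': trial.suggest_int(f'min_samples_split_{name}', 2, 20),
        'min_samples_leaf': trial.suggest_int(f'min_samples_leaf_{name}', 1, 20),
        'n_jobs': 1,
    }

params = params_RandomForestClassifier(StubTrial(), name='rf')
print(params)
```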
84 changes: 84 additions & 0 deletions tpot2/config/regressors_sklearnex.py
@@ -0,0 +1,84 @@
from sklearnex.linear_model import LinearRegression
from sklearnex.linear_model import Ridge
from sklearnex.linear_model import Lasso
from sklearnex.linear_model import ElasticNet

from sklearnex.svm import SVR
from sklearnex.svm import NuSVR

from sklearnex.ensemble import RandomForestRegressor
from sklearnex.neighbors import KNeighborsRegressor


def params_RandomForestRegressor(trial, name=None):
    return {
        'n_estimators': 100,
        'max_features': trial.suggest_float(f'max_features_{name}', 0.05, 1.0),
        'bootstrap': trial.suggest_categorical(name=f'bootstrap_{name}', choices=[True, False]),
        'min_samples_split': trial.suggest_int(f'min_samples_split_{name}', 2, 21),
        'min_samples_leaf': trial.suggest_int(f'min_samples_leaf_{name}', 1, 21),
    }


def params_KNeighborsRegressor(trial, name=None, n_samples=100):
    # cap the neighbor count at the number of samples (at most 100)
    n_neighbors_max = min(n_samples, 100)
    return {
        'n_neighbors': trial.suggest_int(f'n_neighbors_{name}', 1, n_neighbors_max),
        'weights': trial.suggest_categorical(f'weights_{name}', ['uniform', 'distance']),
    }


def params_LinearRegression(trial, name=None):
    return {}


def params_Ridge(trial, name=None):
    return {
        'alpha': trial.suggest_float(f'alpha_{name}', 0.0, 1.0),
        'fit_intercept': True,
        'tol': trial.suggest_float(f'tol_{name}', 1e-5, 1e-1, log=True),
    }


def params_Lasso(trial, name=None):
    return {
        'alpha': trial.suggest_float(f'alpha_{name}', 0.0, 1.0),
        'fit_intercept': True,
        'precompute': trial.suggest_categorical(f'precompute_{name}', [True, False, 'auto']),
        'tol': trial.suggest_float(f'tol_{name}', 1e-5, 1e-1, log=True),
        'positive': trial.suggest_categorical(f'positive_{name}', [True, False]),
        'selection': trial.suggest_categorical(f'selection_{name}', ['cyclic', 'random']),
    }


def params_ElasticNet(trial, name=None):
    return {
        'alpha': 1 - trial.suggest_float(f'alpha_{name}', 0.0, 1.0),
        'l1_ratio': 1 - trial.suggest_float(f'l1_ratio_{name}', 0.0, 1.0),
    }


def params_SVR(trial, name=None):
    return {
        'kernel': trial.suggest_categorical(name=f'kernel_{name}', choices=['poly', 'rbf', 'linear', 'sigmoid']),
        'C': trial.suggest_float(f'C_{name}', 1e-4, 25, log=True),
        'degree': trial.suggest_int(f'degree_{name}', 1, 4),
        'max_iter': 3000,
        'tol': 0.005,
    }


def params_NuSVR(trial, name=None):
    return {
        'nu': trial.suggest_float(f'nu_{name}', 0.05, 1.0),
        'kernel': trial.suggest_categorical(name=f'kernel_{name}', choices=['poly', 'rbf', 'linear', 'sigmoid']),
        'C': trial.suggest_float(f'C_{name}', 1e-4, 25, log=True),
        'degree': trial.suggest_int(f'degree_{name}', 1, 4),
        'max_iter': 3000,
        'tol': 0.005,
    }


def make_sklearnex_regressor_config_dictionary(n_samples=10):
    return {
        RandomForestRegressor: params_RandomForestRegressor,
        KNeighborsRegressor: params_KNeighborsRegressor,
        LinearRegression: params_LinearRegression,
        Ridge: params_Ridge,
        Lasso: params_Lasso,
        ElasticNet: params_ElasticNet,
        SVR: params_SVR,
        NuSVR: params_NuSVR,
    }
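A config dictionary of this shape maps each estimator class to its hyperparameter sampler, so a consumer can instantiate a model as `cls(**sampler(trial))`. A sketch of that consumption pattern, where `DummyRidge` and `StubTrial` are hypothetical stand-ins so the example runs without sklearnex installed:

```python
import random

class StubTrial:
    """Hypothetical stand-in for an Optuna trial object."""
    def suggest_float(self, name, low, high, log=False):
        return random.uniform(low, high)  # log scaling not emulated here

class DummyRidge:
    """Stand-in estimator so the sketch runs without sklearnex."""
    def __init__(self, alpha=1.0, fit_intercept=True, tol=1e-3):
        self.alpha, self.fit_intercept, self.tol = alpha, fit_intercept, tol

def params_Ridge(trial, name=None):
    # same shape as the function in the file above
    return {
        'alpha': trial.suggest_float(f'alpha_{name}', 0.0, 1.0),
        'fit_intercept': True,
        'tol': trial.suggest_float(f'tol_{name}', 1e-5, 1e-1, log=True),
    }

# class -> sampler, the shape the make_*_config_dictionary functions return
config = {DummyRidge: params_Ridge}
cls, sampler = next(iter(config.items()))
model = cls(**sampler(StubTrial(), name='ridge'))
print(type(model).__name__)  # → DummyRidge
```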