Merge pull request #104 from EpistasisLab/dev
Dev
nickotto authored Oct 13, 2023
2 parents 07a187a + d7ff57f commit 031ae09
Showing 14 changed files with 479 additions and 80 deletions.
4 changes: 3 additions & 1 deletion .gitignore
@@ -11,4 +11,6 @@ dask-worker-space/
*.egg-info/
.coverage
target/
.venv/
.venv/
build/*
*.egg
15 changes: 15 additions & 0 deletions README.md
@@ -48,6 +48,21 @@ This is to ensure that you get the version that is compatible with your system.
conda install --yes -c conda-forge 'lightgbm>=3.3.3'
```

### Installing Extra Features with pip

TPOT2 can optionally use the Intel extensions for `scikit-learn` (`sklearnex`) to accelerate training. To install TPOT2 with this extra feature, run:

```
pip install tpot2[sklearnex]
```

Please note that while these extensions can accelerate scikit-learn estimators, there are some important considerations:

- These extensions may not be fully developed and tested on Arm-based CPUs, such as M1 Macs. You might encounter compatibility issues or reduced performance on such systems.
- We recommend using Python 3.9 when installing these extra features, as it provides better compatibility and stability.
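Once the extra is installed, the accelerations are enabled by calling `patch_sklearn()` before importing estimators. A minimal sketch, guarded so it also runs where the extension is unavailable (e.g. on Arm-based CPUs):

```python
# Sketch: enabling the sklearnex accelerations after `pip install tpot2[sklearnex]`.
# The import is guarded so the snippet degrades gracefully when the extension
# is not installed; plain scikit-learn is used in that case.
try:
    from sklearnex import patch_sklearn
    patch_sklearn()  # scikit-learn estimators imported after this use the Intel backend
    accelerated = True
except ImportError:
    accelerated = False

print("sklearnex active:", accelerated)
```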


### Developer/Latest Branch Installation


31 changes: 16 additions & 15 deletions Tutorial/3_Genetic_Feature_Set_Selectors.ipynb
@@ -7,12 +7,12 @@
"source": [
"The FeatureSetSelector is a subclass of sklearn.feature_selection.SelectorMixin that simply returns the manually specified columns. The parameter sel_subset specifies the name or index of the column(s) it selects. The transform function then simply indexes and returns the selected columns. You can also optionally name the group with the name parameter, though this is only for note keeping and is not used by the class.\n",
"\n",
"```\n",
"\n",
"sel_subset: list or int\n",
" If X is a dataframe, items in sel_subset list must correspond to column names\n",
" If X is a numpy array, items in sel_subset list must correspond to column indexes\n",
" int: index of a single column\n",
"```\n",
"\n",
"\n"
]
},
@@ -75,51 +75,52 @@
"source": [
"To use the FSS with TPOT2, you can simply pass it in to the configuration dictionary. Note that the FSS is only well defined when used in the leaf nodes of the graph. This is because downstream nodes will receive different transformations of the data such that the original indexes no longer correspond to the same columns in the raw data.\n",
"\n",
"In TPOT2, including the string `\"feature_set_selector\"` in the `leaf_config_dict` parameter will include the FSS in the search space of the pipeline. By default, each FSS node will select a single column. You can also group columns into sets so that each node selects a set of features rather than a single feature.\n",
"In TPOT2, including the string \"feature_set_selector\" in the leaf_config_dict parameter will include the FSS in the search space of the pipeline. By default, each FSS node will select a single column. You can also group columns into sets so that each node selects a set of features rather than a single feature.\n",
"\n",
"\n",
"\n",
"``` \n",
"subsets : str or list, default=None\n",
"    Sets the subsets that the FeatureSetSelector will select from if set as an option in one of the configuration dictionaries.\n",
" - str : If a string, it is assumed to be a path to a csv file with the subsets. \n",
" The first column is assumed to be the name of the subset and the remaining columns are the features in the subset.\n",
" - list or np.ndarray : If a list or np.ndarray, it is assumed to be a list of subsets.\n",
" - None : If None, each column will be treated as a subset. One column will be selected per subset.\n",
" If subsets is None, each column will be treated as a subset. One column will be selected per subset.\n",
"```\n",
"\n",
"\n",
"Let's say you want three groups of features, with three columns each. The following examples are equivalent:\n",
"\n",
"### str\n",
"\n",
"`sel_subsets=simple_fss.csv`\n",
"sel_subsets=simple_fss.csv\n",
"\n",
"\n",
"\\# simple_fss.csv\n",
"```\n",
"\n",
"group_one, 1,2,3\n",
"\n",
"group_two, 4,5,6\n",
"\n",
"group_three, 7,8,9\n",
"```\n",
"\n",
"\n",
"### dict\n",
"\n",
"```\n",
"\n",
"sel_subsets = { \"group_one\" : [1,2,3],\n",
" \"group_two\" : [4,5,6],\n",
" \"group_three\" : [7,8,9],\n",
" }\n",
"```\n",
"\n",
"\n",
"### list\n",
"\n",
"```\n",
"\n",
"sel_subsets = [[1,2,3],[4,5,6],[7,8,9]]\n",
" \n",
"```\n",
"\n",
"\n",
"(As the FSS is just another transformer, you could also pass it in with the standard configuration dictionary format (described in tutorial 2), in which case you would have to define your own function that returns a hyperparameter dictionary, similar to the `params_LogisticRegression` function below.)\n",
"\n",
"(As the FSS is just another transformer, you could also pass it in with the standard configuration dictionary format (described in tutorial 2), in which case you would have to define your own function that returns a hyperparameter dictionary, similar to the params_LogisticRegression function below.)\n",
"\n",
"\n",
"(In the future, FSS will be treated as a special case node with its own mutation/crossover functions to make it more efficient when there are large numbers of features.)"
@@ -1132,7 +1133,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
"version": "3.10.11"
},
"orig_nbformat": 4,
"vscode": {
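The three equivalent subset formats from the tutorial can be sketched in plain Python. This is an illustration of the indexing behaviour only, not the actual TPOT2 `FeatureSetSelector` class:

```python
# The three equivalent ways of defining three groups of three columns each.

# dict form: group name -> column indexes
subsets_dict = {
    "group_one": [1, 2, 3],
    "group_two": [4, 5, 6],
    "group_three": [7, 8, 9],
}

# list form: the same groups without names
subsets_list = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# csv form: one "name, col, col, col" row per group, as in simple_fss.csv
csv_text = "group_one, 1,2,3\ngroup_two, 4,5,6\ngroup_three, 7,8,9"
subsets_from_csv = {
    row.split(",")[0].strip(): [int(c) for c in row.split(",")[1:]]
    for row in csv_text.splitlines()
}

def select_subset(X, columns):
    """Return only the listed column indexes of a row-major table X,
    mimicking what a feature-set selector's transform does."""
    return [[row[c] for c in columns] for row in X]

X = [list(range(10)) for _ in range(2)]  # toy data: 2 rows, columns 0..9
print(select_subset(X, subsets_dict["group_two"]))  # → [[4, 5, 6], [4, 5, 6]]
```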
1 change: 1 addition & 0 deletions setup.py
@@ -52,6 +52,7 @@ def calculate_version():
    extras_require={
        'skrebate': ['skrebate>=0.3.4'],
        'mdr': ['scikit-mdr>=0.4.4'],
        'sklearnex': ['scikit-learn-intelex>=2023.2.1'],
    },
    classifiers=[
        'Intended Audience :: Science/Research',
6 changes: 6 additions & 0 deletions tpot2/config/__init__.py
@@ -7,6 +7,12 @@
from .autoqtl_builtins import make_FeatureEncodingFrequencySelector_config_dictionary, make_genetic_encoders_config_dictionary
from .hyperparametersuggestor import *

try:
    from .classifiers_sklearnex import make_sklearnex_classifier_config_dictionary
    from .regressors_sklearnex import make_sklearnex_regressor_config_dictionary
except ModuleNotFoundError:  # if optional packages are not installed
    pass

try:
    from .mdr_configs import make_skrebate_config_dictionary, make_MDR_config_dictionary, make_ContinuousMDR_config_dictionary
except:  # if optional packages are not installed
73 changes: 73 additions & 0 deletions tpot2/config/classifiers_sklearnex.py
@@ -0,0 +1,73 @@
from sklearnex.ensemble import RandomForestClassifier
from sklearnex.neighbors import KNeighborsClassifier
from sklearnex.svm import SVC
from sklearnex.svm import NuSVC
from sklearnex.linear_model import LogisticRegression


def params_RandomForestClassifier(trial, name=None):
    return {
        'n_estimators': 100,
        'bootstrap': trial.suggest_categorical(name=f'bootstrap_{name}', choices=[True, False]),
        'min_samples_split': trial.suggest_int(f'min_samples_split_{name}', 2, 20),
        'min_samples_leaf': trial.suggest_int(f'min_samples_leaf_{name}', 1, 20),
        'n_jobs': 1,
    }


def params_KNeighborsClassifier(trial, name=None, n_samples=10):
    # cap the neighbor count at the number of samples (at most 100)
    n_neighbors_max = min(n_samples, 100)
    return {
        'n_neighbors': trial.suggest_int(f'n_neighbors_{name}', 1, n_neighbors_max, log=True),
        'weights': trial.suggest_categorical(f'weights_{name}', ['uniform', 'distance']),
    }


def params_LogisticRegression(trial, name=None):
    params = {}
    params['dual'] = False
    params['penalty'] = 'l2'
    params['solver'] = trial.suggest_categorical(name=f'solver_{name}', choices=['liblinear', 'sag', 'saga'])
    if params['solver'] == 'liblinear':
        params['penalty'] = trial.suggest_categorical(name=f'penalty_{name}', choices=['l1', 'l2'])
        if params['penalty'] == 'l2':
            params['dual'] = trial.suggest_categorical(name=f'dual_{name}', choices=[True, False])
        else:
            params['penalty'] = 'l1'
    return {
        'solver': params['solver'],
        'penalty': params['penalty'],
        'dual': params['dual'],
        'C': trial.suggest_float(f'C_{name}', 1e-4, 1e4, log=True),
        'max_iter': 1000,
    }


def params_SVC(trial, name=None):
    return {
        'kernel': trial.suggest_categorical(name=f'kernel_{name}', choices=['poly', 'rbf', 'linear', 'sigmoid']),
        'C': trial.suggest_float(f'C_{name}', 1e-4, 25, log=True),
        'degree': trial.suggest_int(f'degree_{name}', 1, 4),
        'class_weight': trial.suggest_categorical(name=f'class_weight_{name}', choices=[None, 'balanced']),
        'max_iter': 3000,
        'tol': 0.005,
        'probability': True,
    }


def params_NuSVC(trial, name=None):
    return {
        'nu': trial.suggest_float(f'nu_{name}', 0.05, 1.0),
        'kernel': trial.suggest_categorical(name=f'kernel_{name}', choices=['poly', 'rbf', 'linear', 'sigmoid']),
        'C': trial.suggest_float(f'C_{name}', 1e-4, 25, log=True),
        'degree': trial.suggest_int(f'degree_{name}', 1, 4),
        'class_weight': trial.suggest_categorical(name=f'class_weight_{name}', choices=[None, 'balanced']),
        'max_iter': 3000,
        'tol': 0.005,
        'probability': True,
    }


def make_sklearnex_classifier_config_dictionary(n_samples=10, n_classes=None):
    return {
        RandomForestClassifier: params_RandomForestClassifier,
        KNeighborsClassifier: params_KNeighborsClassifier,
        LogisticRegression: params_LogisticRegression,
        SVC: params_SVC,
        NuSVC: params_NuSVC,
    }
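The `params_*` functions above follow the Optuna `trial.suggest_*` call convention: each call draws one hyperparameter. A minimal sketch of that convention, using a hypothetical `StubTrial` in place of a real Optuna trial and re-stating `params_RandomForestClassifier` so the snippet is self-contained:

```python
import random

class StubTrial:
    """Hypothetical stand-in for an Optuna trial object."""
    def suggest_int(self, name, low, high, log=False):
        return random.randint(low, high)
    def suggest_float(self, name, low, high, log=False):
        return random.uniform(low, high)  # log scaling not emulated here
    def suggest_categorical(self, name=None, choices=None):
        return random.choice(choices)

def params_RandomForestClassifier(trial, name=None):
    # same shape as the function in the file above
    return {
        'n_estimators': 100,
        'bootstrap': trial.suggest_categorical(name=f'bootstrap_{name}', choices=[True, False]),
        'min_samples_split': trial.suggest_int(f'min_samples_split_{name}', 2, 20),
        'min_samples_leaf': trial.suggest_int(f'min_samples_leaf_{name}', 1, 20),
        'n_jobs': 1,
    }

params = params_RandomForestClassifier(StubTrial(), name='rf')
print(params)
```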
84 changes: 84 additions & 0 deletions tpot2/config/regressors_sklearnex.py
@@ -0,0 +1,84 @@
from sklearnex.linear_model import LinearRegression
from sklearnex.linear_model import Ridge
from sklearnex.linear_model import Lasso
from sklearnex.linear_model import ElasticNet

from sklearnex.svm import SVR
from sklearnex.svm import NuSVR

from sklearnex.ensemble import RandomForestRegressor
from sklearnex.neighbors import KNeighborsRegressor


def params_RandomForestRegressor(trial, name=None):
    return {
        'n_estimators': 100,
        'max_features': trial.suggest_float(f'max_features_{name}', 0.05, 1.0),
        'bootstrap': trial.suggest_categorical(name=f'bootstrap_{name}', choices=[True, False]),
        'min_samples_split': trial.suggest_int(f'min_samples_split_{name}', 2, 21),
        'min_samples_leaf': trial.suggest_int(f'min_samples_leaf_{name}', 1, 21),
    }


def params_KNeighborsRegressor(trial, name=None, n_samples=100):
    # cap the neighbor count at the number of samples (at most 100)
    n_neighbors_max = min(n_samples, 100)
    return {
        'n_neighbors': trial.suggest_int(f'n_neighbors_{name}', 1, n_neighbors_max),
        'weights': trial.suggest_categorical(f'weights_{name}', ['uniform', 'distance']),
    }


def params_LinearRegression(trial, name=None):
    return {}


def params_Ridge(trial, name=None):
    return {
        'alpha': trial.suggest_float(f'alpha_{name}', 0.0, 1.0),
        'fit_intercept': True,
        'tol': trial.suggest_float(f'tol_{name}', 1e-5, 1e-1, log=True),
    }


def params_Lasso(trial, name=None):
    return {
        'alpha': trial.suggest_float(f'alpha_{name}', 0.0, 1.0),
        'fit_intercept': True,
        'precompute': trial.suggest_categorical(f'precompute_{name}', [True, False, 'auto']),
        'tol': trial.suggest_float(f'tol_{name}', 1e-5, 1e-1, log=True),
        'positive': trial.suggest_categorical(f'positive_{name}', [True, False]),
        'selection': trial.suggest_categorical(f'selection_{name}', ['cyclic', 'random']),
    }


def params_ElasticNet(trial, name=None):
    return {
        'alpha': 1 - trial.suggest_float(f'alpha_{name}', 0.0, 1.0),
        'l1_ratio': 1 - trial.suggest_float(f'l1_ratio_{name}', 0.0, 1.0),
    }


def params_SVR(trial, name=None):
    return {
        'kernel': trial.suggest_categorical(name=f'kernel_{name}', choices=['poly', 'rbf', 'linear', 'sigmoid']),
        'C': trial.suggest_float(f'C_{name}', 1e-4, 25, log=True),
        'degree': trial.suggest_int(f'degree_{name}', 1, 4),
        'max_iter': 3000,
        'tol': 0.005,
    }


def params_NuSVR(trial, name=None):
    return {
        'nu': trial.suggest_float(f'nu_{name}', 0.05, 1.0),
        'kernel': trial.suggest_categorical(name=f'kernel_{name}', choices=['poly', 'rbf', 'linear', 'sigmoid']),
        'C': trial.suggest_float(f'C_{name}', 1e-4, 25, log=True),
        'degree': trial.suggest_int(f'degree_{name}', 1, 4),
        'max_iter': 3000,
        'tol': 0.005,
    }


def make_sklearnex_regressor_config_dictionary(n_samples=10):
    return {
        RandomForestRegressor: params_RandomForestRegressor,
        KNeighborsRegressor: params_KNeighborsRegressor,
        LinearRegression: params_LinearRegression,
        Ridge: params_Ridge,
        Lasso: params_Lasso,
        ElasticNet: params_ElasticNet,
        SVR: params_SVR,
        NuSVR: params_NuSVR,
    }
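A config dictionary of this shape maps each estimator class to its hyperparameter sampler, so a consumer can instantiate a model as `cls(**sampler(trial))`. A sketch of that consumption pattern, where `DummyRidge` and `StubTrial` are hypothetical stand-ins so the example runs without sklearnex installed:

```python
import random

class StubTrial:
    """Hypothetical stand-in for an Optuna trial object."""
    def suggest_float(self, name, low, high, log=False):
        return random.uniform(low, high)  # log scaling not emulated here

class DummyRidge:
    """Stand-in estimator so the sketch runs without sklearnex."""
    def __init__(self, alpha=1.0, fit_intercept=True, tol=1e-3):
        self.alpha, self.fit_intercept, self.tol = alpha, fit_intercept, tol

def params_Ridge(trial, name=None):
    # same shape as the function in the file above
    return {
        'alpha': trial.suggest_float(f'alpha_{name}', 0.0, 1.0),
        'fit_intercept': True,
        'tol': trial.suggest_float(f'tol_{name}', 1e-5, 1e-1, log=True),
    }

# class -> sampler, the shape the make_*_config_dictionary functions return
config = {DummyRidge: params_Ridge}
cls, sampler = next(iter(config.items()))
model = cls(**sampler(StubTrial(), name='ridge'))
print(type(model).__name__)  # → DummyRidge
```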