[Add] documentation and example for parallel computation (automl#322)
* Add documentation and example for parallel computation

* Update examples/40_advanced/example_parallel_n_jobs.py
ravinkohli authored Nov 15, 2021
1 parent 28e1d47 commit 96de622
Showing 3 changed files with 89 additions and 4 deletions.
21 changes: 21 additions & 0 deletions docs/manual.rst
@@ -48,3 +48,24 @@ Auto-PyTorch allows users to inspect the training results and statistics. The fo
>>> automl = TabularClassificationTask()
>>> automl.fit(X_train, y_train)
>>> automl.show_models()

Parallel computation
====================

In its default mode, *Auto-PyTorch* already uses two cores. The first one is used for model building, the second for building an ensemble every time a new machine learning model has finished training.

Beyond this, *Auto-PyTorch* also supports parallel Bayesian optimization via `Dask.distributed <https://distributed.dask.org/>`_. By passing the ``n_jobs`` argument when constructing the estimator, one can control the number of cores available to *Auto-PyTorch* (see the example :ref:`sphx_glr_examples_40_advanced_example_parallel_n_jobs.py`). When multiple cores are available, *Auto-PyTorch* creates one worker per core and uses the available workers both to search for better machine learning models and to build an ensemble from them until the time budget is exhausted.

**Note:** *Auto-PyTorch* requires all workers to have access to a shared file system for storing training data and models.
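
The "one worker per core, until the time budget is exhausted" behaviour described above can be illustrated with a small standard-library sketch. This is not how *Auto-PyTorch* is implemented (it relies on Dask.distributed and Bayesian optimization); the names ``evaluate`` and ``parallel_search`` are hypothetical, and the "training" is a short sleep that returns a pseudo-random score.

```python
import random
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from typing import Tuple


def evaluate(config_id: int) -> Tuple[int, float]:
    """Hypothetical stand-in for training one pipeline; returns (id, score)."""
    rng = random.Random(config_id)  # deterministic score per configuration
    time.sleep(0.01)                # pretend to train
    return config_id, rng.random()


def parallel_search(n_jobs: int, time_limit_secs: float) -> Tuple[int, float]:
    """Keep up to n_jobs evaluations in flight until the time budget is spent."""
    deadline = time.monotonic() + time_limit_secs
    best = (-1, float("-inf"))
    next_id = 0
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        pending = set()
        while time.monotonic() < deadline or pending:
            # Top up the pool while there is budget left.
            while len(pending) < n_jobs and time.monotonic() < deadline:
                pending.add(pool.submit(evaluate, next_id))
                next_id += 1
            if not pending:
                break
            # Block until at least one evaluation finishes, then record it.
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                config_id, score = future.result()
                if score > best[1]:
                    best = (config_id, score)
    return best


best_config, best_score = parallel_search(n_jobs=2, time_limit_secs=0.2)
```

Once the deadline passes, no new evaluations are submitted and the loop only drains the ones still running, which mirrors the idea of a wall-clock limit rather than a fixed number of runs.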

*Auto-PyTorch* employs `threadpoolctl <https://github.com/joblib/threadpoolctl/>`_ to control the number of threads used by scientific libraries such as NumPy or scikit-learn. This limit applies only while models are being built: each pipeline may use at most 1 thread during training, whereas at prediction and scoring time *Auto-PyTorch* enforces no such limit. You can control the resources used by the pipelines by setting the following variables in your environment before running *Auto-PyTorch*:

.. code-block:: shell-session

    $ export OPENBLAS_NUM_THREADS=1
    $ export MKL_NUM_THREADS=1
    $ export OMP_NUM_THREADS=1

For further information about how scikit-learn handles multiprocessing, please check the `Parallelism, resource management, and configuration <https://scikit-learn.org/stable/computing/parallelism.html>`_ documentation from the library.
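
The same three variables can also be set from within Python, as long as this happens before the numerical libraries are imported, since they typically read these values when they are first loaded. The helper below is a minimal sketch; ``limit_native_threads`` and ``THREAD_ENV_VARS`` are hypothetical names, not part of *Auto-PyTorch*.

```python
import os

# The three variables named in the shell example above.
THREAD_ENV_VARS = ("OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS", "OMP_NUM_THREADS")


def limit_native_threads(n_threads: int = 1) -> dict:
    """Set the thread-count variables for this process and return their values."""
    if n_threads < 1:
        raise ValueError("n_threads must be a positive integer")
    value = str(n_threads)  # environment variables are always strings
    for name in THREAD_ENV_VARS:
        os.environ[name] = value
    return {name: os.environ[name] for name in THREAD_ENV_VARS}


# Call this before importing numpy, scikit-learn, or autoPyTorch.
limits = limit_native_threads(1)
```

Child processes started afterwards inherit these values, so the setting carries over to workers spawned from the same process.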
4 changes: 0 additions & 4 deletions examples/40_advanced/README.txt
@@ -6,7 +6,3 @@ Advanced Tabular Dataset Examples
=================================

Advanced examples for using *Auto-PyTorch* on tabular datasets.
We explain
1. How to customise the search space
2. How to split the data according to different resampling strategies
3. How to visualize the results of Auto-PyTorch
68 changes: 68 additions & 0 deletions examples/40_advanced/example_parallel_n_jobs.py
@@ -0,0 +1,68 @@
"""
======================
Tabular Classification
======================

The following example shows how to fit a sample classification model in parallel on 2 cores
with AutoPyTorch
"""
import os
import tempfile as tmp
import warnings

os.environ['JOBLIB_TEMP_FOLDER'] = tmp.gettempdir()
os.environ['OMP_NUM_THREADS'] = '1'
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'

warnings.simplefilter(action='ignore', category=UserWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)

import sklearn.datasets
import sklearn.model_selection

from autoPyTorch.api.tabular_classification import TabularClassificationTask

if __name__ == '__main__':
    ############################################################################
    # Data Loading
    # ============
    X, y = sklearn.datasets.fetch_openml(data_id=40981, return_X_y=True, as_frame=True)
    X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
        X,
        y,
        random_state=1,
    )

    ############################################################################
    # Build and fit a classifier
    # ==========================
    api = TabularClassificationTask(
        n_jobs=2,
        seed=42,
    )

    ############################################################################
    # Search for an ensemble of machine learning algorithms
    # =====================================================
    api.search(
        X_train=X_train,
        y_train=y_train,
        X_test=X_test.copy(),
        y_test=y_test.copy(),
        optimize_metric='accuracy',
        total_walltime_limit=300,
        func_eval_time_limit_secs=50,
        # Each one of the 2 jobs is allocated 3GB
        memory_limit=3072,
    )

    ############################################################################
    # Print the final ensemble performance
    # ====================================
    print(api.run_history, api.trajectory)
    y_pred = api.predict(X_test)
    score = api.score(y_pred, y_test)
    print(score)
    # Print the final ensemble built by AutoPyTorch
    print(api.show_models())
