This is an unofficial tutorial on using Optuna on the ABCI infrastructure.
This tutorial describes:
- How to launch Optuna storage on an interactive node.
- How to parallelize single-node ML training.
- How to parallelize multi-node, MPI-based ML training.
First, launch an interactive node and start a PostgreSQL server that will serve as Optuna's storage backend:
$ GROUP=<YOUR_GROUP>
$ qrsh -g $GROUP -l rt_C.small=1 -l h_rt=12:00:00
$ module load singularity/2.6.1
$ singularity build postgres.img docker://postgres
$ mkdir postgres_data
$ singularity run -B postgres_data:/var/lib/postgresql/data postgres.img /docker-entrypoint.sh postgres
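Keep this session alive while running the experiments; the Optuna workers will connect to this PostgreSQL server. The node's hostname, which you will need as STORAGE_HOST below, can be checked with:
$ hostname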
The RDB URL is as follows:
$ STORAGE_HOST=<HOST_WHERE_POSTGRES_IS_RUNNING> # e.g., STORAGE_HOST=g0002
$ STORAGE_URL=postgres://postgres@$STORAGE_HOST:5432/
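Optionally, you can confirm that the server is reachable from another node before going further. The following small script is not part of the original example; it only checks that the PostgreSQL port is open, using nothing but the Python standard library.
# check_storage.py -- optional connectivity check (not part of the example repository).
import socket
import sys

if __name__ == '__main__':
    host = sys.argv[1]  # e.g., g0002
    # Open a TCP connection to the default PostgreSQL port and close it again.
    with socket.create_connection((host, 5432), timeout=5):
        print('PostgreSQL at {}:5432 is reachable.'.format(host))
Run it as follows:
$ python check_storage.py $STORAGE_HOST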
Pull the Horovod image and start a container:
$ module load singularity/2.6.1
$ singularity pull docker://uber/horovod:0.15.2-tf1.12.0-torch1.0.0-py3.5
$ singularity shell --nv horovod-0.15.2-tf1.12.0-torch1.0.0-py3.5.simg
Inside the container, install the Python dependencies under your user directory:
$ pip install --user mpi4py psycopg2-binary
# hvd.broadcast_variables is not supported in older versions of Horovod
$ pip install --user -U horovod
To run MPI-based training, you need to install a development branch of Optuna, because the MPIStudy class has not been merged into master yet.
$ pip uninstall optuna # If you've already installed Optuna.
$ pip install --user git+https://github.com/pfnet/optuna.git@horovod-examples
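To double-check that this development version is the one being picked up inside the container, you can print the installed version (optional):
$ python -c "import optuna; print(optuna.__version__)"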
Let's parallelize a simple Optuna script that optimizes a quadratic function.
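The example script abci-optuna-horovod-example/quadratic.py takes a study name and a storage URL as command-line arguments and runs the optimization against the shared storage. A minimal sketch of such a script is shown below; the actual file in the repository may differ, and the exact study-loading API depends on the Optuna version.
# quadratic.py -- minimal sketch; the actual script in the example repository may differ.
import sys

import optuna


def objective(trial):
    # Minimize a simple quadratic function of a single parameter.
    x = trial.suggest_uniform('x', -10, 10)  # newer Optuna versions use suggest_float
    return (x - 2) ** 2


if __name__ == '__main__':
    study_name, storage_url = sys.argv[1], sys.argv[2]
    # Attach to the study created with `optuna create-study`.
    # Depending on the Optuna version, this may be
    # optuna.Study(study_name=..., storage=...) instead of optuna.load_study(...).
    study = optuna.load_study(study_name=study_name, storage=storage_url)
    study.optimize(objective, n_trials=100)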
Set up the RDB URL and create a study identifier:
$ STORAGE_HOST=<HOST_WHERE_POSTGRES_IS_RUNNING>
$ STORAGE_URL=postgres://postgres@$STORAGE_HOST:5432/
$ STUDY_NAME=`~/.local/bin/optuna create-study --storage $STORAGE_URL`
Set up a shell script for the qsub command, e.g.:
$ echo "module load singularity/2.6.1" >> run_quadratic.sh
$ echo "singularity exec --nv horovod-0.15.2-tf1.12.0-torch1.0.0-py3.5.simg python abci-optuna-horovod-example/quadratic.py $STUDY_NAME $STORAGE_URL" >> run_quadratic.sh
(Note that singularity exec, not singularity shell, is used here so that the Python command actually runs inside the container when the batch job executes the script.)
You can parallelize the optimization just by submitting multiple jobs. For example, the following commands simultaneously run three workers on the same study.
$ GROUP=<YOUR_GROUP>
$ qsub -g $GROUP -l rt_C.small=1 run_quadratic.sh
$ qsub -g $GROUP -l rt_C.small=1 run_quadratic.sh
$ qsub -g $GROUP -l rt_C.small=1 run_quadratic.sh
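You can monitor the submitted jobs with qstat:
$ qstat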
You can list the optimization history as follows.
$ python print_study_history.py $STUDY_NAME $STORAGE_URL
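If you do not have print_study_history.py at hand, a minimal version along the following lines will do the job (a sketch; the script shipped with the example repository may differ):
# print_study_history.py -- minimal sketch; the actual script may differ.
import sys

import optuna


if __name__ == '__main__':
    study_name, storage_url = sys.argv[1], sys.argv[2]
    # Depending on the Optuna version, this may be
    # optuna.Study(study_name=..., storage=...) instead of optuna.load_study(...).
    study = optuna.load_study(study_name=study_name, storage=storage_url)
    for i, trial in enumerate(study.trials):
        print(i, trial.state, trial.value, trial.params)
    print('Best value: {} (params: {})'.format(study.best_value, study.best_params))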
Let's parallelize a script written with Horovod and TensorFlow.
Download MNIST data:
$ wget -O ~/mnist.npz https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
Here, we'll run the example on an interactive node. (You can also consolidate the following commands into a batch job.)
$ GROUP=<YOUR_GROUP>
$ qrsh -g $GROUP -l rt_F=1 -l h_rt=01:00:00
Run a container:
$ module load singularity/2.6.1
$ singularity shell --nv horovod-0.15.2-tf1.12.0-torch1.0.0-py3.5.simg
Create a study identifier in the container:
$ GROUP=<YOUR_GROUP>
$ STORAGE_HOST=<HOST_WHERE_POSTGRES_IS_RUNNING>
$ STORAGE_URL=postgres://postgres@$STORAGE_HOST:5432/
$ STUDY_NAME=`~/.local/bin/optuna create-study --storage $STORAGE_URL`
To run the MPI example:
$ mpirun -np 2 -bind-to none -map-by slot -- python tensorflow_mnist_eager_optuna.py $STUDY_NAME $STORAGE_URL
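Here, -np 2 launches two MPI processes on the node. If your rt_F node has four GPUs, you can use all of them by increasing the process count accordingly, e.g.:
$ mpirun -np 4 -bind-to none -map-by slot -- python tensorflow_mnist_eager_optuna.py $STUDY_NAME $STORAGE_URL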
You can list the optimization history as follows.
$ python print_study_history.py $STUDY_NAME $STORAGE_URL