Horovod in Docker

To streamline installation on GPU machines, we have published a reference Dockerfile so you can get started with Horovod in minutes. The container includes examples in the /examples directory.

Building

Before building, you can modify the Dockerfile to your liking, e.g. to select a different CUDA, TensorFlow, or Python version.

$ mkdir horovod-docker
$ wget -O horovod-docker/Dockerfile https://raw.githubusercontent.com/uber/horovod/master/Dockerfile
$ docker build -t horovod:latest horovod-docker

Running on a single machine

After the container is built, run it using nvidia-docker.

$ nvidia-docker run -it horovod:latest
root@c278c88dd552:/examples# mpirun -np 4 -H localhost:4 python keras_mnist_advanced.py

You may notice that this command omits a few options recommended elsewhere in the documentation: -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH. These options are already set by default in the Docker container, so you don't need to repeat them in the command.
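For comparison, here is a minimal sketch of the same single-machine run with those options written out, as it might look in an environment where they are not preconfigured. The variable names are illustrative, and the script only prints the command rather than executing it.

```shell
#!/bin/sh
# Sketch: spell out the options that the container otherwise sets by default.
set -eu

np=4   # number of processes, matching the single-machine example

# Options quoted from the paragraph above; only needed where they are not preconfigured.
extra_opts="-bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH"

cmd="mpirun -np $np -H localhost:$np $extra_opts python keras_mnist_advanced.py"

# Print rather than execute, so the command can be reviewed before running it.
echo "$cmd"
```

Inside the container built from the reference Dockerfile, the shorter form shown earlier should be sufficient.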

If you don't run your container in privileged mode, you may see the following message:

[a8c9914754d2:00040] Read -1, expected 131072, errno = 1

You can ignore this message.

Running on multiple machines

Here we describe a simple example that uses a shared filesystem, /mnt/share, and a common port number, 12345, for the SSH daemon that runs in every container. /mnt/share/ssh contains a typical id_rsa and authorized_keys pair that allows passwordless authentication.

Note: These are not hard requirements, but they make the example more concise. A shared filesystem can be replaced by rsyncing the SSH configuration and code across machines, and a common SSH port can be replaced by machine-specific ports defined in the /root/.ssh/config file.
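As a rough sketch of the machine-specific-port alternative, the script below writes an OpenSSH client configuration that gives each secondary host its own port. OpenSSH reads per-user client settings from a file named config in the user's .ssh directory. The ports 12346 to 12348, the SSH_DIR variable, and the scratch output directory are assumptions made for illustration; the host names follow the example in this section.

```shell
#!/bin/sh
# Sketch: generate an OpenSSH client config with one port per host.
# Ports and the output location are illustrative assumptions.
set -eu

# Defaults to a scratch directory; set SSH_DIR to the mounted .ssh directory to use it for real.
ssh_dir="${SSH_DIR:-$(mktemp -d)}"
config="$ssh_dir/config"

: > "$config"   # start from an empty file

# Each input line is "<host> <port>".
while read -r host port; do
  printf 'Host %s\n    Port %s\n\n' "$host" "$port" >> "$config"
done <<EOF
host2 12346
host3 12347
host4 12348
EOF

chmod 600 "$config"   # keep the file private to its owner
cat "$config"
```

With a file like this in place, each container's sshd would need to listen on its assigned port, and the `-mca plm_rsh_args "-p 12345"` option should no longer be necessary because ssh selects the port per host.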

Primary worker:

host1$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest
root@c278c88dd552:/examples# mpirun -np 16 -H host1:4,host2:4,host3:4,host4:4 \
    -mca plm_rsh_args "-p 12345" python keras_mnist_advanced.py
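The `-np` value in the command above is the sum of the per-host process counts passed to `-H` (here 4 hosts with 4 processes each, typically one process per GPU). A small sketch, assuming every host runs the same number of processes, shows how both arguments could be derived from a host list. The variable names are illustrative, and the script prints the command instead of running it.

```shell
#!/bin/sh
# Sketch: derive mpirun's -np and -H arguments from a host list and a per-host process count.
set -eu

hosts="host1 host2 host3 host4"   # host names from the example above
slots=4                           # processes per host (assumed equal on every host)

host_list=""
np=0
for h in $hosts; do
  # Append "host:slots", inserting a comma before every entry except the first.
  host_list="${host_list:+$host_list,}$h:$slots"
  np=$((np + slots))
done

# Print the resulting command so it can be inspected before use.
echo "mpirun -np $np -H $host_list -mca plm_rsh_args \"-p 12345\" python keras_mnist_advanced.py"
```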

Secondary workers:

host2$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest \
    bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
host3$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest \
    bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
host4$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest \
    bash -c "/usr/sbin/sshd -p 12345; sleep infinity"