Multinode Training
Authors: Suraj Subramanian
What you will learn

- Launching multinode training jobs with ``torchrun``
- Code changes (and things to keep in mind) when moving from single-node to multinode training

View the code used in this tutorial on `GitHub <https://github.com/pytorch/examples/blob/main/distributed/ddp-tutorial-series/multinode.py>`__

Prerequisites

- Familiarity with `multi-GPU training <../beginner/ddp_series_multigpu.html>`__ and `torchrun <../beginner/ddp_series_fault_tolerance.html>`__
- 2 or more TCP-reachable GPU machines (this tutorial uses AWS p3.2xlarge instances)
- PyTorch `installed <https://pytorch.org/get-started/locally/>`__ with CUDA on all machines
Follow along with the video below or on YouTube.
Multinode training involves deploying a training job across several machines. There are two ways to do this:
- running a ``torchrun`` command on each machine with identical rendezvous arguments, or
- deploying it on a compute cluster using a workload manager (like SLURM)
In this video we will go over the (minimal) code changes required to move from single-node multi-GPU to multinode training, and run our training script in both of the above ways.
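For the first approach, the command run on each node might look like the following sketch; the rendezvous id, the endpoint address, and the port below are placeholder values you would replace with your own (any free port on a node reachable by all the others works).

.. code-block:: bash

   # Run this on every participating node, with identical rendezvous arguments.
   # --nnodes: total number of machines; --nproc_per_node: GPUs used on this machine.
   torchrun \
       --nnodes=2 \
       --nproc_per_node=4 \
       --rdzv_id=456 \
       --rdzv_backend=c10d \
       --rdzv_endpoint=<node0-ip-addr>:29500 \
       multinode.py  # followed by your training script's own arguments

The rendezvous arguments are how the ``torchrun`` instances on different machines find each other: every node passes the same ``--rdzv_id`` and ``--rdzv_endpoint``, and the ``c10d`` backend needs nothing beyond one host:port that all nodes can reach.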
Note that multinode training is bottlenecked by inter-node communication latencies. Running a training job on 4 GPUs on a single node will be faster than running it on 4 nodes with 1 GPU each.
In single-node settings, we were tracking the ``gpu_id`` of each device running our training process. ``torchrun`` tracks this value in an environment variable ``LOCAL_RANK``, which uniquely identifies each GPU process on a node. For a unique identifier across all the nodes, ``torchrun`` provides another variable, ``RANK``, which refers to the global rank of a process.
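As a minimal sketch (assuming the NCCL backend and one CUDA device per process; not necessarily identical to the setup code in the example repository), the process-group initialization can read these variables directly:

.. code-block:: python

   import os
   import torch
   from torch.distributed import init_process_group

   def ddp_setup():
       # torchrun sets these environment variables for every process it launches
       local_rank = int(os.environ["LOCAL_RANK"])  # GPU index on this node
       global_rank = int(os.environ["RANK"])       # unique rank across all nodes

       torch.cuda.set_device(local_rank)
       # MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE are also set by torchrun,
       # so init_process_group needs no explicit arguments besides the backend.
       init_process_group(backend="nccl")
       return local_rank, global_rank

Here ``LOCAL_RANK`` drives device placement, while ``RANK`` is useful for global bookkeeping such as logging.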
Warning

Do not use ``RANK`` for critical logic in your training job. When ``torchrun`` restarts processes after a failure or a membership change, there is no guarantee that the processes will hold the same ``LOCAL_RANK`` and ``RANK``.
``torchrun`` supports heterogeneous scaling, i.e., each of your multinode machines can have a different number of GPUs participating in the training job. In the video, I deployed the code on 2 machines, where one machine had 4 GPUs and the other used only 2.
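For example (again a sketch with a placeholder endpoint address), the two machines would run the same launch command except for ``--nproc_per_node``:

.. code-block:: bash

   # On the machine with 4 GPUs
   torchrun --nnodes=2 --nproc_per_node=4 --rdzv_id=456 --rdzv_backend=c10d \
       --rdzv_endpoint=<node0-ip-addr>:29500 multinode.py

   # On the machine with 2 GPUs
   torchrun --nnodes=2 --nproc_per_node=2 --rdzv_id=456 --rdzv_backend=c10d \
       --rdzv_endpoint=<node0-ip-addr>:29500 multinode.py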
Troubleshooting

- Ensure that your nodes are able to communicate with each other over TCP.
- Set the environment variable ``NCCL_DEBUG`` to ``INFO`` (using ``export NCCL_DEBUG=INFO``) to print verbose logs that can help diagnose issues.
- Sometimes you might need to explicitly set the network interface for the distributed backend (``export NCCL_SOCKET_IFNAME=eth0``); see the combined sketch after this list. Read more about this in the ``torch.distributed`` documentation.
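As a sketch of the last two points (the interface name ``eth0`` is machine-specific; check ``ip addr`` or your cloud provider's documentation for yours), these variables are exported on each node before launching ``torchrun``:

.. code-block:: bash

   # Export on every node, in the same shell that will launch torchrun
   export NCCL_DEBUG=INFO           # verbose NCCL logs for diagnosing communication issues
   export NCCL_SOCKET_IFNAME=eth0   # pin NCCL to a specific network interface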
Further Reading

- Training a GPT model with DDP (next tutorial in this series)
- Fault-tolerant distributed training (previous tutorial in this series)
- torchrun
- Rendezvous arguments
- Setting up a cluster on AWS
- Slurm docs