This example is mainly based on the official tutorial, rewritten to be more readable.
- Install PyTorch 1.0 (pytorch.org).
- A machine with multiple GPUs, or a cluster with multiple nodes, is required for distributed training.
Note: for single-node training, --gpu_use should be equal to --world-size.
python main.py --dist-url 'tcp://127.0.0.1:FREEPORT' --multiprocessing-distributed --rank_start 0 --world-size 2 --gpu_use 2
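Under the hood, --multiprocessing-distributed follows the pattern recommended by the official example: one process is spawned per GPU. The sketch below illustrates that pattern only; the worker body and the attribute names (gpu_use, rank_start, dist_url, world_size, mirroring the flags above) are assumptions, not the actual code in main.py.

```python
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(gpu, args):
    # One process per GPU: the global rank is rank_start plus the local GPU index.
    rank = args.rank_start + gpu
    dist.init_process_group(backend='nccl', init_method=args.dist_url,
                            world_size=args.world_size, rank=rank)
    # ... build the model, wrap it in DistributedDataParallel, train ...


def launch(args):
    # On a single node every GPU lives on this machine, so gpu_use == world_size.
    assert args.gpu_use == args.world_size
    mp.spawn(worker, nprocs=args.gpu_use, args=(args,))
```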
Take a cluster with 2 nodes (each with 2 GPUs) as an example: --rank_start should be 0 on the first node and 0 + gpu_use (i.e. 2) on the second node.
--world-size should be the total number of GPUs across all nodes, since we follow the official recommendation of starting one process per GPU (a short sketch of this arithmetic follows the commands below).
Node 0:
python main.py --dist-url 'tcp://ip:FREEPORT' --node 0 --multiprocessing-distributed --rank_start 0 --world-size 4 --gpu_use 2
Node 1:
python main.py --dist-url 'tcp://ip:FREEPORT' --node 1 --multiprocessing-distributed --rank_start 2 --world-size 4 --gpu_use 2
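To double-check the flag values in the two commands above, here is a minimal sketch of the arithmetic that produces them; the helper name launch_args and its parameters are purely illustrative and not part of main.py.

```python
def launch_args(node, gpus_per_node=2, num_nodes=2):
    # One process per GPU, so world-size is the total GPU count across nodes.
    world_size = num_nodes * gpus_per_node  # 2 * 2 = 4
    # Each node's rank_start is the number of GPUs on all preceding nodes.
    rank_start = node * gpus_per_node       # node 0 -> 0, node 1 -> 2
    return rank_start, world_size


assert launch_args(0) == (0, 4)  # matches the Node 0 command
assert launch_args(1) == (2, 4)  # matches the Node 1 command
```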