
How to run multi-GPU and multi-node PyTorch scripts on an HPC cluster with the PBS job scheduler

Initial Setup

Install Miniconda and (optional but recommended) set the libmamba solver.

Scripts for automatic installation and setup:
./initial_setup/install_miniconda.sh -d <installation_directory>
./initial_setup/libmamba_solver.sh

If not already installed on the cluster, you can install useful tools such as tmux, git, and htop directly in the conda base environment:
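For example, all three packages are available on the conda-forge channel:

```shell
# Install the tools into the base environment from conda-forge.
conda install -n base -c conda-forge tmux git htop
```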

To conclude the setup, create your conda environment with PyTorch and CUDA support:

conda env create -f environment.yml

Note: adjust the CUDA version as needed.

Single Node - Multi GPU

We use torchrun:

A PyTorch code example using DistributedDataParallel (DDP) is in ./python_scripts/single_node_multi_gpu.py
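Assuming that example script, launching it on a single node with 4 GPUs boils down to a torchrun invocation like the following (a sketch; adapt the path and GPU count to your setup):

```shell
# --standalone: single-node run, torchrun handles the rendezvous itself;
# --nproc_per_node: one process per GPU on this node.
torchrun --standalone --nproc_per_node=4 ./python_scripts/single_node_multi_gpu.py
```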

A simple PBS script requesting 4 GPUs is in ./pbs_scripts/basic_qsub.sh
If you need to run the same script multiple times with different parameters (e.g. for ablation studies), see ./pbs_scripts/qsub_ablations.sh
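For orientation, a minimal sketch of what such a PBS job script might contain; the queue name, resource string, and environment name are assumptions, so check your cluster's documentation and the actual scripts above:

```shell
#!/bin/bash
#PBS -N ddp_single_node
#PBS -l select=1:ncpus=8:ngpus=4   # one node, 8 CPU cores, 4 GPUs (cluster-specific syntax)
#PBS -l walltime=02:00:00
#PBS -q gpu                        # hypothetical queue name

# PBS starts the job in $HOME; move to the submission directory.
cd "$PBS_O_WORKDIR"

# Activate the conda environment created earlier (name is an assumption).
source "$HOME/miniconda3/etc/profile.d/conda.sh"
conda activate myenv

torchrun --standalone --nproc_per_node=4 ./python_scripts/single_node_multi_gpu.py
```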

See the PBS documentation for qsub for the full list of submission options.

Multi Node - Multi GPU

Multi-node training requires communication between the nodes. In practice, we must set a few environment variables:

  1. MASTER_ADDR: address of the master node.
  2. MASTER_PORT: free communication port on the master node.
  3. WORLD_SIZE: total number of processes used (usually num_gpu * num_nodes).
  4. NODE_RANK: rank of the node, different for each node (the master node is usually 0).
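The arithmetic behind these variables can be sketched as follows (the node and GPU counts are illustrative):

```shell
# Illustrative job: 2 nodes with 4 GPUs each.
num_nodes=2
gpus_per_node=4
world_size=$((num_nodes * gpus_per_node))          # WORLD_SIZE = 8

# The global rank of a process is NODE_RANK * gpus_per_node + local GPU index.
node_rank=1
local_rank=2
rank=$((node_rank * gpus_per_node + local_rank))   # rank 6 of 0..7

echo "WORLD_SIZE=$world_size RANK=$rank"
```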

Suggested Reading:

The same setup can be achieved in two ways:

  1. torchrun (similar to single node, multi gpu)
  2. openmpi: https://www.open-mpi.org/

All the environment variables and setup are managed in the scripts:

  • ./pbs_scripts/multinode_torchrun.sh
  • ./pbs_scripts/multinode_mpirun.sh

The former needs to ssh into each node and execute torchrun there.
The latter loads the openmpi library with module load: https://modules.readthedocs.io/en/latest/.
Check both scripts and adapt as needed.
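As a rough sketch, the torchrun variant boils down to executing a command like this once on every node; the node count and training-script name are illustrative, and the actual variable plumbing lives in ./pbs_scripts/multinode_torchrun.sh:

```shell
# Run once per node; NODE_RANK differs on each node (0 on the master),
# while MASTER_ADDR/MASTER_PORT are identical everywhere.
torchrun \
  --nnodes=2 \
  --nproc_per_node=4 \
  --node_rank="$NODE_RANK" \
  --master_addr="$MASTER_ADDR" \
  --master_port="$MASTER_PORT" \
  ./python_scripts/your_training_script.py   # hypothetical script name
```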

Both scripts can be run by adapting ./pbs_scripts/qsub_ablations_multinode.sh

Useful PBS Commands to monitor Queues and Jobs

  • qstat -wan1: monitor all jobs (in all states) and their nodes.
  • qstat -wan1 -u $user: monitor launched jobs and requested resources by $user.
  • qstat -wrn1: monitor all running jobs.
  • qstat -wan1 | grep Q: filter only queued jobs.
  • qstat -q: overview on all queues.
  • qstat -fQ: see details of queues.
  • qstat -u $user: see all jobs submitted by $user.
  • qstat -f $jobid: see details of specific job.
  • qstat -u $user | grep "$user" | cut -d"." -f1 | xargs qdel: kill all jobs of $user.
  • qstat -u $user | grep "R" | cut -d"." -f1 | xargs qdel: kill all the running jobs of $user.
  • qstat -u $user | grep "Q" | cut -d"." -f1 | xargs qdel: kill all the queued jobs of $user.
  • pbsnodes -aSj: see all nodes on the cluster, the jobs running on each and free resources.
  • pbsnodes -aSj | head -n 1 && pbsnodes -aSj | grep anode: filter only on anodes.
  • pbsnodes -aSj | head -n 1 && pbsnodes -aSj | grep gnode | grep free && pbsnodes -aSj | grep gnode | grep various: see all free or various gnodes.