
How to run multi-GPU and multi-node PyTorch scripts on an HPC cluster with the PBS job scheduler

Initial Setup

Install Miniconda and (optional but recommended) set the libmamba solver.

Scripts for automatic installation and setup:
./initial_setup/install_miniconda.sh -d <installation_directory>
./initial_setup/libmamba_solver.sh

If not already installed on the cluster, you can install useful tools such as tmux, git, and htop directly in the conda base environment:
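For example, all three packages are available on the conda-forge channel:

```shell
# Install the tools into the base environment from conda-forge.
conda install -n base -c conda-forge tmux git htop
```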

To conclude the setup, create your conda environment with PyTorch and CUDA support:

conda env create -f environment.yml

Note: adjust the CUDA version as needed.

Single Node - Multi GPU

We use torchrun:

A PyTorch code example using DistributedDataParallel (DDP) is in ./python_scripts/single_node_multi_gpu.py
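Assuming that example script, launching it on a single node with 4 GPUs boils down to a torchrun invocation like the following (a sketch; adapt the path and GPU count to your setup):

```shell
# --standalone: single-node run, torchrun handles the rendezvous itself;
# --nproc_per_node: one process per GPU on this node.
torchrun --standalone --nproc_per_node=4 ./python_scripts/single_node_multi_gpu.py
```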

A simple PBS script requesting 4 GPUs is in ./pbs_scripts/basic_qsub.sh
If you need to run the same script multiple times with different parameters (e.g. for ablation studies), see ./pbs_scripts/qsub_ablations.sh
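For orientation, a minimal sketch of what such a PBS job script might contain; the queue name, resource string, and environment name are assumptions, so check your cluster's documentation and the actual scripts above:

```shell
#!/bin/bash
#PBS -N ddp_single_node
#PBS -l select=1:ncpus=8:ngpus=4   # one node, 8 CPU cores, 4 GPUs (cluster-specific syntax)
#PBS -l walltime=02:00:00
#PBS -q gpu                        # hypothetical queue name

# PBS starts the job in $HOME; move to the submission directory.
cd "$PBS_O_WORKDIR"

# Activate the conda environment created earlier (name is an assumption).
source "$HOME/miniconda3/etc/profile.d/conda.sh"
conda activate myenv

torchrun --standalone --nproc_per_node=4 ./python_scripts/single_node_multi_gpu.py
```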

See the PBS documentation for qsub for the full list of submission options.

Multi Node - Multi GPU

Multi-node training requires communication between the nodes. In practice, we must set a few environment variables:

  1. MASTER_ADDR: address of the master node.
  2. MASTER_PORT: free communication port on the master node.
  3. WORLD_SIZE: total number of processes used (usually num_gpu * num_nodes).
  4. NODE_RANK: rank of the node, different for each node (the master node is usually 0).
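The arithmetic behind these variables can be sketched as follows (the node and GPU counts are illustrative):

```shell
# Illustrative job: 2 nodes with 4 GPUs each.
num_nodes=2
gpus_per_node=4
world_size=$((num_nodes * gpus_per_node))          # WORLD_SIZE = 8

# The global rank of a process is NODE_RANK * gpus_per_node + local GPU index.
node_rank=1
local_rank=2
rank=$((node_rank * gpus_per_node + local_rank))   # rank 6 of 0..7

echo "WORLD_SIZE=$world_size RANK=$rank"
```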

Suggested Reading:

The same setup can be achieved in two ways:

  1. torchrun (similar to single node, multi gpu)
  2. openmpi: https://www.open-mpi.org/

All the environment variables and setup are managed in the scripts:

  • ./pbs_scripts/multinode_torchrun.sh
  • ./pbs_scripts/multinode_mpirun.sh

The former needs to ssh into each node and execute torchrun there.
The latter loads the openmpi library with module load: https://modules.readthedocs.io/en/latest/.
Check both scripts and adapt as needed.
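As a rough sketch, the torchrun variant boils down to executing a command like this once on every node; the node count and training-script name are illustrative, and the actual variable plumbing lives in ./pbs_scripts/multinode_torchrun.sh:

```shell
# Run once per node; NODE_RANK differs on each node (0 on the master),
# while MASTER_ADDR/MASTER_PORT are identical everywhere.
torchrun \
  --nnodes=2 \
  --nproc_per_node=4 \
  --node_rank="$NODE_RANK" \
  --master_addr="$MASTER_ADDR" \
  --master_port="$MASTER_PORT" \
  ./python_scripts/your_training_script.py   # hypothetical script name
```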

Both scripts can be run by adapting ./pbs_scripts/qsub_ablations_multinode.sh

Useful PBS Commands to monitor Queues and Jobs

  • qstat -wan1: monitor all jobs (in all states) and their nodes.
  • qstat -wan1 -u $user: monitor launched jobs and requested resources by $user.
  • qstat -wrn1: monitor all running jobs.
  • qstat -wan1 | grep Q: filter only queued jobs.
  • qstat -q: overview on all queues.
  • qstat -fQ: see details of queues.
  • qstat -u $user: see all jobs submitted by $user.
  • qstat -f $jobid: see details of specific job.
  • qstat -u $user | grep "$user" | cut -d"." -f1 | xargs qdel: kill all jobs of $user.
  • qstat -u $user | grep "R" | cut -d"." -f1 | xargs qdel: kill all the running jobs of $user.
  • qstat -u $user | grep "Q" | cut -d"." -f1 | xargs qdel: kill all the queued jobs of $user.
  • pbsnodes -aSj: see all nodes on the cluster, the jobs running on each and free resources.
  • pbsnodes -aSj | head -n 1 && pbsnodes -aSj | grep anode: filter only on anodes.
  • pbsnodes -aSj | head -n 1 && pbsnodes -aSj | grep gnode | grep free && pbsnodes -aSj | grep gnode | grep various: see all free or various gnodes.