Hey, thanks for your awesome project! I want to run some multi-node training with the following setup:
```bash
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8

# Get the list of allocated node names and pick the first as the head node
nodes_array=($(scontrol show hostnames "$SLURM_JOB_NODELIST"))
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

# Set environment variables for distributed training
export MASTER_ADDR=$head_node
export MASTER_PORT=29501
export WORLD_SIZE=$(($SLURM_NNODES * $SLURM_NTASKS_PER_NODE))
export RANK=$SLURM_PROCID
export LOCAL_RANK=$SLURM_LOCALID

echo "Node IP: $head_node_ip"
export LOGLEVEL=INFO

srun torchrun \
    --nnodes=$SLURM_NNODES \
    --nproc_per_node=$SLURM_NTASKS_PER_NODE \
    --rdzv_id=$RANDOM \
    --rdzv_backend=c10d \
    --rdzv_conf=timeout=9000 \
    --rdzv_endpoint=$head_node_ip:$MASTER_PORT \
    scripts/pretrain.py
```
....
I'm running into issues like:

```
Duplicate GPU detected : rank 2 and rank 10 both on CUDA device 50000
```

Could you share the setup for multi-node training that works for you?
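In case it helps pin this down: with `--ntasks-per-node=8`, `srun` launches `torchrun` eight times per node, and each copy then tries to spawn its own set of per-GPU workers, so multiple ranks end up claiming the same CUDA device. A sketch of the workaround I'd try (an assumption on my part, not a confirmed recipe for this repo) is to launch one task per node and let torchrun fork the eight workers itself:

```bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1   # one srun task (= one torchrun) per node
#SBATCH --gpus-per-node=8

srun torchrun \
    --nnodes=$SLURM_NNODES \
    --nproc_per_node=8 \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$head_node_ip:$MASTER_PORT \
    scripts/pretrain.py
```

With this layout `WORLD_SIZE` is still 16 (2 nodes × 8 workers), but the rank/local-rank assignment is handled by torchrun rather than by SLURM, so `RANK`/`LOCAL_RANK` exports from the batch script shouldn't be needed.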