Multinode training #39

Open
budzianowski opened this issue Jun 17, 2024 · 0 comments

Hey, thanks for your awesome project! I want to run some multi-node training with the following setup:

#!/bin/bash
#SBATCH --job-name=test
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8

# Get the list of node names
nodes_array=($(scontrol show hostnames $SLURM_JOB_NODELIST))
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

# Set environment variables for distributed training
MASTER_ADDR=$(scontrol show hostname $SLURM_NODELIST | head -n 1)
MASTER_PORT=29501
WORLD_SIZE=$(($SLURM_NNODES * $SLURM_NTASKS_PER_NODE))
RANK=$SLURM_PROCID
LOCAL_RANK=$SLURM_LOCALID

export MASTER_ADDR
export MASTER_PORT
export WORLD_SIZE
export RANK
export LOCAL_RANK


echo "Node IP: $head_node_ip"
export LOGLEVEL=INFO

srun torchrun \
    --nnodes=$SLURM_NNODES \
    --nproc_per_node=$SLURM_NTASKS_PER_NODE \
    --rdzv_id=$RANDOM \
    --rdzv_backend=c10d \
    --rdzv_conf=timeout=9000 \
    --rdzv_endpoint=$head_node_ip:$MASTER_PORT \
     scripts/pretrain.py
     ....
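
As far as I can tell, srun launches $SLURM_NTASKS tasks (here 16) for that final step, so every task runs its own copy of torchrun, which in turn spawns --nproc_per_node workers. A quick diagnostic I used to check the task layout inside the allocation (not part of the training job itself):

# Diagnostic sketch: with --nodes=2 and --ntasks-per-node=8 this prints 16 lines,
# one per srun task; each such task would start its own torchrun above.
srun bash -c 'echo "$(hostname) SLURM_PROCID=$SLURM_PROCID SLURM_LOCALID=$SLURM_LOCALID"'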

I'm running into issues like:
Duplicate GPU detected : rank 2 and rank 10 both on CUDA device 50000
Could you share the setup for multinode training that works for you?
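
In case it's useful, this is the one-launcher-per-node variant I was planning to try next, following the usual srun + torchrun pattern where torchrun (not srun) creates the per-GPU workers. It's only a sketch and untested on my side; the port and the pretrain.py arguments are placeholders carried over from the script above:

#!/bin/bash
#SBATCH --job-name=test
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1      # one torchrun launcher per node
#SBATCH --gpus-per-node=8

# Pick the first node as the rendezvous host
nodes_array=($(scontrol show hostnames $SLURM_JOB_NODELIST))
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

export LOGLEVEL=INFO

# One srun task per node; torchrun spawns the 8 per-GPU worker processes itself
# and sets RANK / LOCAL_RANK / WORLD_SIZE for each of them.
srun torchrun \
    --nnodes=$SLURM_NNODES \
    --nproc_per_node=8 \
    --rdzv_id=$SLURM_JOB_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$head_node_ip:29501 \
    scripts/pretrain.py ...

Would something like this be the intended setup, or do you launch it differently?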
