
Multi-gpu prompt tuning hanging when running in kube cluster #271

Open
gkumbhat opened this issue Nov 20, 2023 · 2 comments

@gkumbhat (Collaborator)

Description

We are using the torch distributed elastic launch method to kick off training in a multi-GPU, single-node environment. It works fine when running locally, i.e. on a machine that has multiple GPUs available, and it also works fine on a single GPU, but it hangs in the kube cluster when we provide the WORLD_SIZE, MASTER_ADDR, and MASTER_PORT parameters. There seems to be an issue with the master address/port configuration: the process tries to connect but keeps waiting.

Run command:

ALLOW_DOWNLOADS=true  WORLD_SIZE=2 RANK=0 MASTER_ADDR=localhost MASTER_PORT=25590  python3 run_peft_tuning.py PROMPT_TUNING --dataset "glue/rte"  --model_name google/flan-t5-xl --num_epochs 1 --verbose --prompt_tuning_init TEXT  --output_dir prompt_prefixes/flan_t5_xl_1_epoch_rte_16_batch_1_acc_hf_trainer --learning_rate 0.3 --batch_size=16 --accumulate_steps 1 --max_target_length 512 --max_source_length 2048 --torch_dtype bfloat16

Relevant code to launch the training:
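(The actual launcher lives in caikit; the sketch below only illustrates the elastic-launch pattern we follow, with placeholder names, and is not the real code.)

```python
# Illustrative sketch of the elastic launch pattern described above -- not the
# actual caikit launcher code. train_worker and run_id are placeholders.
import os

import torch
import torch.distributed as dist
from torch.distributed.launcher.api import LaunchConfig, elastic_launch


def train_worker():
    # The elastic agent sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and
    # MASTER_PORT for each spawned worker; env:// initialization picks them up.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    # ... build the model/optimizer and run the prompt-tuning loop here ...
    dist.destroy_process_group()


if __name__ == "__main__":
    config = LaunchConfig(
        min_nodes=1,
        max_nodes=1,
        nproc_per_node=2,  # one worker process per GPU on the single node
        run_id="prompt_tuning",
        rdzv_backend="c10d",
        rdzv_endpoint=f"{os.environ.get('MASTER_ADDR', 'localhost')}:"
                      f"{os.environ.get('MASTER_PORT', '29500')}",
    )
    # Spawns nproc_per_node copies of train_worker and waits for all of them.
    elastic_launch(config, train_worker)()
```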

@MEllis-github

@gkumbhat Are you running a second instance of that command anywhere else, or what is the rationale for setting WORLD_SIZE to 2? At a cursory glance, the first process could be waiting for the second, which was never started.
Also, from the links provided, it appears caikit is using torch multiprocessing...
Are WORLD_SIZE, RANK, MASTER_ADDR, or MASTER_PORT set in the environment prior to running this command, and if so, what are their values?
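For reference, env:// initialization with WORLD_SIZE=2 blocks until both ranks have joined the rendezvous, which would look exactly like a hang (generic snippet, not caikit's code):

```python
# Generic illustration (not caikit code): with WORLD_SIZE=2 and only RANK=0
# launched, this call blocks at the TCP rendezvous on MASTER_ADDR:MASTER_PORT
# until a second process joins as RANK=1.
import os

import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    init_method="env://",  # reads MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE
    rank=int(os.environ["RANK"]),
    world_size=int(os.environ["WORLD_SIZE"]),
)
```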

@gkumbhat (Collaborator, Author)

@MEllis-github I was setting WORLD_SIZE to 2 to allow the 2 GPUs to be used by separate processes; that should still work, right? 🤔

I am setting these as environment variables. I tried a couple of values for MASTER_ADDR:

  1. localhost - since everything is running within the same pod; this worked on local machines with 2 GPUs
  2. <hostname for the pod> - thinking that the connection might be happening at a lower-level CUDA process, which might recognize the pod by its hostname

For MASTER_PORT, I tried several different values, in case there was a port conflict.
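As a quick sanity check on the address/port combination, something like this generic TCPStore snippet (not our actual code) can be run inside the pod:

```python
# Hypothetical sanity check, not part of caikit: verify that the rendezvous
# address/port is bindable inside the pod using torch's c10d TCPStore.
import datetime

from torch.distributed import TCPStore

store = TCPStore(
    "localhost",             # or the pod hostname being tried for MASTER_ADDR
    25590,                   # the value used for MASTER_PORT
    world_size=2,
    is_master=True,          # rank 0 binds the port; other ranks would connect with is_master=False
    wait_for_workers=False,  # do not block waiting for rank 1 to join
    timeout=datetime.timedelta(seconds=30),
)
print("store created:", store)
```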
