Description

We are using the torch distributed elastic launch method to kick off training in a multi-GPU, single-node environment. It works fine when running locally, i.e. on a machine with multiple GPUs available, and it also works on a single GPU, but it hangs when we provide the WORLD_SIZE, MASTER_ADDR, and MASTER_PORT parameters. There seems to be some issue with the master address/port configuration, where the process tries to connect but keeps waiting.

Run command:

Relevant code to launch the training:
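For reference, a single-node elastic launch of this kind is commonly wired up through PyTorch's torch.distributed.launcher.api.elastic_launch. The sketch below is purely illustrative; the entrypoint, worker count, and rendezvous endpoint are placeholder assumptions, not the actual code from this issue:

```python
# Minimal single-node elastic launch sketch (placeholder values throughout).
from torch.distributed.launcher.api import LaunchConfig, elastic_launch

def train():
    # Placeholder for the real training entrypoint.
    ...

config = LaunchConfig(
    min_nodes=1,
    max_nodes=1,
    nproc_per_node=2,                 # one worker per GPU on the node
    rdzv_backend="c10d",              # built-in rendezvous, no etcd required
    rdzv_endpoint="localhost:29500",  # placeholder host:port
    run_id="example-run",
)

if __name__ == "__main__":
    # Spawns nproc_per_node worker processes and blocks until they finish.
    elastic_launch(config, train)()
```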
@gkumbhat Are you running a second instance of that command anywhere else, or what is the rationale for setting WORLD_SIZE to 2? From a cursory glance, the first process could be waiting for a second one that was never started.
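To illustrate the blocking behavior (a minimal sketch with hypothetical values): with the env:// init method, init_process_group does not return until all WORLD_SIZE ranks have joined the rendezvous at MASTER_ADDR:MASTER_PORT, so a single process launched with WORLD_SIZE=2 just sits and waits for a rank-1 peer:

```python
import os
import torch.distributed as dist

# Hypothetical values mirroring the report: WORLD_SIZE=2 but only one process started.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("WORLD_SIZE", "2")
os.environ.setdefault("RANK", "0")

# Hangs here: the env:// store waits for all WORLD_SIZE ranks to connect
# (until the rendezvous timeout), and no second process ever joins.
dist.init_process_group(backend="gloo", init_method="env://")
```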
Also, from the links provided, it appears caikit is using torch multiprocessing...
Are WORLD_SIZE, RANK, MASTER_ADDR, or MASTER_PORT set in the environment prior to running this command, and if so, what are their values?
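For example, a quick way to dump those values from inside the launching process (a trivial sketch):

```python
import os

for key in ("WORLD_SIZE", "RANK", "MASTER_ADDR", "MASTER_PORT"):
    print(f"{key}={os.environ.get(key)}")
```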