
stuck on torch.distributed.barrier() #195

Open
bibibabibo26 opened this issue May 6, 2024 · 0 comments
The message is:

cd /amax/yt26/VCM/LLaMA2-Accessory ; /amax/yt26/.conda/envs/accessory/bin/python \
  /amax/yt26/.vscode-server/extensions/ms-python.debugpy-2024.6.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher 58291 -- \
  /amax/yt26/.conda/envs/accessory/bin/torchrun --master_port 1112 --nproc_per_node 2 \
  /amax/yt26/VCM/LLaMA2-Accessory/accessory/main_finetune.py \
  --output_dir output_dir/finetune/mm/alpacaLlava_llamaQformerv2_7B \
  --epochs 3 --warmup_epochs 0.2 --batch_size 4 --accum_iter 2 --num_workers 16 \
  --max_words 512 --lr 0.00003 --min_lr 0.000005 --clip_grad 2 --weight_decay 0.02 \
  --data_parallel fsdp --model_parallel_size 2 --checkpointing \
  --llama_type llama_qformerv2_peft \
  --llama_config checkpoint/mm/alpacaLlava_llamaQformerv2/7B_params.json accessory/configs/model/finetune/sg/llamaPeft_normBiasLora.json \
  --tokenizer_path checkpoint/mm/alpacaLlava_llamaQformerv2/tokenizer.model \
  --pretrained_path checkpoint/mm/alpacaLlava_llamaQformerv2 --pretrained_type consolidated \
  --data_config accessory/configs/data/finetune/mm/alpaca_llava_copy.yaml
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
/amax/yt26/VCM/LLaMA2-Accessory/accessory/main_finetune.py:41: UserWarning: cannot import FusedAdam from apex, use torch AdamW instead
warnings.warn("cannot import FusedAdam from apex, use torch AdamW instead")
/amax/yt26/VCM/LLaMA2-Accessory/accessory/main_finetune.py:41: UserWarning: cannot import FusedAdam from apex, use torch AdamW instead
warnings.warn("cannot import FusedAdam from apex, use torch AdamW instead")
| distributed init (rank 0): env://, gpu 0
| distributed init (rank 1): env://, gpu 1

The program gets stuck at this point. When I run it under the debugger, it hangs at misc.py line 145, on torch.distributed.barrier(). How can I deal with that?
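As a sanity check, the barrier can be isolated from LLaMA2-Accessory with a minimal standalone script run on the same two GPUs. This is a sketch assuming the NCCL backend and a torchrun launch; the filename barrier_check.py is arbitrary and not part of the repo:

# barrier_check.py -- minimal torch.distributed sanity check
# (hypothetical standalone script, not part of LLaMA2-Accessory;
#  run with: torchrun --nproc_per_node 2 barrier_check.py)
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE and MASTER_ADDR/MASTER_PORT
    dist.init_process_group(backend="nccl", init_method="env://")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)  # bind each rank to its own GPU
    print(f"[rank {dist.get_rank()}] init ok, entering barrier", flush=True)
    dist.barrier()  # will also hang here if inter-GPU communication is broken
    print(f"[rank {dist.get_rank()}] passed barrier", flush=True)
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

If this script hangs too, the problem is likely in the NCCL/communication setup rather than in the training code; rerunning with NCCL_DEBUG=INFO (and, on some systems, NCCL_P2P_DISABLE=1) usually surfaces more detail.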
