
Error when parsing GPUs on a node when only specifying node name --include=node3 vs --include=node3:1,2,4 #6671

stephen-nju opened this issue Oct 26, 2024 · 5 comments · May be fixed by #6698

@stephen-nju
When using --include=node3, deepspeed raises a parsing error, but --include=node3:1,2,3,4,5,6,7,8 works.
I checked the runner.py code: when SLOT_LIST_START is not in the config, the devices are set to [], but the help text for --include says "If :SLOT is omitted, include all slots on that host".
(screenshot of the parsing error)
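To illustrate the expected behavior, here is a minimal, hypothetical re-implementation of the include-string parsing (this is not the actual runner.py code; `parse_include` and `resource_pool` are names invented for this sketch, and it assumes `@` separates hosts as in DeepSpeed's documented include syntax). When `:SLOT` is omitted, the host should receive all of its slots; the buggy path instead leaves the device list empty, which later triggers the `IndexError` at `first_host = list(active_resources.keys())[0]`.

```python
def parse_include(include_str, resource_pool):
    """Sketch of include-filter parsing.

    resource_pool maps hostname -> number of slots (GPUs) on that host.
    Returns a dict mapping hostname -> list of included slot indices.
    """
    active = {}
    for entry in include_str.split("@"):  # '@' separates hosts
        if ":" in entry:
            # Explicit slot list, e.g. "node3:1,2,4"
            host, slots = entry.split(":")
            active[host] = [int(s) for s in slots.split(",")]
        else:
            # ":SLOT" omitted -> include all slots on the host.
            # (The reported bug produces an empty device list here instead.)
            active[entry] = list(range(resource_pool[entry]))
    return active

pool = {"node3": 8}
print(parse_include("node3", pool))        # all 8 slots
print(parse_include("node3:1,2,4", pool))  # explicit subset
```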

@stephen-nju
Author

(screenshot of the relevant deepspeed code)

@loadams loadams self-assigned this Oct 28, 2024
@loadams
Contributor

loadams commented Oct 28, 2024

Hi @stephen-nju - can you please update the title to better reflect the issue? The current title is copied from the CI workflow failure script. I tried to rewrite it, but more information would help. Could you add a sample repro case or more details?

@loadams loadams changed the title {{ env.GITHUB_WORKFLOW }} CI test failure Error when parsing --include= Oct 28, 2024
@stephen-nju
Author

deepspeed version=0.15.3
When the :SLOT part of --include is omitted, deepspeed fails to parse it and raises a "list index out of range" error.

The only difference between the two scripts below is the --include parameter:
##############################
./llmtrain --do_train --stage sft --name=1026_Qwen2.5_14B_fcvdacaul_ep3_lr2e6_bs4 --model_name_or_path /home/jovyan/zhubin/DATA/models/Qwen/Qwen2.5-14B --template qwen --dataset firefly_summary_part,COIG_PC_core_summary_part,vcsum_headlines,dialogsum,alpace_gpt4_zh_retain,csds_dialogue,alimeeting,union_conversations_v4_norm,liantong_conversations_v1 --finetuning_type full --batch_size 2 --gradient_accumulation_steps 4 --cutoff_len 2048 --epochs 2 --lr=2e-5 --save_strategy=steps --save_steps=500 --save_total_limit=10 --eval_dataset union_conversations_v4_dev --eval_strategy=steps --eval_steps=500 --warmup_ratio=0.01 --include node1

optinonal paramas neftune_noise_alpha is null
wandb dir=/home/jovyan/zhubin/saved_checkpoint/1026_Qwen2.5_14B_fcvdacaul_ep3_lr2e6_bs4/logs
W&B offline. Running your script from this directory will only write metadata locally. Use wandb disabled to completely turn off W&B.
[2024-10-29 16:21:41,935] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Traceback (most recent call last):
File "/home/jovyan/conda-env/envs/zb_sw/bin/deepspeed", line 6, in <module>
main()
File "/home/jovyan/conda-env/envs/zb_sw/lib/python3.10/site-packages/deepspeed/launcher/runner.py", line 464, in main
first_host = list(active_resources.keys())[0]
IndexError: list index out of range
#####################################
(zb_sw) ➜ shell git:(zhubin) ✗ ./llmtrain --do_train --stage sft --name=1026_Qwen2.5_14B_fcvdacaul_ep3_lr2e6_bs4 --model_name_or_path /home/jovyan/zhubin/DATA/models/Qwen/Qwen2.5-14B --template qwen --dataset firefly_summary_part,COIG_PC_core_summary_part,vcsum_headlines,dialogsum,alpace_gpt4_zh_retain,csds_dialogue,alimeeting,union_conversations_v4_norm,liantong_conversations_v1 --finetuning_type full --batch_size 2 --gradient_accumulation_steps 4 --cutoff_len 2048 --epochs 2 --lr=2e-5 --save_strategy=steps --save_steps=500 --save_total_limit=10 --eval_dataset union_conversations_v4_dev --eval_strategy=steps --eval_steps=500 --warmup_ratio=0.01 --include node1:1,2,3,4,5

optinonal paramas neftune_noise_alpha is null
wandb dir=/home/jovyan/zhubin/saved_checkpoint/1026_Qwen2.5_14B_fcvdacaul_ep3_lr2e6_bs4/logs
W&B offline. Running your script from this directory will only write metadata locally. Use wandb disabled to completely turn off W&B.
[2024-10-29 16:24:36,968] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-29 16:24:42,162] [INFO] [runner.py:497:main] Using IP address of 10.71.130.130 for node node1
[2024-10-29 16:24:42,162] [INFO] [runner.py:608:main] cmd = /home/jovyan/conda-env/envs/zb_sw/bin/python3.10 -u -m deepspeed.launcher.launch --world_info=eyJub2RlMSI6IFsxLCAyLCAzLCA0LCA1XX0= --master_addr=10.71.130.130 --master_port=56551 --no_local_rank --enable_each_rank_log=None src/train.py --deepspeed /home/jovyan/zhubin/code/LLaMA-Factory//config/deepspeed/zero_stage2_config.json --stage sft --pref_beta 0.1 --pref_loss simpo --simpo_gamma 0.5 --template qwen --do_train true --do_eval false --eval_strategy steps --model_name_or_path /home/jovyan/zhubin/DATA/models/Qwen/Qwen2.5-14B --resize_vocab true --use_fast_tokenizer false --report_to wandb --overwrite_output_dir --overwrite_cache --dataset firefly_summary_part,COIG_PC_core_summary_part,vcsum_headlines,dialogsum,alpace_gpt4_zh_retain,csds_dialogue,alimeeting,union_conversations_v4_norm,liantong_conversations_v1 --cutoff_len 2048 --output_dir /home/jovyan/zhubin/saved_checkpoint/1026_Qwen2.5_14B_fcvdacaul_ep3_lr2e6_bs4 --num_train_epochs 2 --overwrite_cache --finetuning_type full --lora_rank 32 --lora_target all --warmup_ratio 0.01 --logging_steps 5 --lr_scheduler_type cosine --per_device_train_batch_size 2 --per_device_eval_batch_size 2 --gradient_accumulation_steps 4 --preprocessing_num_workers 16 --save_strategy steps --save_steps 500 --save_total_limit 10 --learning_rate 2e-5 --ddp_timeout 180000000 --bf16 true --eval_dataset union_conversations_v4_dev --eval_steps 500
[2024-10-29 16:24:44,682] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-29 16:24:48,272] [INFO] [launch.py:146:main] WORLD INFO DICT: {'node1': [1, 2, 3, 4, 5]}
[2024-10-29 16:24:48,272] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=5, node_rank=0
[2024-10-29 16:24:48,272] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'node1': [0, 1, 2, 3, 4]})
[2024-10-29 16:24:48,272] [INFO] [launch.py:164:main] dist_world_size=5
[2024-10-29 16:24:48,272] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=1,2,3,4,5
[2024-10-29 16:24:48,272] [INFO] [launch.py:256:main] process 2471926 spawned with command: ['/home/jovyan/conda-env/envs/zb_sw/bin/python3.10', '-u', 'src/train.py', '--deepspeed', '/home/jovyan/zhubin/code/LLaMA-Factory//config/deepspeed/zero_stage2_config.json', '--stage', 'sft', '--pref_beta', '0.1', '--pref_loss', 'simpo', '--simpo_gamma', '0.5', '--template', 'qwen', '--do_train', 'true', '--do_eval', 'false', '--eval_strategy', 'steps', '--model_name_or_path', '/home/jovyan/zhubin/DATA/models/Qwen/Qwen2.5-14B', '--resize_vocab', 'true', '--use_fast_tokenizer', 'false', '--report_to', 'wandb', '--overwrite_output_dir', '--overwrite_cache', '--dataset', 'firefly_summary_part,COIG_PC_core_summary_part,vcsum_headlines,dialogsum,alpace_gpt4_zh_retain,csds_dialogue,alimeeting,union_conversations_v4_norm,liantong_conversations_v1', '--cutoff_len', '2048', '--output_dir',

@loadams
Contributor

loadams commented Oct 31, 2024

Hi @stephen-nju - I'm still not sure I follow what the problem is; could you state it one more time? You believe there is a bug where, when you pass just the node name to --include without specifying the full list of GPUs on that node, not all of the GPUs are used?

Would you consider opening a PR to fix to what you believe is the correct parsing?

@loadams loadams changed the title Error when parsing --include= Error when parsing GPUs on a node when only specifying node name --include=node3 vs --include=node3:1,2,4 Oct 31, 2024
@stephen-nju
Author

Hi @loadams - I think the argument --include=node3 should be equivalent to --include=node3:1,2,3,4,5,6,7,8 when there are 8 GPUs on node3. But when --include=node3 is set, the program raises "IndexError: list index out of range" instead of defaulting to all 8 GPUs on node3.

stephen-nju added a commit to stephen-nju/DeepSpeed that referenced this issue Nov 1, 2024