
Error when parsing GPUs on a node when only specifying node name --include=node3 vs --include=node3:1,2,4 #6671

stephen-nju opened this issue Oct 26, 2024 · 5 comments · May be fixed by #6698

@stephen-nju
When using --include=node3, deepspeed raises a parsing error, but --include=node3:1,2,3,4,5,6,7,8 works.
I checked the runner.py code: when SLOT_LIST_START is not in the config, the devices are set to [], but the help text for --include says "If :SLOT is omitted, include all slots on that host".
(screenshot of the parsing error)
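To illustrate the expected behavior, here is a minimal, hypothetical re-implementation of the include-string parsing (this is not the actual runner.py code; `parse_include` and `resource_pool` are names invented for this sketch, and it assumes `@` separates hosts as in DeepSpeed's documented include syntax). When `:SLOT` is omitted, the host should receive all of its slots; the buggy path instead leaves the device list empty, which later triggers the `IndexError` at `first_host = list(active_resources.keys())[0]`.

```python
def parse_include(include_str, resource_pool):
    """Sketch of include-filter parsing.

    resource_pool maps hostname -> number of slots (GPUs) on that host.
    Returns a dict mapping hostname -> list of included slot indices.
    """
    active = {}
    for entry in include_str.split("@"):  # '@' separates hosts
        if ":" in entry:
            # Explicit slot list, e.g. "node3:1,2,4"
            host, slots = entry.split(":")
            active[host] = [int(s) for s in slots.split(",")]
        else:
            # ":SLOT" omitted -> include all slots on the host.
            # (The reported bug produces an empty device list here instead.)
            active[entry] = list(range(resource_pool[entry]))
    return active

pool = {"node3": 8}
print(parse_include("node3", pool))        # all 8 slots
print(parse_include("node3:1,2,4", pool))  # explicit subset
```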

@stephen-nju
Author

(screenshot of the relevant deepspeed code)

@loadams loadams self-assigned this Oct 28, 2024
@loadams
Contributor

loadams commented Oct 28, 2024

Hi @stephen-nju - can you please update the title to better reflect the issue? The current title is copied from the CI workflow failure script. I tried to rewrite it, but more information would help. Could you add a sample repro case or more details?

@loadams loadams changed the title {{ env.GITHUB_WORKFLOW }} CI test failure Error when parsing --include= Oct 28, 2024
@stephen-nju
Author

deepspeed version=0.15.3
When the :SLOT part of --include is omitted, deepspeed fails to parse it and raises a "list index out of range" error.

The only difference between the two scripts below is the --include parameter:
##############################
./llmtrain --do_train --stage sft --name=1026_Qwen2.5_14B_fcvdacaul_ep3_lr2e6_bs4 --model_name_or_path /home/jovyan/zhubin/DATA/models/Qwen/Qwen2.5-14B --template qwen --dataset firefly_summary_part,COIG_PC_core_summary_part,vcsum_headlines,dialogsum,alpace_gpt4_zh_retain,csds_dialogue,alimeeting,union_conversations_v4_norm,liantong_conversations_v1 --finetuning_type full --batch_size 2 --gradient_accumulation_steps 4 --cutoff_len 2048 --epochs 2 --lr=2e-5 --save_strategy=steps --save_steps=500 --save_total_limit=10 --eval_dataset union_conversations_v4_dev --eval_strategy=steps --eval_steps=500 --warmup_ratio=0.01 --include node1

optinonal paramas neftune_noise_alpha is null
wandb dir=/home/jovyan/zhubin/saved_checkpoint/1026_Qwen2.5_14B_fcvdacaul_ep3_lr2e6_bs4/logs
W&B offline. Running your script from this directory will only write metadata locally. Use wandb disabled to completely turn off W&B.
[2024-10-29 16:21:41,935] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Traceback (most recent call last):
File "/home/jovyan/conda-env/envs/zb_sw/bin/deepspeed", line 6, in <module>
main()
File "/home/jovyan/conda-env/envs/zb_sw/lib/python3.10/site-packages/deepspeed/launcher/runner.py", line 464, in main
first_host = list(active_resources.keys())[0]
IndexError: list index out of range
#####################################
(zb_sw) ➜ shell git:(zhubin) ✗ ./llmtrain --do_train --stage sft --name=1026_Qwen2.5_14B_fcvdacaul_ep3_lr2e6_bs4 --model_name_or_path /home/jovyan/zhubin/DATA/models/Qwen/Qwen2.5-14B --template qwen --dataset firefly_summary_part,COIG_PC_core_summary_part,vcsum_headlines,dialogsum,alpace_gpt4_zh_retain,csds_dialogue,alimeeting,union_conversations_v4_norm,liantong_conversations_v1 --finetuning_type full --batch_size 2 --gradient_accumulation_steps 4 --cutoff_len 2048 --epochs 2 --lr=2e-5 --save_strategy=steps --save_steps=500 --save_total_limit=10 --eval_dataset union_conversations_v4_dev --eval_strategy=steps --eval_steps=500 --warmup_ratio=0.01 --include node1:1,2,3,4,5

optinonal paramas neftune_noise_alpha is null
wandb dir=/home/jovyan/zhubin/saved_checkpoint/1026_Qwen2.5_14B_fcvdacaul_ep3_lr2e6_bs4/logs
W&B offline. Running your script from this directory will only write metadata locally. Use wandb disabled to completely turn off W&B.
[2024-10-29 16:24:36,968] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-29 16:24:42,162] [INFO] [runner.py:497:main] Using IP address of 10.71.130.130 for node node1
[2024-10-29 16:24:42,162] [INFO] [runner.py:608:main] cmd = /home/jovyan/conda-env/envs/zb_sw/bin/python3.10 -u -m deepspeed.launcher.launch --world_info=eyJub2RlMSI6IFsxLCAyLCAzLCA0LCA1XX0= --master_addr=10.71.130.130 --master_port=56551 --no_local_rank --enable_each_rank_log=None src/train.py --deepspeed /home/jovyan/zhubin/code/LLaMA-Factory//config/deepspeed/zero_stage2_config.json --stage sft --pref_beta 0.1 --pref_loss simpo --simpo_gamma 0.5 --template qwen --do_train true --do_eval false --eval_strategy steps --model_name_or_path /home/jovyan/zhubin/DATA/models/Qwen/Qwen2.5-14B --resize_vocab true --use_fast_tokenizer false --report_to wandb --overwrite_output_dir --overwrite_cache --dataset firefly_summary_part,COIG_PC_core_summary_part,vcsum_headlines,dialogsum,alpace_gpt4_zh_retain,csds_dialogue,alimeeting,union_conversations_v4_norm,liantong_conversations_v1 --cutoff_len 2048 --output_dir /home/jovyan/zhubin/saved_checkpoint/1026_Qwen2.5_14B_fcvdacaul_ep3_lr2e6_bs4 --num_train_epochs 2 --overwrite_cache --finetuning_type full --lora_rank 32 --lora_target all --warmup_ratio 0.01 --logging_steps 5 --lr_scheduler_type cosine --per_device_train_batch_size 2 --per_device_eval_batch_size 2 --gradient_accumulation_steps 4 --preprocessing_num_workers 16 --save_strategy steps --save_steps 500 --save_total_limit 10 --learning_rate 2e-5 --ddp_timeout 180000000 --bf16 true --eval_dataset union_conversations_v4_dev --eval_steps 500
[2024-10-29 16:24:44,682] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-29 16:24:48,272] [INFO] [launch.py:146:main] WORLD INFO DICT: {'node1': [1, 2, 3, 4, 5]}
[2024-10-29 16:24:48,272] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=5, node_rank=0
[2024-10-29 16:24:48,272] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'node1': [0, 1, 2, 3, 4]})
[2024-10-29 16:24:48,272] [INFO] [launch.py:164:main] dist_world_size=5
[2024-10-29 16:24:48,272] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=1,2,3,4,5
[2024-10-29 16:24:48,272] [INFO] [launch.py:256:main] process 2471926 spawned with command: ['/home/jovyan/conda-env/envs/zb_sw/bin/python3.10', '-u', 'src/train.py', '--deepspeed', '/home/jovyan/zhubin/code/LLaMA-Factory//config/deepspeed/zero_stage2_config.json', '--stage', 'sft', '--pref_beta', '0.1', '--pref_loss', 'simpo', '--simpo_gamma', '0.5', '--template', 'qwen', '--do_train', 'true', '--do_eval', 'false', '--eval_strategy', 'steps', '--model_name_or_path', '/home/jovyan/zhubin/DATA/models/Qwen/Qwen2.5-14B', '--resize_vocab', 'true', '--use_fast_tokenizer', 'false', '--report_to', 'wandb', '--overwrite_output_dir', '--overwrite_cache', '--dataset', 'firefly_summary_part,COIG_PC_core_summary_part,vcsum_headlines,dialogsum,alpace_gpt4_zh_retain,csds_dialogue,alimeeting,union_conversations_v4_norm,liantong_conversations_v1', '--cutoff_len', '2048', '--output_dir',

@loadams
Contributor

loadams commented Oct 31, 2024

Hi @stephen-nju - I'm still not sure I follow what the problem is; could you state it one more time? You believe there is a bug where, when you pass just the node name to --include without specifying the full list of GPUs on that node, not all of the GPUs are used?

Would you consider opening a PR to fix to what you believe is the correct parsing?

@loadams loadams changed the title Error when parsing --include= Error when parsing GPUs on a node when only specifying node name --include=node3 vs --include=node3:1,2,4 Oct 31, 2024
@stephen-nju
Author

Hi @loadams - I think the argument --include=node3 should be equivalent to --include=node3:1,2,3,4,5,6,7,8 when there are 8 GPUs on node3. But when --include=node3 is set, the program raises "IndexError: list index out of range" instead of defaulting to all 8 GPUs on node3.

stephen-nju added a commit to stephen-nju/DeepSpeed that referenced this issue Nov 1, 2024