Gradient computed twice for this partition #26

Open
terryII opened this issue Jul 24, 2024 · 0 comments
terryII commented Jul 24, 2024

Hardware: 4×A100 (80G)

Fine-tuning on the official com_dataset produces the following error:

Traceback (most recent call last):
File "/home/lyk/project/CogCoM/cogcom/finetune.py", line 324, in
model = training_main(args, model_cls=model,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/sat/training/deepspeed_training.py", line 150, in training_main
iteration, skipped = train(model, optimizer,
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/sat/training/deepspeed_training.py", line 349, in train
lm_loss, skipped_iter, metrics = train_step(train_data_iterator,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/sat/training/deepspeed_training.py", line 471, in train_step
backward_step(optimizer, model, lm_loss, args, timers)
File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/sat/training/deepspeed_training.py", line 507, in backward_step
model.backward(loss)
File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1976, in backward
self.optimizer.backward(loss, retain_graph=retain_graph)
File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2056, in backward
self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
scaled_loss.backward(retain_graph=retain_graph)
File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/_tensor.py", line 522, in backward
torch.autograd.backward(
File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/autograd/init.py", line 266, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/autograd/function.py", line 289, in apply
return user_fn(self, *args)
^^^^^^^^^^^^^^^^^^^^
File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 701, in backward
torch.autograd.backward(output_tensors, grad_tensors)
File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/autograd/init.py", line 266, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 903, in reduce_partition_and_remove_grads
self.reduce_ready_partitions_and_remove_grads(param, i)
File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1416, in reduce_ready_partitions_and_remove_grads
self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 939, in reduce_independent_p_g_buckets_and_remove_grads
assert self.params_already_reduced[param_id] == False,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: The parameter 67 has already been reduced. Gradient computed twice for this partition. Multiple gradient reduction is currently not supported
iZ6we1raky4t814hj7bojjZ:5536:6179 [2] NCCL INFO [Service thread] Connection closed by localRank 3
iZ6we1raky4t814hj7bojjZ:5534:6182 [0] NCCL INFO [Service thread] Connection closed by localRank 3
iZ6we1raky4t814hj7bojjZ:5536:6168 [2] NCCL INFO [Service thread] Connection closed by localRank 3
iZ6we1raky4t814hj7bojjZ:5534:6169 [0] NCCL INFO [Service thread] Connection closed by localRank 3
[2024-07-24 17:27:09,448] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 5534
iZ6we1raky4t814hj7bojjZ:5535:6180 [1] NCCL INFO [Service thread] Connection closed by localRank 0
iZ6we1raky4t814hj7bojjZ:5535:6167 [1] NCCL INFO [Service thread] Connection closed by localRank 0
[2024-07-24 17:27:12,368] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 5535
iZ6we1raky4t814hj7bojjZ:5536:6179 [2] NCCL INFO [Service thread] Connection closed by localRank 1
iZ6we1raky4t814hj7bojjZ:5536:6168 [2] NCCL INFO [Service thread] Connection closed by localRank 1
[2024-07-24 17:27:15,225] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 5536
[2024-07-24 17:27:18,082] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 5537
[2024-07-24 17:27:18,082] [ERROR] [launch.py:325:sigkill_handler] ['/home/lyk/anaconda3/envs/llm/bin/python', '-u', '/home/lyk/project/CogCoM/cogcom/finetune.py', '--local_rank=3', '--experiment-name', 'finetune-/data/llms/models/cogcom/cogcom-chat-17b', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '8000', '--resume-dataloader', '--from_pretrained', '/data/llms/models/cogcom/cogcom-chat-17b', '--max_source_length', '1225', '--max_target_length', '823', '--lora_rank', '10', '--use_lora', '--local_tokenizer', '/data/llms/models/cogcom/vicuna-7b-v1.5', '--version', 'chat', '--train-data', '/data/llms/datasets/cogcom/processed/save/com_offical_0724#CoM', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--vit_checkpoint_activations', '--save-interval', '4000', '--eval-interval', '4000', '--save', './checkpoints', '--eval-iters', '10', '--eval-batch-size', '1', '--split', '1.', '--deepspeed_config', '/home/lyk/project/CogCoM/cogcom/test_config_bf16_zero1off.json', '--skip-init', '--iterable-dataset', '--seed', '2024'] exits with return code = 1

While debugging, I found that the 'crop_and_zoomin' operation causes the model to run forward twice (turn_id goes through two rounds, 0 and 1), and the two losses are then summed, so the gradient for the same parameters is computed twice during backward. How should this be resolved? @qijimrc
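For reference, the double gradient delivery can be reproduced outside CogCoM with plain PyTorch plus reentrant activation checkpointing (the run above uses --checkpoint-activations, and the traceback goes through DeepSpeed's activation_checkpointing/checkpointing.py). The sketch below is illustrative only: `shared` is a stand-in for the weights reused across the two turns, and the hook just counts how many times each parameter receives a gradient in one backward. Under ZeRO stage 1/2, which reduces a parameter's gradient partition as soon as it arrives, the second delivery is what trips the `params_already_reduced` assertion.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

shared = nn.Linear(8, 8)                    # stand-in for weights reused by both turns
x = torch.randn(4, 8, requires_grad=True)

# Count gradient deliveries per parameter within a single backward pass.
hits = {}
def make_hook(name):
    def hook(grad):
        hits[name] = hits.get(name, 0) + 1
    return hook
for name, p in shared.named_parameters():
    p.register_hook(make_hook(name))

# Two checkpointed segments (like turn_id 0 and 1) reuse the same weights,
# and the two losses are summed into one scalar before a single backward:
y0 = checkpoint(shared, x, use_reentrant=True)
y1 = checkpoint(shared, y0, use_reentrant=True)
(y0.sum() + y1.sum()).backward()

print(hits)  # {'weight': 2, 'bias': 2}: each segment's inner backward delivers
             # its own gradient for the shared parameters. Plain PyTorch simply
             # adds them into .grad; ZeRO stage 1/2 reduces the partition on the
             # first delivery and asserts on the second.
```

I can't confirm the officially intended fix, but a generic workaround is to call model.backward(...) once per turn instead of summing the losses, with gradient_accumulation_steps in the DeepSpeed config sized to cover both turns, so each turn's gradients are reduced in their own backward; whether that applies here depends on whether turn 1's graph depends on turn 0's output.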
