Gradient computed twice for this partition #26

Open
terryII opened this issue Jul 24, 2024 · 0 comments
terryII commented Jul 24, 2024

Hardware: 4×A100 (80G)

Fine-tuning on the official com_dataset produces the following error:

Traceback (most recent call last):
File "/home/lyk/project/CogCoM/cogcom/finetune.py", line 324, in
model = training_main(args, model_cls=model,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/sat/training/deepspeed_training.py", line 150, in training_main
iteration, skipped = train(model, optimizer,
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/sat/training/deepspeed_training.py", line 349, in train
lm_loss, skipped_iter, metrics = train_step(train_data_iterator,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/sat/training/deepspeed_training.py", line 471, in train_step
backward_step(optimizer, model, lm_loss, args, timers)
File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/sat/training/deepspeed_training.py", line 507, in backward_step
model.backward(loss)
File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1976, in backward
self.optimizer.backward(loss, retain_graph=retain_graph)
File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2056, in backward
self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
scaled_loss.backward(retain_graph=retain_graph)
File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/_tensor.py", line 522, in backward
torch.autograd.backward(
File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/autograd/init.py", line 266, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/autograd/function.py", line 289, in apply
return user_fn(self, *args)
^^^^^^^^^^^^^^^^^^^^
File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 701, in backward
torch.autograd.backward(output_tensors, grad_tensors)
File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/autograd/init.py", line 266, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 903, in reduce_partition_and_remove_grads
self.reduce_ready_partitions_and_remove_grads(param, i)
File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1416, in reduce_ready_partitions_and_remove_grads
self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 939, in reduce_independent_p_g_buckets_and_remove_grads
assert self.params_already_reduced[param_id] == False,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: The parameter 67 has already been reduced. Gradient computed twice for this partition. Multiple gradient reduction is currently not supported
iZ6we1raky4t814hj7bojjZ:5536:6179 [2] NCCL INFO [Service thread] Connection closed by localRank 3
iZ6we1raky4t814hj7bojjZ:5534:6182 [0] NCCL INFO [Service thread] Connection closed by localRank 3
iZ6we1raky4t814hj7bojjZ:5536:6168 [2] NCCL INFO [Service thread] Connection closed by localRank 3
iZ6we1raky4t814hj7bojjZ:5534:6169 [0] NCCL INFO [Service thread] Connection closed by localRank 3
[2024-07-24 17:27:09,448] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 5534
iZ6we1raky4t814hj7bojjZ:5535:6180 [1] NCCL INFO [Service thread] Connection closed by localRank 0
iZ6we1raky4t814hj7bojjZ:5535:6167 [1] NCCL INFO [Service thread] Connection closed by localRank 0
[2024-07-24 17:27:12,368] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 5535
iZ6we1raky4t814hj7bojjZ:5536:6179 [2] NCCL INFO [Service thread] Connection closed by localRank 1
iZ6we1raky4t814hj7bojjZ:5536:6168 [2] NCCL INFO [Service thread] Connection closed by localRank 1
[2024-07-24 17:27:15,225] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 5536
[2024-07-24 17:27:18,082] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 5537
[2024-07-24 17:27:18,082] [ERROR] [launch.py:325:sigkill_handler] ['/home/lyk/anaconda3/envs/llm/bin/python', '-u', '/home/lyk/project/CogCoM/cogcom/finetune.py', '--local_rank=3', '--experiment-name', 'finetune-/data/llms/models/cogcom/cogcom-chat-17b', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '8000', '--resume-dataloader', '--from_pretrained', '/data/llms/models/cogcom/cogcom-chat-17b', '--max_source_length', '1225', '--max_target_length', '823', '--lora_rank', '10', '--use_lora', '--local_tokenizer', '/data/llms/models/cogcom/vicuna-7b-v1.5', '--version', 'chat', '--train-data', '/data/llms/datasets/cogcom/processed/save/com_offical_0724#CoM', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--vit_checkpoint_activations', '--save-interval', '4000', '--eval-interval', '4000', '--save', './checkpoints', '--eval-iters', '10', '--eval-batch-size', '1', '--split', '1.', '--deepspeed_config', '/home/lyk/project/CogCoM/cogcom/test_config_bf16_zero1off.json', '--skip-init', '--iterable-dataset', '--seed', '2024'] exits with return code = 1

While debugging, I found that the 'crop_and_zoomin' operation causes the model to run forward twice (turn_id goes through two rounds, 0 and 1), and the two losses are then summed, so the gradient for the same parameters is computed twice during backward. How should this be resolved? @qijimrc
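For reference, the double gradient delivery can be reproduced outside CogCoM with plain PyTorch plus reentrant activation checkpointing (the run above uses --checkpoint-activations, and the traceback goes through DeepSpeed's activation_checkpointing/checkpointing.py). The sketch below is illustrative only: `shared` is a stand-in for the weights reused across the two turns, and the hook just counts how many times each parameter receives a gradient in one backward. Under ZeRO stage 1/2, which reduces a parameter's gradient partition as soon as it arrives, the second delivery is what trips the `params_already_reduced` assertion.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

shared = nn.Linear(8, 8)                    # stand-in for weights reused by both turns
x = torch.randn(4, 8, requires_grad=True)

# Count gradient deliveries per parameter within a single backward pass.
hits = {}
def make_hook(name):
    def hook(grad):
        hits[name] = hits.get(name, 0) + 1
    return hook
for name, p in shared.named_parameters():
    p.register_hook(make_hook(name))

# Two checkpointed segments (like turn_id 0 and 1) reuse the same weights,
# and the two losses are summed into one scalar before a single backward:
y0 = checkpoint(shared, x, use_reentrant=True)
y1 = checkpoint(shared, y0, use_reentrant=True)
(y0.sum() + y1.sum()).backward()

print(hits)  # {'weight': 2, 'bias': 2}: each segment's inner backward delivers
             # its own gradient for the shared parameters. Plain PyTorch simply
             # adds them into .grad; ZeRO stage 1/2 reduces the partition on the
             # first delivery and asserts on the second.
```

I can't confirm the officially intended fix, but a generic workaround is to call model.backward(...) once per turn instead of summing the losses, with gradient_accumulation_steps in the DeepSpeed config sized to cover both turns, so each turn's gradients are reduced in their own backward; whether that applies here depends on whether turn 1's graph depends on turn 0's output.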
