Hardware: 4×A100 (80 GB)
Fine-tuning on the official com_dataset produces the following error:
Traceback (most recent call last):
  File "/home/lyk/project/CogCoM/cogcom/finetune.py", line 324, in <module>
    model = training_main(args, model_cls=model,
  File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/sat/training/deepspeed_training.py", line 150, in training_main
    iteration, skipped = train(model, optimizer,
  File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/sat/training/deepspeed_training.py", line 349, in train
    lm_loss, skipped_iter, metrics = train_step(train_data_iterator,
  File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/sat/training/deepspeed_training.py", line 471, in train_step
    backward_step(optimizer, model, lm_loss, args, timers)
  File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/sat/training/deepspeed_training.py", line 507, in backward_step
    model.backward(loss)
  File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1976, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2056, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/autograd/function.py", line 289, in apply
    return user_fn(self, *args)
  File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 701, in backward
    torch.autograd.backward(output_tensors, grad_tensors)
  File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 903, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1416, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 939, in reduce_independent_p_g_buckets_and_remove_grads
    assert self.params_already_reduced[param_id] == False,
AssertionError: The parameter 67 has already been reduced. Gradient computed twice for this partition. Multiple gradient reduction is currently not supported

iZ6we1raky4t814hj7bojjZ:5536:6179 [2] NCCL INFO [Service thread] Connection closed by localRank 3
iZ6we1raky4t814hj7bojjZ:5534:6182 [0] NCCL INFO [Service thread] Connection closed by localRank 3
iZ6we1raky4t814hj7bojjZ:5536:6168 [2] NCCL INFO [Service thread] Connection closed by localRank 3
iZ6we1raky4t814hj7bojjZ:5534:6169 [0] NCCL INFO [Service thread] Connection closed by localRank 3
[2024-07-24 17:27:09,448] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 5534
iZ6we1raky4t814hj7bojjZ:5535:6180 [1] NCCL INFO [Service thread] Connection closed by localRank 0
iZ6we1raky4t814hj7bojjZ:5535:6167 [1] NCCL INFO [Service thread] Connection closed by localRank 0
[2024-07-24 17:27:12,368] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 5535
iZ6we1raky4t814hj7bojjZ:5536:6179 [2] NCCL INFO [Service thread] Connection closed by localRank 1
iZ6we1raky4t814hj7bojjZ:5536:6168 [2] NCCL INFO [Service thread] Connection closed by localRank 1
[2024-07-24 17:27:15,225] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 5536
[2024-07-24 17:27:18,082] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 5537
[2024-07-24 17:27:18,082] [ERROR] [launch.py:325:sigkill_handler] ['/home/lyk/anaconda3/envs/llm/bin/python', '-u', '/home/lyk/project/CogCoM/cogcom/finetune.py', '--local_rank=3', '--experiment-name', 'finetune-/data/llms/models/cogcom/cogcom-chat-17b', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '8000', '--resume-dataloader', '--from_pretrained', '/data/llms/models/cogcom/cogcom-chat-17b', '--max_source_length', '1225', '--max_target_length', '823', '--lora_rank', '10', '--use_lora', '--local_tokenizer', '/data/llms/models/cogcom/vicuna-7b-v1.5', '--version', 'chat', '--train-data', '/data/llms/datasets/cogcom/processed/save/com_offical_0724#CoM', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--vit_checkpoint_activations', '--save-interval', '4000', '--eval-interval', '4000', '--save', './checkpoints', '--eval-iters', '10', '--eval-batch-size', '1', '--split', '1.', '--deepspeed_config', '/home/lyk/project/CogCoM/cogcom/test_config_bf16_zero1off.json', '--skip-init', '--iterable-dataset', '--seed', '2024'] exits with return code = 1
While debugging, I found that after the 'crop_and_zoomin' operation the model runs forward twice (turn_id goes through rounds 0 and 1), and the two losses are summed before backward, so the gradient for the shared parameters is computed twice during the single backward pass. How should this be resolved? @qijimrc
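Not a confirmed fix, but the failure mode described above (two forward passes through the same parameters, losses summed, one backward) matches the usual trigger for this ZeRO stage 1/2 assertion: the per-parameter gradient-reduction hook fires twice within a single backward. One commonly suggested workaround is to call backward once per turn instead of once on the summed loss, since DeepSpeed resets its "already reduced" bookkeeping between backward() calls (the same mechanism that supports gradient accumulation). Below is a minimal sketch of the idea with a toy model; the `engine` handle and the two-turn loop are assumptions based on the description above, not the actual CogCoM/sat training code.

```python
import torch
import torch.nn as nn

# Toy stand-in for the two-turn forward described above: the same
# parameters are used in both turns, so a single backward() over the
# summed loss makes ZeRO's per-parameter reduction hook fire twice.
model = nn.Linear(8, 1)
loss_fn = nn.MSELoss()
turns = [(torch.randn(4, 8), torch.randn(4, 1)),   # turn_id 0
         (torch.randn(4, 8), torch.randn(4, 1))]   # turn_id 1

# Problematic pattern (under ZeRO stage 1/2, hypothetical `engine`):
#   total_loss = sum(loss_fn(model(x), y) for x, y in turns)
#   engine.backward(total_loss)   # -> "Gradient computed twice" assert

# Workaround sketch: back-propagate each turn's loss separately, so
# each backward() touches every parameter at most once. Gradients
# still accumulate across the calls before the optimizer step.
for x, y in turns:
    loss = loss_fn(model(x), y)
    loss.backward()   # with DeepSpeed: engine.backward(loss)
```

If restructuring the loss is not feasible, it may also be worth checking how this interacts with --checkpoint-activations, since activation checkpointing re-runs the forward during backward and appears in the stack trace above; that is speculation on my part, though.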