Qwen2.5-VL full SFT dtype error #6791

Open
wyuc opened this issue Feb 2, 2025 · 1 comment
Labels: bug (Something isn't working) · pending (This problem is yet to be addressed)

Comments

wyuc commented Feb 2, 2025

Reminder

  • I have read the above rules and searched the existing issues.

System Info

  • llamafactory version: 0.9.2.dev0
  • Platform: Linux-3.10.0-1160.el7.x86_64-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • PyTorch version: 2.5.1+cu124 (GPU)
  • Transformers version: 4.49.0.dev0
  • Datasets version: 3.2.0
  • Accelerate version: 1.2.1
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA A800-SXM4-80GB
  • DeepSpeed version: 0.16.2
  • vLLM version: 0.6.5

Reproduction

Training script

### model
model_name_or_path: /model/base/qwen/Qwen2.5-VL-7B-Instruct

### method
stage: sft
do_train: true
finetuning_type: full
freeze_vision_tower: true  # choices: [true, false]
train_mm_proj_only: false  # choices: [true, false]
deepspeed: examples/deepspeed/ds_z3_config.json  # choices: [ds_z0_config.json, ds_z2_config.json, ds_z3_config.json]

flash_attn: fa2

### dataset
dataset: longwriter-v-10k
template: qwen2_vl
cutoff_len: 32768
# max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 8

### output
output_dir: /model/trained/qwen/qwen2.5_vl-7b
logging_steps: 1
save_steps: 100
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 1.0e-5
num_train_epochs: 3
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
# val_size: 0.001
# per_device_eval_batch_size: 1
# eval_strategy: steps
# eval_steps: 100
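
As a quick sanity check on where float32 activations could enter a bf16 run like the one above, the sketch below loads the model in bf16 and compares the parameter dtypes of the vision tower against the language model. This is a diagnostic sketch, not part of the config: the hub ID Qwen/Qwen2.5-VL-7B-Instruct stands in for the local path, model.visual is the vision tower seen in the traceback below, and model.model is assumed to be the language-model submodule.

# Hypothetical dtype check: load the model in bf16 (as the config does) and
# see whether the frozen vision tower and the language model agree on dtype.
import torch
from transformers import Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",  # hub ID assumed in place of the local path
    torch_dtype=torch.bfloat16,
)
print("vision tower dtypes:", {p.dtype for p in model.visual.parameters()})
print("language model dtypes:", {p.dtype for p in model.model.parameters()})

If both print {torch.bfloat16}, the float32 in the assertion below is produced at runtime (e.g. by an upcast inside the rotary helper) rather than by the loaded weights.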

Error message:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/app/src/llamafactory/launcher.py", line 23, in <module>
[rank0]:     launch()
[rank0]:   File "/app/src/llamafactory/launcher.py", line 19, in launch
[rank0]:     run_exp()
[rank0]:   File "/app/src/llamafactory/train/tuner.py", line 92, in run_exp
[rank0]:     _training_function(config={"args": args, "callbacks": callbacks})
[rank0]:   File "/app/src/llamafactory/train/tuner.py", line 66, in _training_function
[rank0]:     run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank0]:   File "/app/src/llamafactory/train/sft/workflow.py", line 101, in run_sft
[rank0]:     train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2184, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2490, in _inner_training_loop
[rank0]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 3598, in training_step
[rank0]:     loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 3659, in compute_loss
[rank0]:     outputs = model(**inputs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 1914, in forward
[rank0]:     loss = self.module(*inputs, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1844, in _call_impl
[rank0]:     return inner()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1790, in inner
[rank0]:     result = forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 1739, in forward
[rank0]:     image_embeds = self.visual(pixel_values, grid_thw=image_grid_thw)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1844, in _call_impl
[rank0]:     return inner()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1790, in inner
[rank0]:     result = forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 496, in forward
[rank0]:     hidden_states = self._gradient_checkpointing_func(
[rank0]:   File "/app/src/llamafactory/model/model_utils/checkpointing.py", line 93, in custom_gradient_checkpointing_func
[rank0]:     return gradient_checkpointing_func(func, *args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/_compile.py", line 32, in inner
[rank0]:     return disable_fn(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 632, in _fn
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py", line 489, in checkpoint
[rank0]:     return CheckpointFunction.apply(function, preserve, *args)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 575, in apply
[rank0]:     return super().apply(*args, **kwargs)  # type: ignore[misc]
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py", line 264, in forward
[rank0]:     outputs = run_function(*args)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1844, in _call_impl
[rank0]:     return inner()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1790, in inner
[rank0]:     result = forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 296, in forward
[rank0]:     hidden_states = hidden_states + self.attn(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1844, in _call_impl
[rank0]:     return inner()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1790, in inner
[rank0]:     result = forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 185, in forward
[rank0]:     q = apply_rotary_pos_emb_flashatt(q.unsqueeze(0), rotary_pos_emb).squeeze(0)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 166, in apply_rotary_pos_emb_flashatt
[rank0]:     output = apply_rotary_emb(tensor_, cos, sin).type_as(tensor)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/flash_attn/layers/rotary.py", line 122, in apply_rotary_emb
[rank0]:     return ApplyRotaryEmb.apply(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 575, in apply
[rank0]:     return super().apply(*args, **kwargs)  # type: ignore[misc]
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/flash_attn/layers/rotary.py", line 48, in forward
[rank0]:     out = apply_rotary(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/flash_attn/ops/triton/rotary.py", line 176, in apply_rotary
[rank0]:     x.dtype == cos.dtype
[rank0]: AssertionError: Input and cos/sin must have the same dtype, got torch.float32 and torch.bfloat16
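
The assertion comes from flash-attn's Triton rotary kernel, which requires the input and the cos/sin tables to share a dtype. Per the frames above, apply_rotary_pos_emb_flashatt apparently upcasts the input to float32 (tensor_ ... .type_as(tensor)) while the cos/sin tables stay in bfloat16. A workaround sketch, under the assumption that the function body matches the traceback and with the cast being my addition rather than the upstream fix: monkey-patch the helper so the tables are cast to the upcast input's dtype before the kernel runs.

# Hypothetical workaround, not an official fix: cast the cos/sin tables to the
# same dtype as the (upcast) input before flash-attn's Triton kernel asserts.
# Patching the module-level name works because the vision attention in
# modeling_qwen2_5_vl calls apply_rotary_pos_emb_flashatt through it.
import torch
from flash_attn.layers.rotary import apply_rotary_emb
from transformers.models.qwen2_5_vl import modeling_qwen2_5_vl

def _patched_apply_rotary_pos_emb_flashatt(tensor: torch.Tensor, freqs: torch.Tensor) -> torch.Tensor:
    tensor_ = tensor.float()
    cos = freqs.cos().to(tensor_.dtype)  # cast tables to match the upcast input
    sin = freqs.sin().to(tensor_.dtype)
    return apply_rotary_emb(tensor_, cos, sin).type_as(tensor)

modeling_qwen2_5_vl.apply_rotary_pos_emb_flashatt = _patched_apply_rotary_pos_emb_flashatt

The patch would have to run before the trainer instantiates the model (e.g. early in the launch script) so that every rank picks it up.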

Others

No response

hiyouga (Owner) commented Feb 2, 2025

Related issue: QwenLM/Qwen2.5-VL#706
