Update: Fix eval crash by disabling vLLM when using DeepSpeed #147

Open
wants to merge 5 commits into base: main
Conversation

ATaylorAerospace
Contributor

What's broken:
In #145, evaluation crashes with AttributeError: model has no 'optimizer' because DeepSpeed ZeRO-3 hides the optimizer (it is managed separately), and vLLM's setup clashes with this.

The fix:
Disable vLLM in recipes/qwen/Qwen2.5-1.5B-Instruct/grpo/config_full.yaml. This removes the conflict, and DeepSpeed then handles the optimizer correctly.
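
For reference, a minimal sketch of the intended change; use_vllm is the flag discussed in this thread, and all surrounding keys are omitted:

```yaml
# recipes/qwen/Qwen2.5-1.5B-Instruct/grpo/config_full.yaml (excerpt; other keys unchanged)
use_vllm: false  # avoid the vLLM / DeepSpeed ZeRO-3 optimizer conflict described above
```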

@Some-random
Contributor

I seem to be getting a tensor size mismatch error after changing use_vllm to false:

Invalidate trace cache @ step 0 and module 740: cache has only 0 modules
[rank3]: Traceback (most recent call last):
[rank3]:   File "/fsx/ubuntu/open-r1/src/open_r1/grpo.py", line 237, in <module>
[rank3]:     main(script_args, training_args, model_args)
[rank3]:   File "/fsx/ubuntu/open-r1/src/open_r1/grpo.py", line 189, in main
[rank3]:     train_result = trainer.train(resume_from_checkpoint=checkpoint)
[rank3]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/transformers/trainer.py", line 2175, in train
[rank3]:     return inner_training_loop(
[rank3]:            ^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/transformers/trainer.py", line 2490, in _inner_training_loop
[rank3]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank3]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/transformers/trainer.py", line 3598, in training_step
[rank3]:     loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/trl/trainer/grpo_trainer.py", line 422, in compute_loss
[rank3]:     prompt_completion_ids = unwrapped_model.generate(
[rank3]:                             ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank3]:     return func(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/transformers/generation/utils.py", line 2224, in generate
[rank3]:     result = self._sample(
[rank3]:              ^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/transformers/generation/utils.py", line 3208, in _sample
[rank3]:     outputs = model_forward(**model_inputs, return_dict=True)
[rank3]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1844, in _call_impl
[rank3]:     return inner()
[rank3]:            ^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1790, in inner
[rank3]:     result = forward_call(*args, **kwargs)
[rank3]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
[rank3]:     return func(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 824, in forward
[rank3]:     outputs = self.model(
[rank3]:               ^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank3]:     return forward_call(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 567, in forward
[rank3]:     layer_outputs = self._gradient_checkpointing_func(
[rank3]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/_compile.py", line 32, in inner
[rank3]:     return disable_fn(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 632, in _fn
[rank3]:     return fn(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/utils/checkpoint.py", line 496, in checkpoint
[rank3]:     ret = function(*args, **kwargs)
[rank3]:           ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank3]:     return forward_call(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 260, in forward
[rank3]:     hidden_states, self_attn_weights = self.self_attn(
[rank3]:                                        ^^^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank3]:     return forward_call(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 167, in forward
[rank3]:     query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
[rank3]:                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 90, in apply_rotary_pos_emb
[rank3]:     q_embed = (q * cos) + (rotate_half(q) * sin)
[rank3]:                ~~^~~~~
[rank3]: RuntimeError: The size of tensor a (735) must match the size of tensor b (736) at non-singleton dimension 2

@ctjlewis
Contributor

ctjlewis commented Feb 2, 2025

@Some-random, I ran into the same series of issues as you and the OP. Just turned off vLLM. cc @qgallouedec

@pyh314

pyh314 commented Feb 3, 2025

@Some-random I get the same error as yours after changing use_vllm to false. The approach in this PR does not seem to work.

@Some-random
Contributor

Another workaround is to change the DeepSpeed stage to ZeRO-2 and keep vLLM on, for example:
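
A minimal sketch of that setup, assuming training is launched through an Accelerate DeepSpeed config; the file name and surrounding values are illustrative, not the exact recipe from this repo:

```yaml
# zero2.yaml -- illustrative Accelerate config (adjust to your setup)
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 2                   # ZeRO-2 instead of ZeRO-3
  offload_optimizer_device: none
  offload_param_device: none
mixed_precision: bf16
num_processes: 8                  # set to your GPU count
```

With the DeepSpeed stage lowered, use_vllm can stay true in the GRPO config.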
