Update: Fix eval crash by disabling vLLM when using DeepSpeed #147

Open
wants to merge 5 commits into base: main
Conversation

ATaylorAerospace
Contributor

What's broken:
In #145, evaluation crashes with AttributeError: model has no 'optimizer' because DeepSpeed ZeRO-3 hides the optimizer (it is managed separately), and vLLM's setup clashes with this.

The fix:
Disable vLLM in recipes/qwen/Qwen2.5-1.5B-Instruct/grpo/config_full.yaml. This removes the conflict, and DeepSpeed then handles the optimizer correctly.
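
For reference, a minimal sketch of the intended change; use_vllm is the flag discussed in this thread, and all surrounding keys are omitted:

```yaml
# recipes/qwen/Qwen2.5-1.5B-Instruct/grpo/config_full.yaml (excerpt; other keys unchanged)
use_vllm: false  # avoid the vLLM / DeepSpeed ZeRO-3 optimizer conflict described above
```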

@Some-random
Contributor

I seem to be getting a tensor size mismatch error after changing use_vllm to false:

Invalidate trace cache @ step 0 and module 740: cache has only 0 modules
[rank3]: Traceback (most recent call last):
[rank3]:   File "/fsx/ubuntu/open-r1/src/open_r1/grpo.py", line 237, in <module>
[rank3]:     main(script_args, training_args, model_args)
[rank3]:   File "/fsx/ubuntu/open-r1/src/open_r1/grpo.py", line 189, in main
[rank3]:     train_result = trainer.train(resume_from_checkpoint=checkpoint)
[rank3]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/transformers/trainer.py", line 2175, in train
[rank3]:     return inner_training_loop(
[rank3]:            ^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/transformers/trainer.py", line 2490, in _inner_training_loop
[rank3]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank3]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/transformers/trainer.py", line 3598, in training_step
[rank3]:     loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/trl/trainer/grpo_trainer.py", line 422, in compute_loss
[rank3]:     prompt_completion_ids = unwrapped_model.generate(
[rank3]:                             ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank3]:     return func(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/transformers/generation/utils.py", line 2224, in generate
[rank3]:     result = self._sample(
[rank3]:              ^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/transformers/generation/utils.py", line 3208, in _sample
[rank3]:     outputs = model_forward(**model_inputs, return_dict=True)
[rank3]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1844, in _call_impl
[rank3]:     return inner()
[rank3]:            ^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1790, in inner
[rank3]:     result = forward_call(*args, **kwargs)
[rank3]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
[rank3]:     return func(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 824, in forward
[rank3]:     outputs = self.model(
[rank3]:               ^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank3]:     return forward_call(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 567, in forward
[rank3]:     layer_outputs = self._gradient_checkpointing_func(
[rank3]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/_compile.py", line 32, in inner
[rank3]:     return disable_fn(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 632, in _fn
[rank3]:     return fn(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/utils/checkpoint.py", line 496, in checkpoint
[rank3]:     ret = function(*args, **kwargs)
[rank3]:           ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank3]:     return forward_call(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 260, in forward
[rank3]:     hidden_states, self_attn_weights = self.self_attn(
[rank3]:                                        ^^^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank3]:     return forward_call(*args, **kwargs)
[rank3]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 167, in forward
[rank3]:     query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
[rank3]:                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank3]:   File "/fsx/ubuntu/miniconda3/envs/openr1/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 90, in apply_rotary_pos_emb
[rank3]:     q_embed = (q * cos) + (rotate_half(q) * sin)
[rank3]:                ~~^~~~~
[rank3]: RuntimeError: The size of tensor a (735) must match the size of tensor b (736) at non-singleton dimension 2

@ctjlewis
Contributor

ctjlewis commented Feb 2, 2025

@Some-random, I ran into the same series of issues as you and the OP. Just turned off vLLM. cc @qgallouedec

@pyh314

pyh314 commented Feb 3, 2025

@Some-random I get the same error as yours after changing use_vllm to false. The approach in this PR does not seem to work.

@Some-random
Contributor

Another workaround is to change the DeepSpeed stage to ZeRO-2 and keep vLLM on, for example:
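
A minimal sketch of that setup, assuming training is launched through an Accelerate DeepSpeed config; the file name and surrounding values are illustrative, not the exact recipe from this repo:

```yaml
# zero2.yaml -- illustrative Accelerate config (adjust to your setup)
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 2                   # ZeRO-2 instead of ZeRO-3
  offload_optimizer_device: none
  offload_param_device: none
mixed_precision: bf16
num_processes: 8                  # set to your GPU count
```

With the DeepSpeed stage lowered, use_vllm can stay true in the GRPO config.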
