Description:
When enabling vLLM with the GRPO Trainer for training on 4 GPUs, I noticed that GPU 0 remains nearly idle while only two GPUs are fully utilized. A similar issue occurs when training with 3 GPUs: GPU 0 still has minimal usage, and only one GPU runs at full capacity.
This uneven GPU utilization affects overall training efficiency. I would appreciate any insights into potential causes and solutions for better workload distribution across all available GPUs.
Questions:
What could be causing this imbalance in GPU utilization?
How can I improve GPU utilization to ensure a more balanced workload distribution?
Are there any specific configurations or flags in vLLM or GRPO Trainer that could help resolve this issue?
I have attached a utilization graph illustrating the problem. Please let me know if additional logs or configuration details would be helpful in diagnosing the issue.
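In case numbers are easier to compare than the graph, per-GPU utilization can also be captured as text. The following is only a minimal sketch, assuming `nvidia-smi` is on the PATH; the helper names (`parse_gpu_stats`, `idle_gpus`, `sample_once`) and the 10% idle threshold are hypothetical choices for illustration, not part of TRL or vLLM.

```python
import subprocess

# Real nvidia-smi query flags: one CSV row per GPU, no header, no units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text):
    """Parse rows such as '0, 3, 1200' into a list of dicts."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, util, mem = (field.strip() for field in line.split(","))
        stats.append(
            {"index": int(index), "util_pct": int(util), "mem_mib": int(mem)}
        )
    return stats


def idle_gpus(stats, threshold_pct=10):
    """Return indices of GPUs whose utilization is below the threshold.

    The 10% default is an arbitrary assumption, not a recommended value.
    """
    return [s["index"] for s in stats if s["util_pct"] < threshold_pct]


def sample_once():
    """Run nvidia-smi once and return the parsed per-GPU stats."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)
```

Calling `idle_gpus(sample_once())` periodically during training would show which GPU indices stay below the threshold; a single sample can be misleading because utilization fluctuates between generation and optimization steps.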
With a very similar environment setup (except for trl 0.15.0.dev0; where is that version? I can only find 0.14.0), I encountered this error: "No inf checks were recorded for this optimizer."
When I turn off vLLM, the error is not triggered. Is there any clue as to what causes it?
Reproduction
Thank you for your time and support!
System Info
Checklist