[GRPO Trainer] Uneven GPU Utilization When Enabling vLLM with Multi-GPU Training #2825

Open
5 tasks done
aeroplanepaper opened this issue Feb 11, 2025 · 1 comment
Labels
⚡ accelerate (Related to accelerate) · 🚀 deepspeed (Related to deepspeed) · 🏋 GRPO (Related to GRPO)

Comments

@aeroplanepaper

Reproduction

Description:
When enabling vLLM with the GRPO Trainer for training on 4 GPUs, I noticed that GPU 0 remains nearly idle while only two GPUs are fully utilized. A similar issue occurs when training with 3 GPUs: GPU 0 still sees minimal usage, and only one GPU runs at full capacity.

This uneven GPU utilization affects overall training efficiency. I would appreciate any insights into potential causes and solutions for better workload distribution across all available GPUs.

Questions:

  • What could be causing this imbalance in GPU utilization?
  • How can I improve GPU utilization to ensure a more balanced workload distribution?
  • Are there any specific configurations or flags in vLLM or the GRPO Trainer that could help resolve this issue?

I have attached a utilization graph illustrating the problem, and a minimal configuration sketch is included below. Please let me know if additional logs or configuration details would be helpful in diagnosing the issue.

[Attached image: GPU utilization graph illustrating the imbalance]
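
For reference, here is a minimal sketch of the kind of GRPO + vLLM setup in which I see this behavior. The model, dataset, reward function, and hyperparameters below are placeholders rather than my exact configuration; only the vLLM-related options (`use_vllm`, `vllm_device`, `vllm_gpu_memory_utilization`) are the ones in play.

```python
# Minimal GRPO + vLLM sketch (placeholder model/dataset/reward, not my exact setup).
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Toy reward: prefer completions close to 20 characters long.
def reward_len(completions, **kwargs):
    return [-abs(20 - len(completion)) for completion in completions]

dataset = load_dataset("trl-lib/tldr", split="train")

training_args = GRPOConfig(
    output_dir="grpo-vllm-test",
    use_vllm=True,                     # offload generation to a separate vLLM instance
    vllm_device="auto",                # "auto" should place vLLM on the first GPU not used for training
    vllm_gpu_memory_utilization=0.7,
    per_device_train_batch_size=4,
    num_generations=4,                 # num_processes * per_device_train_batch_size should be divisible by this
    gradient_accumulation_steps=4,
    bf16=True,
    logging_steps=10,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```

This is launched with the DeepSpeed accelerate config from the system info below, along the lines of `accelerate launch --num_processes 3 train_grpo.py` (the script name is a placeholder), i.e. 3 training processes on the 4 available GPUs so that one GPU is left free for vLLM generation.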

Thank you for your time and support!

System Info

  • Platform: Linux-5.15.0-122-generic-x86_64-with-glibc2.31
  • Python version: 3.10.16
  • PyTorch version: 2.5.1
  • CUDA device(s): NVIDIA H20, NVIDIA H20, NVIDIA H20, NVIDIA H20
  • Transformers version: 4.48.3
  • Accelerate version: 1.3.0
  • Accelerate config:
    • compute_environment: LOCAL_MACHINE
    • distributed_type: DEEPSPEED
    • mixed_precision: bf16
    • use_cpu: False
    • debug: False
    • num_processes: 3
    • machine_rank: 0
    • num_machines: 1
    • rdzv_backend: static
    • same_network: True
    • main_training_function: main
    • enable_cpu_affinity: False
    • deepspeed_config: {'gradient_accumulation_steps': 4, 'gradient_clipping': 1.0, 'zero3_init_flag': False, 'zero_stage': 0}
    • downcast_bf16: no
    • tpu_use_cluster: False
    • tpu_use_sudo: False
    • tpu_env: []
  • Datasets version: 3.2.0
  • HF Hub version: 0.28.1
  • TRL version: 0.15.0.dev0
  • bitsandbytes version: not installed
  • DeepSpeed version: 0.16.3
  • Diffusers version: not installed
  • Liger-Kernel version: not installed
  • LLM-Blender version: not installed
  • OpenAI version: 1.61.1
  • PEFT version: 0.14.0

Checklist

  • I have checked that my issue isn't already filed (see open issues)
  • I have included my system information
  • Any code provided is minimal, complete, and reproducible (more on MREs)
  • Any code provided is properly formatted in code blocks (no screenshots; more on code blocks)
  • Any traceback provided is complete
@github-actions github-actions bot added 🏋 GRPO Related to GRPO ⚡accelerate Related to accelerate 🚀 deepspeed Related to deepspeed labels Feb 11, 2025
@lidh15

lidh15 commented Feb 11, 2025

With a very similar environment setup (except for trl 0.15.0.dev0; where is that version? I can only find 0.14.0), I encountered the error `No inf checks were recorded for this optimizer.`
When I turn off vLLM, the error is not triggered. I wonder if there is any clue as to what causes it.
