[GRPO Trainer] Uneven GPU Utilization When Enabling vLLM with Multi-GPU Training #2825

Open
5 tasks done
aeroplanepaper opened this issue Feb 11, 2025 · 1 comment
Labels
⚡ accelerate (Related to accelerate) · 🚀 deepspeed (Related to deepspeed) · 🏋 GRPO (Related to GRPO)

Comments

@aeroplanepaper

Reproduction

Description:
When enabling vLLM with the GRPO Trainer for training on 4 GPUs, I noticed that GPU 0 remains nearly idle while only two GPUs are fully utilized. A similar issue occurs when training with 3 GPUs: GPU 0 still sees minimal usage, and only one GPU runs at full capacity.

This uneven GPU utilization affects overall training efficiency. I would appreciate any insights into potential causes and solutions for better workload distribution across all available GPUs.

Questions:

  • What could be causing this imbalance in GPU utilization?
  • How can I improve GPU utilization to ensure a more balanced workload distribution?
  • Are there any specific configurations or flags in vLLM or the GRPO Trainer that could help resolve this issue?

I have attached a utilization graph illustrating the problem, and a minimal configuration sketch is included below. Please let me know if additional logs or configuration details would be helpful in diagnosing the issue.

[Attached image: GPU utilization graph illustrating the imbalance]
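
For reference, here is a minimal sketch of the kind of GRPO + vLLM setup in which I see this behavior. The model, dataset, reward function, and hyperparameters below are placeholders rather than my exact configuration; only the vLLM-related options (`use_vllm`, `vllm_device`, `vllm_gpu_memory_utilization`) are the ones in play.

```python
# Minimal GRPO + vLLM sketch (placeholder model/dataset/reward, not my exact setup).
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Toy reward: prefer completions close to 20 characters long.
def reward_len(completions, **kwargs):
    return [-abs(20 - len(completion)) for completion in completions]

dataset = load_dataset("trl-lib/tldr", split="train")

training_args = GRPOConfig(
    output_dir="grpo-vllm-test",
    use_vllm=True,                     # offload generation to a separate vLLM instance
    vllm_device="auto",                # "auto" should place vLLM on the first GPU not used for training
    vllm_gpu_memory_utilization=0.7,
    per_device_train_batch_size=4,
    num_generations=4,                 # num_processes * per_device_train_batch_size should be divisible by this
    gradient_accumulation_steps=4,
    bf16=True,
    logging_steps=10,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```

This is launched with the DeepSpeed accelerate config from the system info below, along the lines of `accelerate launch --num_processes 3 train_grpo.py` (the script name is a placeholder), i.e. 3 training processes on the 4 available GPUs so that one GPU is left free for vLLM generation.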

Thank you for your time and support!

System Info

  • Platform: Linux-5.15.0-122-generic-x86_64-with-glibc2.31
  • Python version: 3.10.16
  • PyTorch version: 2.5.1
  • CUDA device(s): NVIDIA H20, NVIDIA H20, NVIDIA H20, NVIDIA H20
  • Transformers version: 4.48.3
  • Accelerate version: 1.3.0
  • Accelerate config:
    • compute_environment: LOCAL_MACHINE
    • distributed_type: DEEPSPEED
    • mixed_precision: bf16
    • use_cpu: False
    • debug: False
    • num_processes: 3
    • machine_rank: 0
    • num_machines: 1
    • rdzv_backend: static
    • same_network: True
    • main_training_function: main
    • enable_cpu_affinity: False
    • deepspeed_config: {'gradient_accumulation_steps': 4, 'gradient_clipping': 1.0, 'zero3_init_flag': False, 'zero_stage': 0}
    • downcast_bf16: no
    • tpu_use_cluster: False
    • tpu_use_sudo: False
    • tpu_env: []
  • Datasets version: 3.2.0
  • HF Hub version: 0.28.1
  • TRL version: 0.15.0.dev0
  • bitsandbytes version: not installed
  • DeepSpeed version: 0.16.3
  • Diffusers version: not installed
  • Liger-Kernel version: not installed
  • LLM-Blender version: not installed
  • OpenAI version: 1.61.1
  • PEFT version: 0.14.0

Checklist

  • I have checked that my issue isn't already filed (see open issues)
  • I have included my system information
  • Any code provided is minimal, complete, and reproducible (more on MREs)
  • Any code provided is properly formatted in code blocks (no screenshots; more on code blocks)
  • Any traceback provided is complete
@github-actions github-actions bot added 🏋 GRPO Related to GRPO ⚡accelerate Related to accelerate 🚀 deepspeed Related to deepspeed labels Feb 11, 2025
@lidh15

lidh15 commented Feb 11, 2025

With a very similar environment setup (except for trl 0.15.0.dev0; where is that version? I can only find 0.14.0), I encountered the error `No inf checks were recorded for this optimizer.`
When I turn off vLLM, the error is not triggered. I wonder if there is any clue as to what causes it.
