GRPO OOM #475
4× A100, CUDA 12.1
Try using DeepSpeed ZeRO-3: recipes/accelerate_configs/zero3.yaml
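For reference, an accelerate config that enables ZeRO-3 typically looks roughly like the sketch below. The field values here are illustrative assumptions, not a copy of the repo's recipes/accelerate_configs/zero3.yaml, which remains the source of truth:

```yaml
# Minimal sketch of a ZeRO-3 accelerate config (assumed values, not the repo file)
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 3                   # shard parameters, gradients, and optimizer states
  zero3_init_flag: true           # build the model directly in sharded form
  offload_optimizer_device: none  # set to "cpu" for extra GPU headroom at some speed cost
  offload_param_device: none
mixed_precision: bf16
num_machines: 1
num_processes: 3                  # one process per training GPU; the remaining GPU serves vLLM
use_cpu: false
```

Passing a config like this via `--config_file` in place of zero2.yaml shards the 7B model's parameters, gradients, and optimizer states across the training GPUs instead of replicating them on each one, which is usually what frees enough memory in this setup.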
I ran into a similar issue with almost the same model and environment. I tried changing mbs, max_completion_length, and vllm_gpu_memory_utilization, and set DeepSpeed ZeRO-3, but it still didn't solve the problem. It looks like the main issue is that vLLM's KV cache initialization takes up too much memory. After switching to a smaller model (R1-distilled Qwen-1.5B), I was able to train without any problems.
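For anyone tuning the same knobs, the overrides below sketch the direction of those changes; the specific values are illustrative assumptions, not a verified working recipe:

```yaml
# Illustrative memory-saving overrides for the GRPO config (assumed values)
vllm_gpu_memory_utilization: 0.5  # fraction of the vLLM GPU that vLLM may claim (mostly KV cache)
max_completion_length: 512        # shorter generations shrink KV cache and activation memory
per_device_train_batch_size: 1    # "mbs": micro-batch size per GPU
gradient_accumulation_steps: 16   # recover effective batch size lost to the small micro-batch
gradient_checkpointing: true      # trade recompute for activation memory
```

Lowering vllm_gpu_memory_utilization mainly shrinks the KV cache allocation mentioned above, while gradient checkpointing and accumulation keep the training-side memory in check.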
Can you provide the full traceback? The solution depends on when the OOM occurs.
Really, it works.
How do I handle this problem?
Anybody know how to handle it? |
config
```yaml
# Model arguments
model_name_or_path: "/ossfs/workspace/Logic-RL/Qwen2.5-7B-Instruct"
model_revision: main
torch_dtype: bfloat16
attn_implementation: flash_attention_2

# Data training arguments
dataset_name: DigitalLearningGmbH/MATH-lighteval
dataset_config: default
system_prompt: "You are a helpful AI Assistant, designed to provided well-reasoned and detailed responses. You FIRST think about the reasoning process as an internal monologue and then provide the user with the answer. The reasoning process MUST BE enclosed within <think> and </think> tags."

# GRPO trainer config
bf16: true
use_vllm: true
vllm_device: auto
vllm_gpu_memory_utilization: 0.7
do_eval: true
eval_strategy: steps
eval_steps: 100
gradient_accumulation_steps: 8
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
hub_model_id: Qwen-2.5-7B-Simple-RL
hub_strategy: every_save
learning_rate: 3.0e-06
log_completions: true
log_level: info
logging_first_step: true
logging_steps: 5
logging_strategy: steps
lr_scheduler_type: cosine
max_prompt_length: 512
max_completion_length: 1024
max_steps: -1
num_generations: 3
num_train_epochs: 1
output_dir: data/Qwen-2.5-7B-Simple-RL
overwrite_output_dir: true
per_device_eval_batch_size: 1
per_device_train_batch_size: 1
push_to_hub: false
report_to:
reward_funcs:
reward_weights:
save_strategy: "no"
seed: 42
warmup_ratio: 0.1
```
script:
```shell
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/zero2.yaml \
    --num_processes=3 src/open_r1/grpo.py \
    --config recipes/Qwen2.5-7B-Instruct/grpo/config_simple_rl.yaml
```
error
```
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 9.46 GiB. GPU 0 has a total capacity of 79.35 GiB of which 1.41 GiB is free. Process 29996 has 77.92 GiB memory in use. Of the allocated memory 71.05 GiB is allocated by PyTorch, and 4.77 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting
```