GRPO OOM #475
4× A100, CUDA 12.1
Try using DeepSpeed ZeRO-3: recipes/accelerate_configs/zero3.yaml
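For reference, an accelerate config that enables ZeRO-3 typically looks roughly like the sketch below. The field values here are illustrative assumptions, not a copy of the repo's recipes/accelerate_configs/zero3.yaml, which remains the source of truth:

```yaml
# Minimal sketch of a ZeRO-3 accelerate config (assumed values, not the repo file)
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 3                   # shard parameters, gradients, and optimizer states
  zero3_init_flag: true           # build the model directly in sharded form
  offload_optimizer_device: none  # set to "cpu" for extra GPU headroom at some speed cost
  offload_param_device: none
mixed_precision: bf16
num_machines: 1
num_processes: 3                  # one process per training GPU; the remaining GPU serves vLLM
use_cpu: false
```

Passing a config like this via `--config_file` in place of zero2.yaml shards the 7B model's parameters, gradients, and optimizer states across the training GPUs instead of replicating them on each one, which is usually what frees enough memory in this setup.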
I ran into a similar issue with almost the same model and environment. I tried changing mbs, max_completion_length, and vllm_gpu_memory_utilization, and set DeepSpeed ZeRO-3, but it still didn't solve the problem. It looks like the main issue is that vLLM's KV cache initialization takes up too much memory. After switching to a smaller model (R1-distilled Qwen-1.5B), I was able to train without any problems.
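For anyone tuning the same knobs, the overrides below sketch the direction of those changes; the specific values are illustrative assumptions, not a verified working recipe:

```yaml
# Illustrative memory-saving overrides for the GRPO config (assumed values)
vllm_gpu_memory_utilization: 0.5  # fraction of the vLLM GPU that vLLM may claim (mostly KV cache)
max_completion_length: 512        # shorter generations shrink KV cache and activation memory
per_device_train_batch_size: 1    # "mbs": micro-batch size per GPU
gradient_accumulation_steps: 16   # recover effective batch size lost to the small micro-batch
gradient_checkpointing: true      # trade recompute for activation memory
```

Lowering vllm_gpu_memory_utilization mainly shrinks the KV cache allocation mentioned above, while gradient checkpointing and accumulation keep the training-side memory in check.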
Can you provide the full traceback? The solution depends on when the OOM occurs.
Really, it works.
How do I handle this problem?
Anybody know how to handle it? |
config
```yaml
# Model arguments
model_name_or_path: "/ossfs/workspace/Logic-RL/Qwen2.5-7B-Instruct"
model_revision: main
torch_dtype: bfloat16
attn_implementation: flash_attention_2

# Data training arguments
dataset_name: DigitalLearningGmbH/MATH-lighteval
dataset_config: default
system_prompt: "You are a helpful AI Assistant, designed to provided well-reasoned and detailed responses. You FIRST think about the reasoning process as an internal monologue and then provide the user with the answer. The reasoning process MUST BE enclosed within <think> and </think> tags."

# GRPO trainer config
bf16: true
use_vllm: true
vllm_device: auto
vllm_gpu_memory_utilization: 0.7
do_eval: true
eval_strategy: steps
eval_steps: 100
gradient_accumulation_steps: 8
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
hub_model_id: Qwen-2.5-7B-Simple-RL
hub_strategy: every_save
learning_rate: 3.0e-06
log_completions: true
log_level: info
logging_first_step: true
logging_steps: 5
logging_strategy: steps
lr_scheduler_type: cosine
max_prompt_length: 512
max_completion_length: 1024
max_steps: -1
num_generations: 3
num_train_epochs: 1
output_dir: data/Qwen-2.5-7B-Simple-RL
overwrite_output_dir: true
per_device_eval_batch_size: 1
per_device_train_batch_size: 1
push_to_hub: false
report_to:
reward_funcs:
reward_weights:
save_strategy: "no"
seed: 42
warmup_ratio: 0.1
```
script:
```shell
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/zero2.yaml \
    --num_processes=3 src/open_r1/grpo.py \
    --config recipes/Qwen2.5-7B-Instruct/grpo/config_simple_rl.yaml
```
error
```
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 9.46 GiB. GPU 0 has a total capacity of 79.35 GiB of which 1.41 GiB is free. Process 29996 has 77.92 GiB memory in use. Of the allocated memory 71.05 GiB is allocated by PyTorch, and 4.77 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting
```