Why don't rewards increase instead of staying at a certain value in GRPO? #474

Open
AXy1527 opened this issue Mar 5, 2025 · 0 comments



I used the gsm8k dataset for GRPO training; the model is Qwen2.5-1.5B-Instruct.
Here are my config and figures. How can I solve this?

model_name_or_path: ../experiment/models/Qwen2.5-1.5B-Instruct
model_revision: main
torch_dtype: bfloat16
attn_implementation: flash_attention_2

dataset_name: ../experiment/datasets/gsm8k/main
dataset_configs:
- train
system_prompt: "You are a helpful assistant. A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The Assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>."
num_processes: 3

bf16: true
use_vllm: true
vllm_device: auto
vllm_gpu_memory_utilization: 0.8
do_eval: false
gradient_accumulation_steps: 4
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
learning_rate: 1.0e-06
log_completions: true
log_level: info
logging_first_step: true
logging_steps: 1
logging_strategy: steps
lr_scheduler_type: cosine
max_prompt_length: 512
max_completion_length: 768
max_steps: 500
num_generations: 8
num_train_epochs: 1
output_dir: outputs/Qwen2.5-1.5B-Open-R1-GRPO-new-5
overwrite_output_dir: true
per_device_eval_batch_size: 16
per_device_train_batch_size: 16
push_to_hub: false
report_to:
- tensorboard
reward_funcs:
- accuracy
- format
reward_weights:
- 1.0
- 1.0
save_strategy: "steps"
save_steps: 50
seed: 42
warmup_ratio: 0.1
beta: 0.04
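For reference, the total reward under this config is the weighted sum of the two listed reward functions (accuracy and format) using reward_weights. Here is a minimal sketch of that combination; the two reward functions below are hypothetical stand-ins for illustration, not open-r1's actual implementations:

```python
import re

def format_reward(completion: str) -> float:
    # Hypothetical: 1.0 if the completion follows the
    # <think>...</think><answer>...</answer> template, else 0.0.
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, completion, re.DOTALL) else 0.0

def accuracy_reward(completion: str, gold: str) -> float:
    # Hypothetical: 1.0 if the text inside <answer> tags matches the
    # reference answer, else 0.0.
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return 1.0 if m and m.group(1).strip() == gold.strip() else 0.0

def total_reward(completion: str, gold: str, weights=(1.0, 1.0)) -> float:
    # reward_weights: [1.0, 1.0] -> each reward contributes equally,
    # so the per-sample total ranges from 0.0 to 2.0.
    return (weights[0] * accuracy_reward(completion, gold)
            + weights[1] * format_reward(completion))

sample = "<think>2 + 2 = 4</think><answer>4</answer>"
print(total_reward(sample, "4"))  # 2.0 when both rewards fire
```

With both weights at 1.0, a logged reward that plateaus near a fixed value can mean one component (often the format reward) has saturated while the other has stopped improving, so it may help to inspect the per-function reward curves separately in TensorBoard.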

[Figures: TensorBoard training curves showing the reward plateauing at a fixed value]
