'rewards/accuracy_reward': 0.0 #255

Open
HarveyYi opened this issue Feb 9, 2025 · 14 comments


HarveyYi commented Feb 9, 2025

When I train GRPO with the new code, rewards/format_reward is not 0, but rewards/accuracy_reward becomes 0. I used 8×3090 GPUs without changing any code.

{'loss': 0.0, 'grad_norm': 0.7982587522423048, 'learning_rate': 1.5384615384615387e-06, 'rewards/accuracy_reward': 0.15982143627479672, 'rewards/format_reward': 0.4758928777649999, 'rewards/reasoning_steps_reward': 0.17351191807538272, 'rewards/cosine_scaled_reward': -0.09774131584854331, 'reward': 0.7114849153440446, 'reward_std': 0.5312667317688465, 'completion_length': 387.50359020233157, 'kl': 0.0001536548137664795, 'epoch': 0.01}
{'loss': 0.0002, 'grad_norm': 0.8481642609786243, 'learning_rate': 3.0769230769230774e-06, 'rewards/accuracy_reward': 0.1178571494296193, 'rewards/format_reward': 0.5544643074274063, 'rewards/reasoning_steps_reward': 0.1505952463950962, 'rewards/cosine_scaled_reward': -0.12882936951355078, 'reward': 0.6940873321145773, 'reward_std': 0.5247403308749199, 'completion_length': 378.24644565582275, 'kl': 0.0039509057998657225, 'epoch': 0.02}
{'loss': 0.0045, 'grad_norm': 1.8925111056766821, 'learning_rate': 4.615384615384616e-06, 'rewards/accuracy_reward': 0.06428571762517095, 'rewards/format_reward': 0.8571428976953029, 'rewards/reasoning_steps_reward': 0.05982143310829997, 'rewards/cosine_scaled_reward': -0.09459048047865508, 'reward': 0.8866595663130283, 'reward_std': 0.3431109061697498, 'completion_length': 190.53750896453857, 'kl': 0.11153717041015625, 'epoch': 0.02}
{'loss': 0.0038, 'grad_norm': 2.4973364177161543, 'learning_rate': 6.153846153846155e-06, 'rewards/accuracy_reward': 0.0973214341327548, 'rewards/format_reward': 0.7250000361353159, 'rewards/reasoning_steps_reward': 0.08363095866516232, 'rewards/cosine_scaled_reward': -0.06129880832741037, 'reward': 0.8446536116302014, 'reward_std': 0.4255119782872498, 'completion_length': 210.55715255737306, 'kl': 0.09504852294921876, 'epoch': 0.03}
{'loss': 0.0117, 'grad_norm': 1.8205454632746227, 'learning_rate': 7.692307692307694e-06, 'rewards/accuracy_reward': 0.06607143227010966, 'rewards/format_reward': 0.8241071775555611, 'rewards/reasoning_steps_reward': 0.057738099806010725, 'rewards/cosine_scaled_reward': -0.006089120984688634, 'reward': 0.9418275825679302, 'reward_std': 0.2382179622160038, 'completion_length': 127.70804142951965, 'kl': 0.29290771484375, 'epoch': 0.04}
{'loss': 0.0239, 'grad_norm': 2.3698283132576274, 'learning_rate': 9.230769230769232e-06, 'rewards/accuracy_reward': 0.03839285895228386, 'rewards/format_reward': 0.9375000268220901, 'rewards/reasoning_steps_reward': 0.01875000144354999, 'rewards/cosine_scaled_reward': 0.011117304899380542, 'reward': 1.0057602137327195, 'reward_std': 0.1362626419573644, 'completion_length': 56.87857400178909, 'kl': 0.59853515625, 'epoch': 0.05}
{'loss': 0.0493, 'grad_norm': 1.7423643912985216, 'learning_rate': 1.076923076923077e-05, 'rewards/accuracy_reward': 0.01696428647264838, 'rewards/format_reward': 0.9464285925030709, 'rewards/reasoning_steps_reward': 0.008035714691504835, 'rewards/cosine_scaled_reward': 0.009680203269817866, 'reward': 0.9811088144779205, 'reward_std': 0.08413466750880616, 'completion_length': 24.689286708831787, 'kl': 1.23203125, 'epoch': 0.05}
{'loss': 0.0679, 'grad_norm': 0.6228664363666383, 'learning_rate': 1.230769230769231e-05, 'rewards/accuracy_reward': 0.004464285913854837, 'rewards/format_reward': 0.9562500186264515, 'rewards/reasoning_steps_reward': 0.0047619051299989225, 'rewards/cosine_scaled_reward': -0.0007483152658096515, 'reward': 0.9647279314696788, 'reward_std': 0.051664601983452484, 'completion_length': 19.939286613464354, 'kl': 1.69814453125, 'epoch': 0.06}
{'loss': 0.0648, 'grad_norm': 3.5622746561040874, 'learning_rate': 1.3846153846153847e-05, 'rewards/accuracy_reward': 0.019642858020961284, 'rewards/format_reward': 0.9232143141329289, 'rewards/reasoning_steps_reward': 0.02738095410168171, 'rewards/cosine_scaled_reward': 0.0004586775445204694, 'reward': 0.9706968247890473, 'reward_std': 0.08530688291732531, 'completion_length': 46.170538139343265, 'kl': 1.6201171875, 'epoch': 0.07}
{'loss': 0.0853, 'grad_norm': 1.2804693499762392, 'learning_rate': 1.5384615384615387e-05, 'rewards/accuracy_reward': 0.021428572479635477, 'rewards/format_reward': 0.918750025331974, 'rewards/reasoning_steps_reward': 0.026488097058609127, 'rewards/cosine_scaled_reward': 0.004570821033848915, 'reward': 0.9712375432252884, 'reward_std': 0.09768198747647147, 'completion_length': 45.07946652173996, 'kl': 2.1318359375, 'epoch': 0.08}
{'loss': 0.0893, 'grad_norm': 0.5919450757201359, 'learning_rate': 1.6923076923076924e-05, 'rewards/accuracy_reward': 0.00357142873108387, 'rewards/format_reward': 0.980357152223587, 'rewards/reasoning_steps_reward': 0.004464285960420966, 'rewards/cosine_scaled_reward': -0.00047555512719554827, 'reward': 0.9879173591732979, 'reward_std': 0.025026805807931395, 'completion_length': 17.32410798072815, 'kl': 2.23115234375, 'epoch': 0.09}
{'loss': 0.1188, 'grad_norm': 0.3747100024898707, 'learning_rate': 1.8461538461538465e-05, 'rewards/accuracy_reward': 0.0, 'rewards/format_reward': 0.9982142865657806, 'rewards/reasoning_steps_reward': 0.0, 'rewards/cosine_scaled_reward': -0.001305083982879296, 'reward': 0.9969092696905136, 'reward_std': 0.0025804866171370124, 'completion_length': 10.323214709758759, 'kl': 2.96875, 'epoch': 0.09}
{'loss': 0.5144, 'grad_norm': 18.3113651332332, 'learning_rate': 2e-05, 'rewards/accuracy_reward': 0.0, 'rewards/format_reward': 0.8821428813040256, 'rewards/reasoning_steps_reward': 0.0, 'rewards/cosine_scaled_reward': -0.010261251570773311, 'reward': 0.8718816455453634, 'reward_std': 0.1483902873678858, 'completion_length': 135.39107729196547, 'kl': 12.84677734375, 'epoch': 0.1}
{'loss': 0.1232, 'grad_norm': 1.5027082937249927, 'learning_rate': 1.999634547413886e-05, 'rewards/accuracy_reward': 0.0, 'rewards/format_reward': 0.9241071708500386, 'rewards/reasoning_steps_reward': 0.0, 'rewards/cosine_scaled_reward': -0.009401893630274572, 'reward': 0.9147053003311157, 'reward_std': 0.1095106621832997, 'completion_length': 56.35803804397583, 'kl': 3.0787109375, 'epoch': 0.11}
{'loss': 0.1343, 'grad_norm': 0.5941749161412674, 'learning_rate': 1.9985384567667278e-05, 'rewards/accuracy_reward': 0.0, 'rewards/format_reward': 0.963392873108387, 'rewards/reasoning_steps_reward': 0.00029761907644569873, 'rewards/cosine_scaled_reward': -0.007446287681523245, 'reward': 0.9562442392110825, 'reward_std': 0.057977127504466354, 'completion_length': 17.40000056028366, 'kl': 3.3583984375, 'epoch': 0.12}
{'loss': 0.1231, 'grad_norm': 0.5663528276499906, 'learning_rate': 1.9967125291968495e-05, 'rewards/accuracy_reward': 0.0, 'rewards/format_reward': 0.9464285895228386, 'rewards/reasoning_steps_reward': 0.0005952381528913975, 'rewards/cosine_scaled_reward': -0.005292993299372028, 'reward': 0.9417308583855629, 'reward_std': 0.06546628615196823, 'completion_length': 27.321429586410524, 'kl': 3.0771484375, 'epoch': 0.12}
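
For context, the accuracy reward in these recipes is essentially a binary check of the extracted final answer against the gold solution, so it reads exactly 0.0 once no completion in a batch contains a verifiable correct answer. A rough sketch of that kind of reward function (the extraction and comparison logic here is simplified for illustration; the repo's implementation uses a proper math verifier rather than string matching):

```python
import re

def accuracy_reward(completions, solutions):
    """Toy accuracy reward: 1.0 when the text inside <answer>...</answer>
    matches the gold solution after naive normalization, else 0.0."""
    rewards = []
    for completion, solution in zip(completions, solutions):
        match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        predicted = (match.group(1) if match else "").strip()
        # Naive string comparison; the real recipe verifies math equivalence instead.
        rewards.append(1.0 if predicted.replace(" ", "") == solution.replace(" ", "") else 0.0)
    return rewards

# A well-formatted but wrong completion still scores 0.0 here:
print(accuracy_reward(["<think>short</think><answer>42</answer>"], ["41"]))  # [0.0]
```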

@HarveyYi (Author)

HarveyYi commented Feb 9, 2025

[Two images: green is the old, purple is the new.]

@hellen9527

I have the same issue. However, in my case, the initial reward is normal, but in the mid-to-late stages of training, the reward drops to 0. Additionally, the KL divergence becomes very large. It starts small at the beginning but increases more and more over time.
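
For reference, GRPO-style trainers typically estimate the per-token KL against the frozen reference model with the k3 estimator, so a large reported kl value means the policy has drifted far from the reference. A minimal sketch, assuming per-token log-probabilities from both models are already available:

```python
import torch

def per_token_kl(policy_logps: torch.Tensor, ref_logps: torch.Tensor) -> torch.Tensor:
    """k3 estimator of KL(policy || ref): exp(ref - pol) - (ref - pol) - 1.
    It is non-negative and grows quickly once the policy drifts from the reference."""
    delta = ref_logps - policy_logps
    return torch.exp(delta) - delta - 1

# Identical log-probs give 0; a large gap blows the estimate up.
print(per_token_kl(torch.tensor([-1.0, -2.0]), torch.tensor([-1.0, -5.0])))  # tensor([0.0000, 2.0498])
```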

@tenacioustommy

I have the same problem. At a certain point, the format reward increases to about 1, while the accuracy reward drops to 0.

@liuchengyuan123

Same

@HarveyYi (Author)

The new code loads the model Qwen2.5-1.5B-Instruct:

model_name_or_path: Qwen/Qwen2.5-1.5B-Instruct

which means it uses the R1-Zero approach.

In contrast, the old code uses the following model path:

model_name_or_path: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B

which follows the R1 approach.

When I change the model to DeepSeek-R1-Distill-Qwen-1.5B, the rewards/accuracy_reward does not decrease, but the rewards/format_reward becomes 0.

Now, I am confused. I do not understand the performance of the larger model. If the larger model performs well, maybe it indicates that the smaller model cannot reproduce the "aha" moment or that the 3090's architecture is not suitable? I see that this log is fine: #239 (comment), which uses an L20 48G.

@wander996

Same Problem

> I have the same problem. At a certain point, the format reward increases to about 1, while the accuracy reward drops to 0.

@liuchengyuan123

> The new code loads the model Qwen2.5-1.5B-Instruct:
>
> model_name_or_path: Qwen/Qwen2.5-1.5B-Instruct
>
> which means it uses the R1-Zero approach.
>
> In contrast, the old code uses the following model path:
>
> model_name_or_path: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
>
> which follows the R1 approach.
>
> When I change the model to DeepSeek-R1-Distill-Qwen-1.5B, the rewards/accuracy_reward does not decrease, but the rewards/format_reward becomes 0.
>
> Now, I am confused. I do not understand the performance of the larger model. If the larger model performs well, maybe it indicates that the smaller model cannot reproduce the "aha" moment or that the 3090's architecture is not suitable? I see that this log is fine: #239 (comment), which uses an L20 48G.

The problem of format rewards seems to be caused by the regex expression, according to other issues?

@HarveyYi (Author)

> The new code loads the model Qwen2.5-1.5B-Instruct:
> model_name_or_path: Qwen/Qwen2.5-1.5B-Instruct
> which means it uses the R1-Zero approach.
> In contrast, the old code uses the following model path:
> model_name_or_path: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
> which follows the R1 approach.
> When I change the model to DeepSeek-R1-Distill-Qwen-1.5B, the rewards/accuracy_reward does not decrease, but the rewards/format_reward becomes 0.
> Now, I am confused. I do not understand the performance of the larger model. If the larger model performs well, maybe it indicates that the smaller model cannot reproduce the "aha" moment or that the 3090's architecture is not suitable? I see that this log is fine: #239 (comment), which uses an L20 48G.
>
> The problem of format rewards seems to be caused by the regex expression, according to other issues?

I found the reason in #198.


liuchengyuan123 commented Feb 11, 2025

I found that my original batch size was 2 (per_device_batch_size) * 3 (num_processes) / 6 (num_generations) * 2 (gradient_accumulation_steps) = 2 (hoping I am not missing any details).

When I increase the gradient_accumulation_steps (increasing batch size), the accuracy reward seems to be non-zero. I hope this solution helps.
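
To make that arithmetic concrete, here is a small sketch (the variable names just mirror the config fields mentioned above): in GRPO each prompt is sampled num_generations times, so dividing the completions per optimizer step by num_generations gives the number of distinct prompts that step actually sees.

```python
per_device_batch_size = 2
num_processes = 3
gradient_accumulation_steps = 2
num_generations = 6

# Completions contributing to one optimizer step across all devices and accumulation steps.
completions_per_step = per_device_batch_size * num_processes * gradient_accumulation_steps
# Each prompt is repeated num_generations times, so only this many distinct prompts per step.
prompts_per_step = completions_per_step / num_generations
print(prompts_per_step)  # 2.0

# Doubling gradient_accumulation_steps doubles the distinct prompts per step.
print(per_device_batch_size * num_processes * (2 * gradient_accumulation_steps) / num_generations)  # 4.0
```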

[Image]

@hellen9527

> I found the reason in #198.

@HarveyYi Does this mean that to train the GRPO model, it should be based on the original instruct-based model, rather than the model distilled from DS? Has the distilled DS model started to lose some of its learning capabilities, or is it because they distilled too much data (800,000)?

@HarveyYi (Author)

> I found the reason in #198.
>
> @HarveyYi Does this mean that to train the GRPO model, it should be based on the original instruct-based model, rather than the model distilled from DS? Has the distilled DS model started to lose some of its learning capabilities, or is it because they distilled too much data (800,000)?

I am still quite puzzled and am in the process of investigating the reasons. I noticed that you have successfully replicated the 1.5b model on your end. May I have a chat with you about it?

@hellen9527

> I found the reason in #198.
>
> @HarveyYi Does this mean that to train the GRPO model, it should be based on the original instruct-based model, rather than the model distilled from DS? Has the distilled DS model started to lose some of its learning capabilities, or is it because they distilled too much data (800,000)?
>
> I am still quite puzzled and am in the process of investigating the reasons. I noticed that you have successfully replicated the 1.5b model on your end. May I have a chat with you about it?

Sure. But I can only get it to run through; by the end of training the accuracy is still not very high, and I am not sure whether the 1.5B model is too small or whether it just was not trained well. Let me know how to message you privately.

@HarveyYi (Author)

[Image: QR code]

Could we create a group chat to discuss this together?

@LiuChen19960902

This also feels like a kind of reward hacking: the model learns to output things like <think>a very short reasoning process</think><answer>a wrong answer</answer>. In that case the format and cosine_scaled rewards are high, but the accuracy is actually very low.
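
To illustrate the pattern described above: a regex-style format reward (the pattern below is illustrative, not necessarily the repo's exact regex) gives full credit to a trivially short completion as long as the tags are in place, regardless of whether the answer is right.

```python
import re

# Illustrative format reward: 1.0 whenever the completion matches
# <think>...</think><answer>...</answer>, with no check on correctness.
FORMAT_PATTERN = re.compile(r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL)

def format_reward(completions):
    return [1.0 if FORMAT_PATTERN.match(c.strip()) else 0.0 for c in completions]

degenerate = "<think>easy</think><answer>totally wrong answer</answer>"
print(format_reward([degenerate]))  # [1.0] -> format reward is maxed out
# ...while an accuracy check on the wrong answer would still return 0.0.
```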
