'rewards/accuracy_reward': 0.0 #255

Open
HarveyYi opened this issue Feb 9, 2025 · 14 comments


HarveyYi commented Feb 9, 2025

When I train GRPO with the new code, rewards/format_reward is not 0, but rewards/accuracy_reward becomes 0. I used 8×3090 GPUs without changing any code.

{'loss': 0.0, 'grad_norm': 0.7982587522423048, 'learning_rate': 1.5384615384615387e-06, 'rewards/accuracy_reward': 0.15982143627479672, 'rewards/format_reward': 0.4758928777649999, 'rewards/reasoning_steps_reward': 0.17351191807538272, 'rewards/cosine_scaled_reward': -0.09774131584854331, 'reward': 0.7114849153440446, 'reward_std': 0.5312667317688465, 'completion_length': 387.50359020233157, 'kl': 0.0001536548137664795, 'epoch': 0.01}
{'loss': 0.0002, 'grad_norm': 0.8481642609786243, 'learning_rate': 3.0769230769230774e-06, 'rewards/accuracy_reward': 0.1178571494296193, 'rewards/format_reward': 0.5544643074274063, 'rewards/reasoning_steps_reward': 0.1505952463950962, 'rewards/cosine_scaled_reward': -0.12882936951355078, 'reward': 0.6940873321145773, 'reward_std': 0.5247403308749199, 'completion_length': 378.24644565582275, 'kl': 0.0039509057998657225, 'epoch': 0.02}
{'loss': 0.0045, 'grad_norm': 1.8925111056766821, 'learning_rate': 4.615384615384616e-06, 'rewards/accuracy_reward': 0.06428571762517095, 'rewards/format_reward': 0.8571428976953029, 'rewards/reasoning_steps_reward': 0.05982143310829997, 'rewards/cosine_scaled_reward': -0.09459048047865508, 'reward': 0.8866595663130283, 'reward_std': 0.3431109061697498, 'completion_length': 190.53750896453857, 'kl': 0.11153717041015625, 'epoch': 0.02}
{'loss': 0.0038, 'grad_norm': 2.4973364177161543, 'learning_rate': 6.153846153846155e-06, 'rewards/accuracy_reward': 0.0973214341327548, 'rewards/format_reward': 0.7250000361353159, 'rewards/reasoning_steps_reward': 0.08363095866516232, 'rewards/cosine_scaled_reward': -0.06129880832741037, 'reward': 0.8446536116302014, 'reward_std': 0.4255119782872498, 'completion_length': 210.55715255737306, 'kl': 0.09504852294921876, 'epoch': 0.03}
{'loss': 0.0117, 'grad_norm': 1.8205454632746227, 'learning_rate': 7.692307692307694e-06, 'rewards/accuracy_reward': 0.06607143227010966, 'rewards/format_reward': 0.8241071775555611, 'rewards/reasoning_steps_reward': 0.057738099806010725, 'rewards/cosine_scaled_reward': -0.006089120984688634, 'reward': 0.9418275825679302, 'reward_std': 0.2382179622160038, 'completion_length': 127.70804142951965, 'kl': 0.29290771484375, 'epoch': 0.04}
{'loss': 0.0239, 'grad_norm': 2.3698283132576274, 'learning_rate': 9.230769230769232e-06, 'rewards/accuracy_reward': 0.03839285895228386, 'rewards/format_reward': 0.9375000268220901, 'rewards/reasoning_steps_reward': 0.01875000144354999, 'rewards/cosine_scaled_reward': 0.011117304899380542, 'reward': 1.0057602137327195, 'reward_std': 0.1362626419573644, 'completion_length': 56.87857400178909, 'kl': 0.59853515625, 'epoch': 0.05}
{'loss': 0.0493, 'grad_norm': 1.7423643912985216, 'learning_rate': 1.076923076923077e-05, 'rewards/accuracy_reward': 0.01696428647264838, 'rewards/format_reward': 0.9464285925030709, 'rewards/reasoning_steps_reward': 0.008035714691504835, 'rewards/cosine_scaled_reward': 0.009680203269817866, 'reward': 0.9811088144779205, 'reward_std': 0.08413466750880616, 'completion_length': 24.689286708831787, 'kl': 1.23203125, 'epoch': 0.05}
{'loss': 0.0679, 'grad_norm': 0.6228664363666383, 'learning_rate': 1.230769230769231e-05, 'rewards/accuracy_reward': 0.004464285913854837, 'rewards/format_reward': 0.9562500186264515, 'rewards/reasoning_steps_reward': 0.0047619051299989225, 'rewards/cosine_scaled_reward': -0.0007483152658096515, 'reward': 0.9647279314696788, 'reward_std': 0.051664601983452484, 'completion_length': 19.939286613464354, 'kl': 1.69814453125, 'epoch': 0.06}
{'loss': 0.0648, 'grad_norm': 3.5622746561040874, 'learning_rate': 1.3846153846153847e-05, 'rewards/accuracy_reward': 0.019642858020961284, 'rewards/format_reward': 0.9232143141329289, 'rewards/reasoning_steps_reward': 0.02738095410168171, 'rewards/cosine_scaled_reward': 0.0004586775445204694, 'reward': 0.9706968247890473, 'reward_std': 0.08530688291732531, 'completion_length': 46.170538139343265, 'kl': 1.6201171875, 'epoch': 0.07}
{'loss': 0.0853, 'grad_norm': 1.2804693499762392, 'learning_rate': 1.5384615384615387e-05, 'rewards/accuracy_reward': 0.021428572479635477, 'rewards/format_reward': 0.918750025331974, 'rewards/reasoning_steps_reward': 0.026488097058609127, 'rewards/cosine_scaled_reward': 0.004570821033848915, 'reward': 0.9712375432252884, 'reward_std': 0.09768198747647147, 'completion_length': 45.07946652173996, 'kl': 2.1318359375, 'epoch': 0.08}
{'loss': 0.0893, 'grad_norm': 0.5919450757201359, 'learning_rate': 1.6923076923076924e-05, 'rewards/accuracy_reward': 0.00357142873108387, 'rewards/format_reward': 0.980357152223587, 'rewards/reasoning_steps_reward': 0.004464285960420966, 'rewards/cosine_scaled_reward': -0.00047555512719554827, 'reward': 0.9879173591732979, 'reward_std': 0.025026805807931395, 'completion_length': 17.32410798072815, 'kl': 2.23115234375, 'epoch': 0.09}
{'loss': 0.1188, 'grad_norm': 0.3747100024898707, 'learning_rate': 1.8461538461538465e-05, 'rewards/accuracy_reward': 0.0, 'rewards/format_reward': 0.9982142865657806, 'rewards/reasoning_steps_reward': 0.0, 'rewards/cosine_scaled_reward': -0.001305083982879296, 'reward': 0.9969092696905136, 'reward_std': 0.0025804866171370124, 'completion_length': 10.323214709758759, 'kl': 2.96875, 'epoch': 0.09}
{'loss': 0.5144, 'grad_norm': 18.3113651332332, 'learning_rate': 2e-05, 'rewards/accuracy_reward': 0.0, 'rewards/format_reward': 0.8821428813040256, 'rewards/reasoning_steps_reward': 0.0, 'rewards/cosine_scaled_reward': -0.010261251570773311, 'reward': 0.8718816455453634, 'reward_std': 0.1483902873678858, 'completion_length': 135.39107729196547, 'kl': 12.84677734375, 'epoch': 0.1}
{'loss': 0.1232, 'grad_norm': 1.5027082937249927, 'learning_rate': 1.999634547413886e-05, 'rewards/accuracy_reward': 0.0, 'rewards/format_reward': 0.9241071708500386, 'rewards/reasoning_steps_reward': 0.0, 'rewards/cosine_scaled_reward': -0.009401893630274572, 'reward': 0.9147053003311157, 'reward_std': 0.1095106621832997, 'completion_length': 56.35803804397583, 'kl': 3.0787109375, 'epoch': 0.11}
{'loss': 0.1343, 'grad_norm': 0.5941749161412674, 'learning_rate': 1.9985384567667278e-05, 'rewards/accuracy_reward': 0.0, 'rewards/format_reward': 0.963392873108387, 'rewards/reasoning_steps_reward': 0.00029761907644569873, 'rewards/cosine_scaled_reward': -0.007446287681523245, 'reward': 0.9562442392110825, 'reward_std': 0.057977127504466354, 'completion_length': 17.40000056028366, 'kl': 3.3583984375, 'epoch': 0.12}
{'loss': 0.1231, 'grad_norm': 0.5663528276499906, 'learning_rate': 1.9967125291968495e-05, 'rewards/accuracy_reward': 0.0, 'rewards/format_reward': 0.9464285895228386, 'rewards/reasoning_steps_reward': 0.0005952381528913975, 'rewards/cosine_scaled_reward': -0.005292993299372028, 'reward': 0.9417308583855629, 'reward_std': 0.06546628615196823, 'completion_length': 27.321429586410524, 'kl': 3.0771484375, 'epoch': 0.12}
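
For context, the accuracy reward in these recipes is essentially a binary check of the extracted final answer against the gold solution, so it reads exactly 0.0 once no completion in a batch contains a verifiable correct answer. A rough sketch of that kind of reward function (the extraction and comparison logic here is simplified for illustration; the repo's implementation uses a proper math verifier rather than string matching):

```python
import re

def accuracy_reward(completions, solutions):
    """Toy accuracy reward: 1.0 when the text inside <answer>...</answer>
    matches the gold solution after naive normalization, else 0.0."""
    rewards = []
    for completion, solution in zip(completions, solutions):
        match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        predicted = (match.group(1) if match else "").strip()
        # Naive string comparison; the real recipe verifies math equivalence instead.
        rewards.append(1.0 if predicted.replace(" ", "") == solution.replace(" ", "") else 0.0)
    return rewards

# A well-formatted but wrong completion still scores 0.0 here:
print(accuracy_reward(["<think>short</think><answer>42</answer>"], ["41"]))  # [0.0]
```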

@HarveyYi (Author)

HarveyYi commented Feb 9, 2025

[Two images: green is the old, purple is the new.]

@hellen9527

I have the same issue. However, in my case, the initial reward is normal, but in the mid-to-late stages of training, the reward drops to 0. Additionally, the KL divergence becomes very large. It starts small at the beginning but increases more and more over time.
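
For reference, GRPO-style trainers typically estimate the per-token KL against the frozen reference model with the k3 estimator, so a large reported kl value means the policy has drifted far from the reference. A minimal sketch, assuming per-token log-probabilities from both models are already available:

```python
import torch

def per_token_kl(policy_logps: torch.Tensor, ref_logps: torch.Tensor) -> torch.Tensor:
    """k3 estimator of KL(policy || ref): exp(ref - pol) - (ref - pol) - 1.
    It is non-negative and grows quickly once the policy drifts from the reference."""
    delta = ref_logps - policy_logps
    return torch.exp(delta) - delta - 1

# Identical log-probs give 0; a large gap blows the estimate up.
print(per_token_kl(torch.tensor([-1.0, -2.0]), torch.tensor([-1.0, -5.0])))  # tensor([0.0000, 2.0498])
```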

@tenacioustommy

I have the same problem. At a certain point, the format reward increases to about 1, while the accuracy reward drops to 0.

@liuchengyuan123

Same

@HarveyYi (Author)

The new code loads the model Qwen2.5-1.5B-Instruct:

model_name_or_path: Qwen/Qwen2.5-1.5B-Instruct

which means it uses the R1-Zero approach.

In contrast, the old code uses the following model path:

model_name_or_path: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B

which follows the R1 approach.

When I change the model to DeepSeek-R1-Distill-Qwen-1.5B, the rewards/accuracy_reward does not decrease, but the rewards/format_reward becomes 0.

Now, I am confused. I do not understand the performance of the larger model. If the larger model performs well, maybe it indicates that the smaller model cannot reproduce the "aha" moment or that the 3090's architecture is not suitable? I see that this log is fine: #239 (comment), which uses an L20 48G.

@wander996

Same Problem

> I have the same problem. At a certain point, the format reward increases to about 1, while the accuracy reward drops to 0.

@liuchengyuan123

> The new code loads the model Qwen2.5-1.5B-Instruct:
>
> model_name_or_path: Qwen/Qwen2.5-1.5B-Instruct
>
> which means it uses the R1-Zero approach.
>
> In contrast, the old code uses the following model path:
>
> model_name_or_path: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
>
> which follows the R1 approach.
>
> When I change the model to DeepSeek-R1-Distill-Qwen-1.5B, the rewards/accuracy_reward does not decrease, but the rewards/format_reward becomes 0.
>
> Now, I am confused. I do not understand the performance of the larger model. If the larger model performs well, maybe it indicates that the smaller model cannot reproduce the "aha" moment or that the 3090's architecture is not suitable? I see that this log is fine: #239 (comment), which uses an L20 48G.

The problem of format rewards seems to be caused by the regex expression, according to other issues?

@HarveyYi (Author)

> The new code loads the model Qwen2.5-1.5B-Instruct:
> model_name_or_path: Qwen/Qwen2.5-1.5B-Instruct
> which means it uses the R1-Zero approach.
> In contrast, the old code uses the following model path:
> model_name_or_path: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
> which follows the R1 approach.
> When I change the model to DeepSeek-R1-Distill-Qwen-1.5B, the rewards/accuracy_reward does not decrease, but the rewards/format_reward becomes 0.
> Now, I am confused. I do not understand the performance of the larger model. If the larger model performs well, maybe it indicates that the smaller model cannot reproduce the "aha" moment or that the 3090's architecture is not suitable? I see that this log is fine: #239 (comment), which uses an L20 48G.
>
> The problem of format rewards seems to be caused by the regex expression, according to other issues?

I found the reason in #198.


liuchengyuan123 commented Feb 11, 2025

I found that my original batch size was 2 (per_device_batch_size) * 3 (num_processes) / 6 (num_generations) * 2 (gradient_accumulation_steps) = 2 (hoping I am not missing any details).

When I increase the gradient_accumulation_steps (increasing batch size), the accuracy reward seems to be non-zero. I hope this solution helps.
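
To make that arithmetic concrete, here is a small sketch (the variable names just mirror the config fields mentioned above): in GRPO each prompt is sampled num_generations times, so dividing the completions per optimizer step by num_generations gives the number of distinct prompts that step actually sees.

```python
per_device_batch_size = 2
num_processes = 3
gradient_accumulation_steps = 2
num_generations = 6

# Completions contributing to one optimizer step across all devices and accumulation steps.
completions_per_step = per_device_batch_size * num_processes * gradient_accumulation_steps
# Each prompt is repeated num_generations times, so only this many distinct prompts per step.
prompts_per_step = completions_per_step / num_generations
print(prompts_per_step)  # 2.0

# Doubling gradient_accumulation_steps doubles the distinct prompts per step.
print(per_device_batch_size * num_processes * (2 * gradient_accumulation_steps) / num_generations)  # 4.0
```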

[Image]

@hellen9527

> I found the reason in #198.

@HarveyYi Does this mean that to train the GRPO model, it should be based on the original instruct-based model, rather than the model distilled from DS? Has the distilled DS model started to lose some of its learning capabilities, or is it because they distilled too much data (800,000)?

@HarveyYi (Author)

> I found the reason in #198.
>
> @HarveyYi Does this mean that to train the GRPO model, it should be based on the original instruct-based model, rather than the model distilled from DS? Has the distilled DS model started to lose some of its learning capabilities, or is it because they distilled too much data (800,000)?

I am still quite puzzled and am in the process of investigating the reasons. I noticed that you have successfully replicated the 1.5b model on your end. May I have a chat with you about it?

@hellen9527

> I found the reason in #198.
>
> @HarveyYi Does this mean that to train the GRPO model, it should be based on the original instruct-based model, rather than the model distilled from DS? Has the distilled DS model started to lose some of its learning capabilities, or is it because they distilled too much data (800,000)?
>
> I am still quite puzzled and am in the process of investigating the reasons. I noticed that you have successfully replicated the 1.5b model on your end. May I have a chat with you about it?

Sure. But I can only get it to run through; by the end of training the accuracy is still not very high, and I am not sure whether the 1.5B model is too small or whether it just was not trained well. Let me know how to message you privately.

@HarveyYi (Author)

[Image: QR code]

Could we create a group chat to discuss this together?

@LiuChen19960902

This also feels like a kind of reward hacking: the model learns to output things like <think>a very short reasoning process</think><answer>a wrong answer</answer>. In that case the format and cosine_scaled rewards are high, but the accuracy is actually very low.
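
To illustrate the pattern described above: a regex-style format reward (the pattern below is illustrative, not necessarily the repo's exact regex) gives full credit to a trivially short completion as long as the tags are in place, regardless of whether the answer is right.

```python
import re

# Illustrative format reward: 1.0 whenever the completion matches
# <think>...</think><answer>...</answer>, with no check on correctness.
FORMAT_PATTERN = re.compile(r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL)

def format_reward(completions):
    return [1.0 if FORMAT_PATTERN.match(c.strip()) else 0.0 for c in completions]

degenerate = "<think>easy</think><answer>totally wrong answer</answer>"
print(format_reward([degenerate]))  # [1.0] -> format reward is maxed out
# ...while an accuracy check on the wrong answer would still return 0.0.
```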
