'rewards/accuracy_reward': 0.0 #255
I have the same issue. However, in my case, the initial reward is normal, but in the mid-to-late stages of training, the reward drops to 0. Additionally, the KL divergence becomes very large: it starts small at the beginning but keeps increasing over time.
I have the same problem. At a certain point the format reward increases to about 1, while the accuracy reward drops to 0.
Same.
The new code loads Qwen2.5-1.5B-Instruct (model_name_or_path: Qwen/Qwen2.5-1.5B-Instruct), which means it uses the R1-Zero approach. In contrast, the old code uses model_name_or_path: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B, which follows the R1 approach. When I change the model to DeepSeek-R1-Distill-Qwen-1.5B, rewards/accuracy_reward does not decrease, but rewards/format_reward becomes 0. Now I am confused: I do not understand the behavior of the larger model. If the larger model performs well, maybe it indicates that the smaller model cannot reproduce the "aha" moment, or that the 3090's architecture is not suitable? I see that this log is fine: #239 (comment), which uses an L20 48G.
Same problem.
According to other issues, the format reward problem seems to be caused by the regex expression?
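For context, the format reward is essentially a regex match over the completion; here is a minimal sketch of that kind of check (illustrative only; the exact pattern used in the repo may differ):

```python
import re

def format_reward_sketch(completion: str) -> float:
    """Illustrative format check: 1.0 if the completion wraps its reasoning in
    <think>...</think> followed by <answer>...</answer>, else 0.0.
    The repo's actual regex may be stricter or looser than this."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, completion, re.DOTALL) else 0.0
```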
I found the reason in this #198 |
I found that my original effective batch size was 2 (per_device_batch_size) * 3 (num_processes) / 6 (num_generations) * 2 (gradient_accumulation_steps) = 2 (hoping I am not missing any details). When I increase the
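Written out, the arithmetic above looks like this (a minimal sketch; the variable names are illustrative and do not correspond to actual trainer internals):

```python
# Rough sketch of the effective GRPO batch arithmetic described above.
per_device_batch_size = 2
num_processes = 3            # number of GPUs running training
num_generations = 6          # completions sampled per prompt (GRPO group size)
gradient_accumulation_steps = 2

# Each optimizer step sees this many prompts; each prompt expands into
# num_generations completions, so the completion count is num_generations times larger.
prompts_per_step = (per_device_batch_size * num_processes
                    // num_generations) * gradient_accumulation_steps
print(prompts_per_step)  # -> 2, matching the calculation in the comment
```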
@HarveyYi Does this mean that GRPO training should start from the original instruct model rather than the model distilled from DS? Has the distilled DS model lost some of its learning capability, or is it because they distilled on too much data (800,000 samples)?
I am still quite puzzled and am in the process of investigating the reasons. I noticed that you have successfully replicated the 1.5b model on your end. May I have a chat with you about it? |
Sure, but I only managed to get it to run through; by the end of training the accuracy still isn't very high. I'm not sure whether the 1.5B model is just too small or whether it simply wasn't trained well. Let me know how we can chat privately.
I feel this is also a kind of reward hacking; the model has probably learned that the outputs are all
When I try to use the new code to train GRPO, rewards/format_reward is not 0, but rewards/accuracy_reward becomes 0. I used 8×3090 GPUs without changing any code.
{'loss': 0.0, 'grad_norm': 0.7982587522423048, 'learning_rate': 1.5384615384615387e-06, 'rewards/accuracy_reward': 0.15982143627479672, 'rewards/format_reward': 0.4758928777649999, 'rewards/reasoning_steps_reward': 0.17351191807538272, 'rewards/cosine_scaled_reward': -0.09774131584854331, 'reward': 0.7114849153440446, 'reward_std': 0.5312667317688465, 'completion_length': 387.50359020233157, 'kl': 0.0001536548137664795, 'epoch': 0.01}
{'loss': 0.0002, 'grad_norm': 0.8481642609786243, 'learning_rate': 3.0769230769230774e-06, 'rewards/accuracy_reward': 0.1178571494296193, 'rewards/format_reward': 0.5544643074274063, 'rewards/reasoning_steps_reward': 0.1505952463950962, 'rewards/cosine_scaled_reward': -0.12882936951355078, 'reward': 0.6940873321145773, 'reward_std': 0.5247403308749199, 'completion_length': 378.24644565582275, 'kl': 0.0039509057998657225, 'epoch': 0.02}
{'loss': 0.0045, 'grad_norm': 1.8925111056766821, 'learning_rate': 4.615384615384616e-06, 'rewards/accuracy_reward': 0.06428571762517095, 'rewards/format_reward': 0.8571428976953029, 'rewards/reasoning_steps_reward': 0.05982143310829997, 'rewards/cosine_scaled_reward': -0.09459048047865508, 'reward': 0.8866595663130283, 'reward_std': 0.3431109061697498, 'completion_length': 190.53750896453857, 'kl': 0.11153717041015625, 'epoch': 0.02}
{'loss': 0.0038, 'grad_norm': 2.4973364177161543, 'learning_rate': 6.153846153846155e-06, 'rewards/accuracy_reward': 0.0973214341327548, 'rewards/format_reward': 0.7250000361353159, 'rewards/reasoning_steps_reward': 0.08363095866516232, 'rewards/cosine_scaled_reward': -0.06129880832741037, 'reward': 0.8446536116302014, 'reward_std': 0.4255119782872498, 'completion_length': 210.55715255737306, 'kl': 0.09504852294921876, 'epoch': 0.03}
{'loss': 0.0117, 'grad_norm': 1.8205454632746227, 'learning_rate': 7.692307692307694e-06, 'rewards/accuracy_reward': 0.06607143227010966, 'rewards/format_reward': 0.8241071775555611, 'rewards/reasoning_steps_reward': 0.057738099806010725, 'rewards/cosine_scaled_reward': -0.006089120984688634, 'reward': 0.9418275825679302, 'reward_std': 0.2382179622160038, 'completion_length': 127.70804142951965, 'kl': 0.29290771484375, 'epoch': 0.04}
{'loss': 0.0239, 'grad_norm': 2.3698283132576274, 'learning_rate': 9.230769230769232e-06, 'rewards/accuracy_reward': 0.03839285895228386, 'rewards/format_reward': 0.9375000268220901, 'rewards/reasoning_steps_reward': 0.01875000144354999, 'rewards/cosine_scaled_reward': 0.011117304899380542, 'reward': 1.0057602137327195, 'reward_std': 0.1362626419573644, 'completion_length': 56.87857400178909, 'kl': 0.59853515625, 'epoch': 0.05}
{'loss': 0.0493, 'grad_norm': 1.7423643912985216, 'learning_rate': 1.076923076923077e-05, 'rewards/accuracy_reward': 0.01696428647264838, 'rewards/format_reward': 0.9464285925030709, 'rewards/reasoning_steps_reward': 0.008035714691504835, 'rewards/cosine_scaled_reward': 0.009680203269817866, 'reward': 0.9811088144779205, 'reward_std': 0.08413466750880616, 'completion_length': 24.689286708831787, 'kl': 1.23203125, 'epoch': 0.05}
{'loss': 0.0679, 'grad_norm': 0.6228664363666383, 'learning_rate': 1.230769230769231e-05, 'rewards/accuracy_reward': 0.004464285913854837, 'rewards/format_reward': 0.9562500186264515, 'rewards/reasoning_steps_reward': 0.0047619051299989225, 'rewards/cosine_scaled_reward': -0.0007483152658096515, 'reward': 0.9647279314696788, 'reward_std': 0.051664601983452484, 'completion_length': 19.939286613464354, 'kl': 1.69814453125, 'epoch': 0.06}
{'loss': 0.0648, 'grad_norm': 3.5622746561040874, 'learning_rate': 1.3846153846153847e-05, 'rewards/accuracy_reward': 0.019642858020961284, 'rewards/format_reward': 0.9232143141329289, 'rewards/reasoning_steps_reward': 0.02738095410168171, 'rewards/cosine_scaled_reward': 0.0004586775445204694, 'reward': 0.9706968247890473, 'reward_std': 0.08530688291732531, 'completion_length': 46.170538139343265, 'kl': 1.6201171875, 'epoch': 0.07}
{'loss': 0.0853, 'grad_norm': 1.2804693499762392, 'learning_rate': 1.5384615384615387e-05, 'rewards/accuracy_reward': 0.021428572479635477, 'rewards/format_reward': 0.918750025331974, 'rewards/reasoning_steps_reward': 0.026488097058609127, 'rewards/cosine_scaled_reward': 0.004570821033848915, 'reward': 0.9712375432252884, 'reward_std': 0.09768198747647147, 'completion_length': 45.07946652173996, 'kl': 2.1318359375, 'epoch': 0.08}
{'loss': 0.0893, 'grad_norm': 0.5919450757201359, 'learning_rate': 1.6923076923076924e-05, 'rewards/accuracy_reward': 0.00357142873108387, 'rewards/format_reward': 0.980357152223587, 'rewards/reasoning_steps_reward': 0.004464285960420966, 'rewards/cosine_scaled_reward': -0.00047555512719554827, 'reward': 0.9879173591732979, 'reward_std': 0.025026805807931395, 'completion_length': 17.32410798072815, 'kl': 2.23115234375, 'epoch': 0.09}
{'loss': 0.1188, 'grad_norm': 0.3747100024898707, 'learning_rate': 1.8461538461538465e-05, 'rewards/accuracy_reward': 0.0, 'rewards/format_reward': 0.9982142865657806, 'rewards/reasoning_steps_reward': 0.0, 'rewards/cosine_scaled_reward': -0.001305083982879296, 'reward': 0.9969092696905136, 'reward_std': 0.0025804866171370124, 'completion_length': 10.323214709758759, 'kl': 2.96875, 'epoch': 0.09}
{'loss': 0.5144, 'grad_norm': 18.3113651332332, 'learning_rate': 2e-05, 'rewards/accuracy_reward': 0.0, 'rewards/format_reward': 0.8821428813040256, 'rewards/reasoning_steps_reward': 0.0, 'rewards/cosine_scaled_reward': -0.010261251570773311, 'reward': 0.8718816455453634, 'reward_std': 0.1483902873678858, 'completion_length': 135.39107729196547, 'kl': 12.84677734375, 'epoch': 0.1}
{'loss': 0.1232, 'grad_norm': 1.5027082937249927, 'learning_rate': 1.999634547413886e-05, 'rewards/accuracy_reward': 0.0, 'rewards/format_reward': 0.9241071708500386, 'rewards/reasoning_steps_reward': 0.0, 'rewards/cosine_scaled_reward': -0.009401893630274572, 'reward': 0.9147053003311157, 'reward_std': 0.1095106621832997, 'completion_length': 56.35803804397583, 'kl': 3.0787109375, 'epoch': 0.11}
{'loss': 0.1343, 'grad_norm': 0.5941749161412674, 'learning_rate': 1.9985384567667278e-05, 'rewards/accuracy_reward': 0.0, 'rewards/format_reward': 0.963392873108387, 'rewards/reasoning_steps_reward': 0.00029761907644569873, 'rewards/cosine_scaled_reward': -0.007446287681523245, 'reward': 0.9562442392110825, 'reward_std': 0.057977127504466354, 'completion_length': 17.40000056028366, 'kl': 3.3583984375, 'epoch': 0.12}
{'loss': 0.1231, 'grad_norm': 0.5663528276499906, 'learning_rate': 1.9967125291968495e-05, 'rewards/accuracy_reward': 0.0, 'rewards/format_reward': 0.9464285895228386, 'rewards/reasoning_steps_reward': 0.0005952381528913975, 'rewards/cosine_scaled_reward': -0.005292993299372028, 'reward': 0.9417308583855629, 'reward_std': 0.06546628615196823, 'completion_length': 27.321429586410524, 'kl': 3.0771484375, 'epoch': 0.12}
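For what it's worth, the pattern in these logs (completion_length collapsing to ~10-20 tokens while accuracy_reward goes to 0 and format_reward stays near 1) is consistent with the model dropping the answer content entirely. A minimal sketch of an answer-tag-based accuracy check illustrates why empty or truncated answers score 0 (illustrative only; the repo's actual reward does proper math verification rather than string comparison):

```python
import re

def accuracy_reward_sketch(completion: str, gold_answer: str) -> float:
    """Illustrative accuracy check: reward 1.0 only if an <answer> block exists
    and its content matches the gold answer after trivial normalization."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        # Very short completions with no <answer> block get 0, which is what
        # the logs above show once the model stops emitting answers.
        return 0.0
    prediction = match.group(1).strip()
    return 1.0 if prediction == gold_answer.strip() else 0.0
```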