Replace the base model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B to Qwen/Qwen2.5-1.5B-Instruct in GRPO #198

DVampire · 2025-02-05T20:48:37Z

I found that the distilled small model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B:

- Tends to frequently output words like `wait` and `alternatively`.
- Struggles to follow the `<think></think> <answer></answer>` format.

One possible reason is that distilling a small model from a large one may cause it to lose its ability to follow specific formats.

Therefore, I switched to Qwen/Qwen2.5-1.5B-Instruct and found that it adheres to the format well. The comparison of their format rewards is shown below. Hope this helps!
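As background for the metric being compared, a format reward of this kind can be sketched as a simple regex check on each completion. This is only an illustrative sketch: the names `FORMAT_PATTERN` and `format_reward` and the exact pattern are assumptions, not necessarily what this repository's GRPO script uses.

```python
import re

# Hypothetical pattern (an assumption, not the repo's exact one): the whole
# completion must begin with a <think>...</think> block and end with an
# <answer>...</answer> block, separated only by optional whitespace.
FORMAT_PATTERN = re.compile(
    r"<think>.*?</think>\s*<answer>.*?</answer>",
    re.DOTALL,  # let "." span newlines inside the reasoning and the answer
)


def format_reward(completions):
    """Return 1.0 for each completion that matches the layout, else 0.0."""
    return [
        1.0 if FORMAT_PATTERN.fullmatch(text.strip()) else 0.0
        for text in completions
    ]


samples = [
    "<think>2 + 2 is 4.</think> <answer>4</answer>",
    "Wait, alternatively the answer might be 4.",
]
print(format_reward(samples))  # [1.0, 0.0]
```

Averaged over a batch, a score like this gives the kind of per-step format reward curve that the two models are being compared on: a model that drifts into free-form text without the tags scores close to 0.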

Omni-FinAgent and others added 2 commits February 6, 2025 04:39

Replace the deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B to Qwen2.5-1.5B…

f8cda2e

…-Instruct

Merge branch 'huggingface:main' into main

49c68a6

HarveyYi mentioned this pull request Feb 10, 2025

'rewards/accuracy_reward': 0.0 #255

Open
