Benchmark for reasoning models #3532
Conversation
Please take a look @zhaochenyang20
@simveit What's the official score of this model from the LIMO team? Could we align with them?
@zhaochenyang20 Also, we should adjust the script to follow DeepSeek more closely:
I will closely follow this approach in the next update of this branch, which I intend to push in the next one or two days.
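For reference, the DeepSeek-R1 evaluation samples k responses per question at temperature 0.6 and top-p 0.95 and reports pass@1 as the average correctness over those samples. A minimal sketch of that estimator (names are illustrative, not this PR's actual code):

```python
from typing import List

def pass_at_1(correct_per_question: List[List[bool]]) -> float:
    """Estimate pass@1 as in the DeepSeek-R1 protocol: for each question,
    take the fraction of its k sampled answers that are correct, then
    average over questions."""
    per_question = [sum(samples) / len(samples) for samples in correct_per_question]
    return sum(per_question) / len(per_question)

# Example: 2 questions, k=4 samples each -> (0.75 + 0.25) / 2 = 0.5
print(pass_at_1([[True, False, True, True], [False, False, True, False]]))
```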
@simveit Thanks. Looking forward to it.
@zhaochenyang20 This PR adjusts the script to use the new evaluation approach suggested in the DeepSeek repo. It also makes it easy to evaluate on other datasets, which we use to benchmark on AIME 2024. I wonder if the discrepancy is due to:
I don't think you are wrong. I will ask Pengfei for help and see what he thinks.
@simveit Hey, we are discussing LIMO here. What is the AIME number about?
@zhaochenyang20 The 32.2% instead of 28.9% was for AIME 2024; the 28.9% comes from the DeepSeek-R1 repo for the Qwen 1.5B distill. I will evaluate on LIMO later today. This will take more time than AIME because for LIMO we evaluate on ~800 * 64 = 51,200 samples. For LIMO we don't have any reference, which is why I used AIME to check that the script gives a reasonable result.
Wow, we are officially better than DeepSeek 😂 Love to see your PR on both AIME and LIMO @simveit
Hi @zhaochenyang20, today I ran the benchmark on the LIMO dataset, this time with 8 tries for each question; the accuracy was marginally higher than with one try (see the updated README for the result). Maybe we can also rename the benchmark to benchmark_reasoning or something like that. Next step:
@zhaochenyang20 I have now integrated improved parsing and a benchmark for AIME 2025.
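A common way to make parsing robust on math benchmarks is to extract the last \boxed{...} expression from the completion, handling nested braces. A minimal sketch of that idea (the helper name is illustrative, and the PR's actual parsing may differ):

```python
from typing import Optional

def extract_boxed_answer(completion: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in the completion,
    handling nested braces; None if no boxed answer is found."""
    start = completion.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth = 1
    chars = []
    while i < len(completion) and depth > 0:
        ch = completion[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        if depth > 0:
            chars.append(ch)
        i += 1
    return "".join(chars) if depth == 0 else None

print(extract_boxed_answer(r"... so the answer is \boxed{\frac{1}{2}}."))  # \frac{1}{2}
```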
The code looks great to me. I modified the README a bit. Could you try to use the router instead of --dp-size 4?
I don't understand; I used the router in a one-node setting. Do you mean launching the runtime and the router separately, as sketched below?
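A hedged sketch of the separate-launch pattern (the model path and ports are illustrative, and the sglang_router entry point and flags are assumptions based on the sglang-router docs, not taken from this PR):

```bash
# Two workers on separate GPUs (model path and ports are illustrative).
CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server --model-path $MODEL --port 30000 &
CUDA_VISIBLE_DEVICES=1 python -m sglang.launch_server --model-path $MODEL --port 30001 &

# Router in front of the two workers.
python -m sglang_router.launch_router \
    --worker-urls http://localhost:30000 http://localhost:30001
```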
Oh, sorry, I took it wrong. You are right.
We will merge it today!
Great!
Motivation
To evaluate reasoning models, it makes sense to use difficult questions. This benchmark evaluates on the LIMO dataset.
The Qwen 1.5B distill achieves 47% pass@1 accuracy.
Modifications
A script to benchmark on LIMO.
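As a rough sanity check on the dataset size discussed above, LIMO can be loaded from the Hugging Face Hub; a minimal sketch (the GAIR/LIMO dataset id is an assumption, and the split and field names may differ):

```python
from datasets import load_dataset  # pip install datasets

# Assumption: LIMO is published as GAIR/LIMO on the Hugging Face Hub.
limo = load_dataset("GAIR/LIMO", split="train")
print(len(limo))       # roughly 800 questions, matching the ~800 * 64 sample count above
print(limo[0].keys())  # inspect the fields before wiring them into the benchmark script
```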
Checklist