
Benchmark for reasoning models #3532

Merged · 14 commits from feature/limo-benchmark into sgl-project:main · Feb 16, 2025
Conversation

@simveit (Contributor) commented Feb 12, 2025

Motivation

To evaluate reasoning models it makes sense to use difficult questions. This benchmark evaluates on the LIMO dataset.
The Qwen 1.5B distill achieves 47% accuracy pass@1.

Modifications

A script to benchmark on LIMO.

Checklist

@simveit changed the title from "Feature/limo benchmark" to "Benchmark for LIMO" on Feb 12, 2025
@simveit (Contributor, Author) commented Feb 12, 2025

Please take a look @zhaochenyang20.
I think before merging we should further refine the parsing of the answer, and maybe also report majority voting, as is commonly done for this kind of benchmark (see the sketch below).
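Below is a minimal sketch of what majority voting (self-consistency) over the sampled answers could look like; the function name and surrounding structure are illustrative placeholders, not the code in this PR:

```python
from collections import Counter


def majority_vote(answers: list[str]) -> str | None:
    """Return the most common extracted answer among the sampled responses.

    `answers` holds one parsed final answer per sampled generation
    (None for samples where no answer could be extracted); ties are
    broken by whichever answer was seen first.
    """
    filtered = [a for a in answers if a is not None]
    if not filtered:
        return None
    return Counter(filtered).most_common(1)[0][0]


# Example: 64 sampled answers for one question, majority answer is "42".
# predicted = majority_vote(["42", "41", "42", ...])
```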

@zhaochenyang20 (Collaborator) commented
@simveit What's the official score for this model from the LIMO team? Could we align with them?

@simveit (Contributor, Author) commented Feb 13, 2025

@zhaochenyang20
I don't think we have reference results for LIMO from them, because they used this dataset for training, not evaluation. But maybe someone else ran such a benchmark that I am not aware of. Maybe we can ask them whether they did such an evaluation internally?

Also, we should adjust the script to follow DeepSeek more closely:

For all our models, the maximum generation length is set to 32,768 tokens. For benchmarks requiring sampling, we use a temperature of 0.6, a top-p value of 0.95, and generate 64 responses per query to estimate pass@1.

I will closely follow this approach in the next update of this branch, which I intend to push in the next one or two days.
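For reference, a minimal sketch of that DeepSeek-style setup (temperature 0.6, top-p 0.95, 64 samples per query, 32,768-token generation limit) with pass@1 estimated as the mean correctness over the samples. It assumes the sglang server is reachable through its OpenAI-compatible endpoint on port 30000 (the model path matches the launch command used later in this thread), and `extract_answer` is a placeholder for the parsing function:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

# Sampling parameters from the DeepSeek-R1 evaluation description quoted above.
SAMPLING = dict(temperature=0.6, top_p=0.95, max_tokens=32768, n=64)


def pass_at_1(question: str, reference: str, extract_answer) -> float:
    """Estimate pass@1 for one question as the fraction of correct samples."""
    resp = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
        messages=[{"role": "user", "content": question}],
        **SAMPLING,
    )
    answers = [extract_answer(choice.message.content) for choice in resp.choices]
    return sum(a == reference for a in answers) / len(answers)
```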

@zhaochenyang20 (Collaborator) commented
@simveit Thanks. Look forward to it.

@simveit (Contributor, Author) commented Feb 14, 2025

@zhaochenyang20 this PR now adjusts the script to use the evaluation approach suggested in the DeepSeek repo:

For all our models, the maximum generation length is set to 32,768 tokens. For benchmarks requiring sampling, we use a temperature of 0.6, a top-p value of 0.95, and generate 64 responses per query to estimate pass@1.

We also make it easy to evaluate on other datasets and use that to benchmark on AIME 2024.
The result is somewhat surprising: we get 32.2% instead of the 28.9% reported in the repo.

I wonder if the discrepancy is due to:

  • the prompt suffix "\nPlease reason step by step, and put your final answer within \boxed{}.", which is commonly used for the DeepSeek Math models and is also recommended in the DeepSeek-R1 repo (see the sketch after this list)
  • maybe the reported result is at temperature 0
  • maybe something is wrong in the way I evaluate
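To make the first two hypotheses concrete, here is a small sketch (illustrative names, not the script's actual code) of how the suffix is appended and how a temperature-0 re-run could be configured for comparison:

```python
# Suffix commonly used for DeepSeek Math / DeepSeek-R1 style evaluation.
SUFFIX = "\nPlease reason step by step, and put your final answer within \\boxed{}."


def build_prompt(question: str, use_suffix: bool = True) -> str:
    """Append the reasoning/boxed-answer suffix to the raw question."""
    return question + SUFFIX if use_suffix else question


# Hypothesis 2: rerun the same questions greedily instead of with sampling.
GREEDY_SAMPLING = dict(temperature=0.0, top_p=1.0, max_tokens=32768, n=1)
```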

@zhaochenyang20 (Collaborator) commented
I don't think you are wrong. I will ask Pengfei for help and see what he thinks.

@zhaochenyang20 (Collaborator) commented
@simveit Hey, we are discussing LIMO here. What is the AIME number about?

@simveit (Contributor, Author) commented Feb 15, 2025

@zhaochenyang20 The 32.2% instead of 28.9% was for AIME 2024. The 28.9% is the number from the DeepSeek-R1 repo for the Qwen 1.5B distill.

I will evaluate on LIMO later today. This will take more time than AIME because for LIMO we will evaluate on ~800 × 64 = 51,200 samples. For LIMO we don't have any reference, that's why I used AIME to check that the script gives a reasonable result.

@zhaochenyang20 (Collaborator) commented
Wow. We are better than DeepSeek's official number 😂 Love to see your PR cover both AIME and LIMO @simveit

@simveit (Contributor, Author) commented Feb 15, 2025

Hi @zhaochenyang20, today I ran the benchmark on the LIMO dataset, this time with 8 tries for each question; the accuracy was marginally higher than with one try (see the updated README for the result).
Maybe we can also rename the benchmark to benchmark_reasoning or something like that.
I believe it is generally suitable for any question/answer dataset with an integer answer on which we want to benchmark DeepSeek. WDYT?

Next step:

  • Study the parsing functions from the DeepSeek Math repo and see if we can use them to make the parsing more robust and possibly also evaluate effectively on datasets with non-integer answers, for example $\frac{\pi}{2}$ (see the sketch below).
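As a rough illustration of that direction (not the DeepSeek Math implementation itself), a brace-aware extractor that keeps nested expressions such as $\frac{\pi}{2}$ intact could look like this:

```python
def extract_boxed(text: str) -> str | None:
    """Return the content of the last \\boxed{...} in `text`, handling nested braces."""
    marker = r"\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1
    content = []
    while i < len(text) and depth > 0:
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                break
        content.append(ch)
        i += 1
    return "".join(content) if depth == 0 else None


# extract_boxed(r"... the answer is \boxed{\frac{\pi}{2}}.")  ->  r"\frac{\pi}{2}"
```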

@simveit changed the title from "Benchmark for LIMO" to "Benchmark for reasoning models" on Feb 16, 2025
@simveit (Contributor, Author) commented Feb 16, 2025

@zhaochenyang20 I have now integrated the improved parsing and a benchmark for AIME 2025.
I think this is close to merge.

@zhaochenyang20 marked this pull request as ready for review on February 16, 2025 18:31
@zhaochenyang20 (Collaborator) left a review comment

The code looks great to me. I modified the README a bit. Could you try to use the router instead of --dp-size 4?

@simveit (Contributor, Author) commented Feb 16, 2025

I don't understand. I used the router in a one-node setting:

python3 -m sglang_router.launch_server --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --port 30000 --dp-size 4

Do you mean launching the runtime and the router separately?

@zhaochenyang20 (Collaborator) commented
Oh, sorry, I took it wrong. You are right.

@zhaochenyang20 (Collaborator) commented
We will merge it today!

@simveit (Contributor, Author) commented Feb 16, 2025

great

@zhyncs merged commit 3d4a8f9 into sgl-project:main on Feb 16, 2025
1 check passed
@simveit deleted the feature/limo-benchmark branch on February 17, 2025 19:07