
Benchmark for reasoning models #3532

Merged · 14 commits from feature/limo-benchmark into sgl-project:main · Feb 16, 2025
Conversation

@simveit (Contributor) commented Feb 12, 2025

Motivation

To evaluate reasoning models it makes sense to use difficult questions. This benchmark evaluates on the LIMO dataset.
The Qwen 1.5B distill achieves 47% accuracy pass@1.

Modifications

A script to benchmark on LIMO.

Checklist

@simveit changed the title from "Feature/limo benchmark" to "Benchmark for LIMO" on Feb 12, 2025
@simveit (Contributor, Author) commented Feb 12, 2025

Please take a look @zhaochenyang20.
I think before merging we should further refine the parsing of the answer, and maybe also report majority voting, as is commonly done for this kind of benchmark (see the sketch below).
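Below is a minimal sketch of what majority voting (self-consistency) over the sampled answers could look like; the function name and surrounding structure are illustrative placeholders, not the code in this PR:

```python
from collections import Counter


def majority_vote(answers: list[str]) -> str | None:
    """Return the most common extracted answer among the sampled responses.

    `answers` holds one parsed final answer per sampled generation
    (None for samples where no answer could be extracted); ties are
    broken by whichever answer was seen first.
    """
    filtered = [a for a in answers if a is not None]
    if not filtered:
        return None
    return Counter(filtered).most_common(1)[0][0]


# Example: 64 sampled answers for one question, majority answer is "42".
# predicted = majority_vote(["42", "41", "42", ...])
```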

@zhaochenyang20 (Collaborator) commented
@simveit What's the official score for this model from the LIMO team? Could we align with them?

@simveit (Contributor, Author) commented Feb 13, 2025

@zhaochenyang20
I don't think we have reference results for LIMO from them, because they used this dataset for training, not evaluation. But maybe someone else ran such a benchmark that I am not aware of. Maybe we can ask them whether they did such an evaluation internally?

Also, we should adjust the script to follow DeepSeek more closely:

For all our models, the maximum generation length is set to 32,768 tokens. For benchmarks requiring sampling, we use a temperature of 0.6, a top-p value of 0.95, and generate 64 responses per query to estimate pass@1.

I will closely follow this approach in the next update of this branch, which I intend to push in the next one or two days.
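For reference, a minimal sketch of that DeepSeek-style setup (temperature 0.6, top-p 0.95, 64 samples per query, 32,768-token generation limit) with pass@1 estimated as the mean correctness over the samples. It assumes the sglang server is reachable through its OpenAI-compatible endpoint on port 30000 (the model path matches the launch command used later in this thread), and `extract_answer` is a placeholder for the parsing function:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

# Sampling parameters from the DeepSeek-R1 evaluation description quoted above.
SAMPLING = dict(temperature=0.6, top_p=0.95, max_tokens=32768, n=64)


def pass_at_1(question: str, reference: str, extract_answer) -> float:
    """Estimate pass@1 for one question as the fraction of correct samples."""
    resp = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
        messages=[{"role": "user", "content": question}],
        **SAMPLING,
    )
    answers = [extract_answer(choice.message.content) for choice in resp.choices]
    return sum(a == reference for a in answers) / len(answers)
```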

@zhaochenyang20 (Collaborator) commented
@simveit Thanks. Look forward to it.

@simveit (Contributor, Author) commented Feb 14, 2025

@zhaochenyang20 this PR now adjusts the script to use the evaluation approach suggested in the DeepSeek repo:

For all our models, the maximum generation length is set to 32,768 tokens. For benchmarks requiring sampling, we use a temperature of 0.6, a top-p value of 0.95, and generate 64 responses per query to estimate pass@1.

We also make it easy to evaluate on other datasets and use that to benchmark on AIME 2024.
The result is somewhat surprising: we get 32.2% instead of the 28.9% reported in the repo.

I wonder if the discrepancy is due to:

  • the prompt suffix "\nPlease reason step by step, and put your final answer within \boxed{}.", which is commonly used for the DeepSeek Math models and is also recommended in the DeepSeek-R1 repo (see the sketch after this list)
  • maybe the reported result is at temperature 0
  • maybe something is wrong in the way I evaluate
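To make the first two hypotheses concrete, here is a small sketch (illustrative names, not the script's actual code) of how the suffix is appended and how a temperature-0 re-run could be configured for comparison:

```python
# Suffix commonly used for DeepSeek Math / DeepSeek-R1 style evaluation.
SUFFIX = "\nPlease reason step by step, and put your final answer within \\boxed{}."


def build_prompt(question: str, use_suffix: bool = True) -> str:
    """Append the reasoning/boxed-answer suffix to the raw question."""
    return question + SUFFIX if use_suffix else question


# Hypothesis 2: rerun the same questions greedily instead of with sampling.
GREEDY_SAMPLING = dict(temperature=0.0, top_p=1.0, max_tokens=32768, n=1)
```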

@zhaochenyang20 (Collaborator) commented
I don't think you are wrong. I will ask Pengfei for help and see what he thinks.

@zhaochenyang20 (Collaborator) commented
@simveit Hey, we are discussing LIMO here. What is the AIME number about?

@simveit (Contributor, Author) commented Feb 15, 2025

@zhaochenyang20 The 32.2% instead of 28.9% was for AIME 2024. The 28.9% is the number from the DeepSeek-R1 repo for the Qwen 1.5B distill.

I will evaluate on LIMO later today. This will take more time than AIME because for LIMO we will evaluate on ~800 × 64 = 51,200 samples. For LIMO we don't have any reference, that's why I used AIME to check that the script gives a reasonable result.

@zhaochenyang20 (Collaborator) commented
Wow. We are better than DeepSeek's official number 😂 Love to see your PR cover both AIME and LIMO @simveit

@simveit (Contributor, Author) commented Feb 15, 2025

Hi @zhaochenyang20, today I ran the benchmark on the LIMO dataset, this time with 8 tries for each question; the accuracy was marginally higher than with one try (see the updated README for the result).
Maybe we can also rename the benchmark to benchmark_reasoning or something like that.
I believe it is generally suitable for any question/answer dataset with an integer answer on which we want to benchmark DeepSeek. WDYT?

Next step:

  • Study the parsing functions from the DeepSeek Math repo and see if we can use them to make the parsing more robust and possibly also evaluate effectively on datasets with non-integer answers, for example $\frac{\pi}{2}$ (see the sketch below).
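As a rough illustration of that direction (not the DeepSeek Math implementation itself), a brace-aware extractor that keeps nested expressions such as $\frac{\pi}{2}$ intact could look like this:

```python
def extract_boxed(text: str) -> str | None:
    """Return the content of the last \\boxed{...} in `text`, handling nested braces."""
    marker = r"\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1
    content = []
    while i < len(text) and depth > 0:
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                break
        content.append(ch)
        i += 1
    return "".join(content) if depth == 0 else None


# extract_boxed(r"... the answer is \boxed{\frac{\pi}{2}}.")  ->  r"\frac{\pi}{2}"
```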

@simveit changed the title from "Benchmark for LIMO" to "Benchmark for reasoning models" on Feb 16, 2025
@simveit (Contributor, Author) commented Feb 16, 2025

@zhaochenyang20 I have now integrated the improved parsing and a benchmark for AIME 2025.
I think this is close to merge.

@zhaochenyang20 marked this pull request as ready for review on February 16, 2025 18:31
@zhaochenyang20 (Collaborator) left a review comment

The code looks great to me. I modified the README a bit. Could you try to use the router instead of --dp-size 4?

@simveit (Contributor, Author) commented Feb 16, 2025

I don't understand. I used the router in a one-node setting:

python3 -m sglang_router.launch_server --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --port 30000 --dp-size 4

Do you mean launching the runtime and the router separately?

@zhaochenyang20 (Collaborator) commented
Oh, sorry, I took it wrong. You are right.

@zhaochenyang20 (Collaborator) commented
We will merge it today!

@simveit (Contributor, Author) commented Feb 16, 2025

great

@zhyncs merged commit 3d4a8f9 into sgl-project:main on Feb 16, 2025
1 check passed
@simveit deleted the feature/limo-benchmark branch on February 17, 2025 19:07