Test Claude 3.7 Sonnet #6
Hi, we will evaluate Claude 3.7 for the upcoming competitions. As we don't evaluate models retroactively on past competitions that happened before their release, we will not go back and evaluate it on AIME'25 and HMMT.
Why? I don't think Claude could have been trained on a competition released in February.
It's indeed very unlikely, but not entirely impossible. While we believe it is still valuable to evaluate Claude-3.7 on these new benchmarks, we want to avoid any discussion about potential contamination (even just tuning some final hyperparameters). Therefore, we won't publish any results on our website for this model. The next competition is coming up in two weeks, so we will know then how well it performs on the newest math competition :)
OK, I guess that's the mathematical spirit: never assume something unlikely is impossible. Yes, we will see. I guess I could also run the test myself (; Thanks for your amazing work, BTW.
Thanks!
Hi, we just discussed this issue with everyone on the team, and given that lots of people keep asking about these models, we will also evaluate them on the older competitions, but with an asterisk or an indication that they were released after the benchmark.
Yes, that's great. You could also let users choose whether they want to enable it or not.
The results for Claude-3.7 Thinking and QwQ are now online.
Can you add GPT-4.5 and 3.7 non-thinking? @JasperDekoninck
Thanks! QwQ 32B's performance is mind-boggling for its size. @baskargopinath GPT-4.5 is so expensive; Aider paid $190 for their benchmark.
In addition to @I-I-IT's comment on the price, we also prefer to focus our evaluation on reasoning models, since these are the only models that seem to achieve good accuracy on MathArena. The inclusion of some non-reasoning models in the beginning was mainly to show the impressive jump between reasoning and non-reasoning models. Further increasing the number of models to evaluate by adding two models we know will perform quite badly does not seem ideal.
I understand, just curious since 3.7 and 4.5 are SOTA for non-reasoning models and wanted to see how they compare.
Hi, I was wondering if you are going to test the new hybrid model from Anthropic?