
Test Claude 3.7 Sonnet #6

Open · I-I-IT opened this issue Mar 1, 2025 · 12 comments
@I-I-IT

I-I-IT commented Mar 1, 2025

Hi, I was wondering if you are going to test the new hybrid model from Anthropic?

@JasperDekoninck
Collaborator

Hi, we will evaluate Claude 3.7 for the upcoming competitions. As we don't evaluate models retroactively on past competitions that happened before their release, we will not go back and evaluate it on AIME'25 and HMMT.

@JasperDekoninck JasperDekoninck self-assigned this Mar 1, 2025
@I-I-IT
Author

I-I-IT commented Mar 2, 2025

> Hi, we will evaluate Claude 3.7 for the upcoming competitions. As we don't evaluate models retroactively on past competitions that happened before their release, we will not go back and evaluate it on AIME'25 and HMMT.

Why? I don't think Claude could have trained on a competition released in February.

@JasperDekoninck
Collaborator

It's indeed very unlikely, but not entirely impossible. While we believe it is still valuable to evaluate Claude-3.7 on these new benchmarks, we want to avoid any discussion about potential contamination (even just tuning some final hyperparameters). Therefore, we won't publish any results for this model on our website. The next competition is coming up in two weeks, so we will know then how well it performs on the newest math competition :)

@I-I-IT
Author

I-I-IT commented Mar 3, 2025

OK, I guess that's the mathematical spirit: never treat something unlikely as impossible.

Yes, we will see. I guess I could also run the test myself (;
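
Something along these lines, perhaps, assuming the Anthropic Python SDK (the model ID, token budgets, and toy problem below are placeholders, not MathArena's actual harness):

```python
# Rough sketch: ask Claude 3.7 Sonnet (extended thinking) a single competition-style
# problem and check the final answer. Problem text and expected answer are placeholders.
import anthropic

client = anthropic.Anthropic()  # picks up ANTHROPIC_API_KEY from the environment

problem = "Compute the remainder when 2^10 is divided by 7."  # placeholder problem
expected = "2"                                                 # 1024 mod 7 = 2

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{"role": "user", "content": problem + "\nGive only the final answer."}],
)

# With extended thinking enabled, the response mixes thinking blocks and text blocks;
# only the text blocks contain the visible answer.
answer = "".join(block.text for block in response.content if block.type == "text")
print("model answer:", answer.strip(), "| correct:", expected in answer)
```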

Thanks for your amazing work BTW.

@JasperDekoninck
Collaborator

Thanks!

@JasperDekoninck
Collaborator

Hi,

We just discussed this issue with the whole team. Given that many people keep asking about these models, we will also evaluate them on the older competitions, but with an asterisk or other indication that they were released after the competition took place.

@I-I-IT
Author

I-I-IT commented Mar 6, 2025

Yes, that's great. You could also let users choose whether they want to show those results or not.

@JasperDekoninck
Collaborator

The results for Claude-3.7 Thinking and QwQ are now online.

@baskargopinath

baskargopinath commented Mar 8, 2025

Can you add GPT-4.5 and Claude 3.7 non-thinking? @JasperDekoninck

@I-I-IT
Author

I-I-IT commented Mar 9, 2025

Thanks! QwQ 32B's performance is mind-boggling for its size.

@baskargopinath GPT-4.5 is so expensive; Aider paid $190 to run it on their benchmark.

@JasperDekoninck
Collaborator

In addition to @I-I-IT's comment on the price, we also prefer to focus our evaluation on reasoning models, since these are the only models that seem to achieve good accuracy on MathArena. The inclusion of some non-reasoning models in the beginning was mainly to show the impressive jump between reasoning and non-reasoning models. Further increasing the number of models to evaluate by adding two models we know will perform quite badly does not seem ideal.

@baskargopinath

I understand, just curious since 3.7 and 4.5 are SOTA for non-reasoning models and I wanted to see how they compare.
