Test Claude 3.7 Sonnet #6
Hi, we will evaluate Claude 3.7 for the upcoming competitions. As we don't evaluate models retroactively on past competitions that happened before their release, we will not go back and evaluate it on AIME'25 and HMMT.
Why? I don't think Claude could have been trained on a competition released in February.
It's indeed very unlikely, but not entirely impossible. While we believe it is still valuable to evaluate Claude-3.7 on these new benchmarks, we want to avoid any discussion about potential contamination (even just tuning some final hyperparameters). Therefore, we won't publish any results on our website for this model. The next competition is coming up in two weeks, so we will know then how well it performs on the newest math competition :)
OK, I guess that's the mathematical spirit: never assume something unlikely is impossible. Yes, we will see. I guess I could also run the test myself (; Thanks for your amazing work, BTW.
Thanks!
Hi, we just discussed this issue with everyone on the team, and given that lots of people keep asking about these models, we will also evaluate them on the older competitions, but with an asterisk or an indication that they were released after the benchmark.
Yes, that's great. You could also let users choose whether they want to enable it or not.
The results for Claude-3.7 Thinking and QwQ are now online.
Can you add GPT-4.5 and 3.7 non-thinking? @JasperDekoninck
Thanks! QwQ 32B's performance is mind-boggling for its size. @baskargopinath GPT-4.5 is so expensive; Aider paid $190 for their benchmark.
In addition to @I-I-IT's comment on the price, we also prefer to focus our evaluation on reasoning models, since these are the only models that seem to achieve good accuracy on MathArena. The inclusion of some non-reasoning models in the beginning was mainly to show the impressive jump between reasoning and non-reasoning models. Further increasing the number of models to evaluate by adding two models we know will perform quite badly does not seem ideal.
I understand, just curious since 3.7 and 4.5 are SOTA for non-reasoning models and wanted to see how they compare.
Hi, I was wondering if you are going to test the new hybrid model from Anthropic?