Pause the test when not enough API credits #9

Open
I-I-IT opened this issue Mar 12, 2025 · 1 comment

@I-I-IT

I-I-IT commented Mar 12, 2025

Last week, I tried benchmarking 3.7 Sonnet Thinking (before you evaluated it) and quickly ran out of credits. matharena recognized the error, but it still wanted to continue running the benchmark. IMO, when such an error occurs, it should either exit the test, mentioning that the existing results will be saved and that the user can resume the test with the skip-existing flag.

Or it could pause and ask the user to type Yes to resume, so the user can add credits first and then continue.
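To make the second option concrete, here is a rough sketch of what I mean. It is not taken from the MathArena code: `pause_until_confirmed`, its parameters, and the prompt text are all hypothetical names for illustration.

```python
def pause_until_confirmed(ask=input, max_prompts=None):
    """Block until the user answers; return True to resume, False to exit.

    Hypothetical helper, not part of MathArena. `ask` is injectable so the
    function can be exercised without a real terminal.
    """
    prompts = 0
    while max_prompts is None or prompts < max_prompts:
        answer = ask(
            "API credits exhausted. Add credits, then type Yes to resume "
            "(No to exit): "
        )
        answer = answer.strip().lower()
        if answer in ("yes", "y"):
            return True
        if answer in ("no", "n"):
            return False
        # Anything else: ask again.
        prompts += 1
    # Gave up after max_prompts unrecognised answers.
    return False
```

The caller would save the results collected so far before prompting, so nothing is lost if the user chooses to exit.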

@JasperDekoninck
Collaborator

Hi! Thanks for the feedback. The default behavior of our querier is to retry every minute if an error occurs (APIs tend to return quite a few errors, for example if you get disconnected for a moment). After 50 errors for a sample, it gives up on that sample and continues to the next one, storing an empty model output for the sample.

If you run the same command afterwards with the flag --skip-existing it will automatically rerun all samples that were not stored, including those samples for which we find an empty output (so those that got to 50 errors).
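Roughly, the behavior described above can be sketched as follows. This is a simplified illustration rather than the actual querier code: the function names, the `stored` dictionary, and the choice not to sleep after the final failure are assumptions; only the one-minute delay, the limit of 50 errors, and the empty output come from the description above.

```python
import time

MAX_ERRORS = 50           # errors tolerated per sample before giving up
RETRY_DELAY_SECONDS = 60  # wait one minute between attempts


def query_with_retries(call_api, sample, max_errors=MAX_ERRORS,
                       delay=RETRY_DELAY_SECONDS, sleep=time.sleep):
    """Call the API for one sample, retrying on any error.

    Returns the model output, or an empty string after `max_errors` failures.
    `call_api` is a hypothetical callable standing in for the real API call.
    """
    for attempt in range(max_errors):
        try:
            return call_api(sample)
        except Exception:
            # Assumption: no need to wait after the last failed attempt.
            if attempt < max_errors - 1:
                sleep(delay)
    return ""


def samples_to_run(samples, stored, skip_existing):
    """Pick the samples to (re)run.

    With skip_existing, a sample is rerun if it has no stored output or if
    its stored output is empty (i.e. it previously hit the error limit).
    """
    if not skip_existing:
        return list(samples)
    return [s for s in samples if not stored.get(s)]
```

So a run that hit the credit limit leaves empty outputs behind, and a second run with the flag only queries those samples again.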

If I'm not mistaken, this is the behavior you are asking for (or at least part of it), but it is not documented in the README (as it should be). Is this correct? The only thing this does not cover is your request that MathArena recognise the error itself, but that is quite error-prone (every API uses different messages to indicate this, might change them, ...), and I don't think it's a big issue as long as we document the --skip-existing flag properly.
