Pause the test when not enough API credits #9

I-I-IT · 2025-03-12T13:16:00Z

Last week I tried benchmarking 3.7 Sonnet Thinking (before you evaluated it) and quickly ran out of credits. MathArena recognized the error, but it still tried to continue running the benchmark. IMO, when such an error occurs, it should exit the test while mentioning that the existing results will be saved and that the user can resume the test with the skip-existing flag.

Or you could pause the test and ask the user to type Yes to resume, so that the user can add credits first and then continue.
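A minimal sketch of the pause-and-confirm idea, for illustration only. The function name `confirm_resume`, the prompt text, and the injectable `ask` parameter are hypothetical and not part of MathArena:

```python
from typing import Callable


def confirm_resume(
    prompt: str = "API call failed. Add credits, then type Yes to resume: ",
    ask: Callable[[str], str] = input,
) -> bool:
    """Return True only if the user explicitly answers "yes" (case-insensitive).

    `ask` defaults to the built-in input(); it is a parameter so the
    function can be exercised without a terminal.
    """
    answer = ask(prompt)
    return answer.strip().lower() == "yes"
```

A benchmark loop could call this after a failed request and exit (keeping the results saved so far) when it returns False.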

JasperDekoninck · 2025-03-12T16:50:03Z

Hi! Thanks for the feedback. By default, our querier retries every minute whenever an error occurs (APIs tend to return quite a few errors, for example if you get disconnected for a moment). After 50 errors on a sample, it gives up on that sample and continues to the next one, recording an empty model output for it.
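The retry behavior described above can be sketched roughly as follows. This is not MathArena's actual querier code; `query_with_retries`, `call_api`, and the parameter names are hypothetical, and only the numbers (a one-minute wait, 50 errors, an empty output on giving up) come from the description:

```python
import time
from typing import Callable


def query_with_retries(
    call_api: Callable[[], str],
    max_errors: int = 50,          # give up on the sample after this many errors
    wait_seconds: float = 60.0,    # wait one minute between attempts
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call `call_api` until it succeeds or `max_errors` errors have occurred.

    Returns the model output on success, or an empty string once the
    error limit is reached, so the run can continue with the next sample.
    """
    errors = 0
    while errors < max_errors:
        try:
            return call_api()
        except Exception:
            errors += 1
            if errors < max_errors:
                sleep(wait_seconds)
    return ""
```

Note that a sketch like this treats every error the same way, so an out-of-credits error is retried just like a brief disconnect.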

If you run the same command afterwards with the flag --skip-existing, it will automatically rerun all samples that were not stored, including those for which we find an empty output (that is, those that reached 50 errors).
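As a rough illustration of that selection rule, assuming stored outputs are available as a mapping from sample ID to output text (the function `samples_to_run` and this data layout are hypothetical, not MathArena's actual storage format):

```python
from typing import Dict, Iterable, List


def samples_to_run(
    sample_ids: Iterable[str],
    stored_outputs: Dict[str, str],
    skip_existing: bool,
) -> List[str]:
    """Pick the samples that still need to be queried.

    Without skip_existing, every sample is run. With it, a sample is
    skipped only if a non-empty output is already stored; missing and
    empty outputs are both rerun.
    """
    if not skip_existing:
        return list(sample_ids)
    return [s for s in sample_ids if not stored_outputs.get(s)]
```

For example, with outputs stored for `"a"` (non-empty) and `"b"` (empty) and nothing stored for `"c"`, only `"b"` and `"c"` would be rerun.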

If I'm not mistaken, this is the behavior you are asking for (at least partially), although it is not documented in the README, as it should be. Is this correct? The only thing it does not cover is your request that MathArena recognise the error, but that is quite error-prone (every API uses different messages to indicate this and might change them, ...), and I don't think it's a big issue provided that we document the --skip-existing flag properly.
