Feature Request + eval notes: Cross-Model Validation for Pass@N #2413

Open
e271828- opened this issue Nov 21, 2024 · 2 comments
Labels: question (Further information is requested)

Comments

@e271828-

Issue

In our internal evals, we have seen a substantial improvement from alternating models whenever the oracle (the test) reports a failure. In other words:

  1. Call sonnet with the prompt.
  2. Sonnet produces code that fails the test.
  3. Send (test, test output, sonnet's code) to 4o.
  4. If the test fails again, send (4o's revised code, test, test output) back to sonnet.
  5. Repeat until the test passes.

This converges much more reliably and in fewer steps than using just sonnet in this loop, or using 4o as the initial generator.
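
For concreteness, here is a minimal sketch of the loop, with `call_model()` and `run_tests()` as illustrative placeholders for the LLM client and test harness (they are not aider APIs). The important detail is that the full message chain, including every failing test output, is carried across both models:

```python
# Minimal sketch of the alternating-model repair loop described above.
# `call_model()` and `run_tests()` are placeholders for whatever LLM
# client and test harness you use -- they are not aider APIs.

MODELS = ["claude-3-5-sonnet", "gpt-4o"]  # alternate between these
MAX_ATTEMPTS = 6

def call_model(model: str, messages: list[dict]) -> str:
    """Placeholder: send the chat `messages` to `model`, return candidate code."""
    raise NotImplementedError

def run_tests(code: str) -> tuple[bool, str]:
    """Placeholder: run the nominated test against `code`, return (passed, output)."""
    raise NotImplementedError

def alternating_repair(prompt: str, test: str) -> str | None:
    # One shared context: every candidate and every failing test output is
    # appended here and sent to *both* models, rather than starting fresh.
    messages = [{"role": "user", "content": f"{prompt}\n\nTest:\n{test}"}]

    for attempt in range(MAX_ATTEMPTS):
        model = MODELS[attempt % len(MODELS)]  # sonnet, 4o, sonnet, 4o, ...
        code = call_model(model, messages)
        passed, output = run_tests(code)
        if passed:
            return code

        # Carry the whole chain forward: the failing candidate plus the test
        # output become context for the *other* model on the next attempt.
        messages.append({"role": "assistant", "content": code})
        messages.append({
            "role": "user",
            "content": f"The test still fails.\n\nTest output:\n{output}\n\nPlease fix the code.",
        })

    return None  # did not converge within MAX_ATTEMPTS
```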

Version and model info

No response

@paul-gauthier
Copy link
Collaborator

Thanks for trying aider and filing this issue.

Yes, aider did this for SWE Bench.

https://aider.chat/2024/05/22/swe-bench-lite.html#benchmark-methodology

Interactively, if the response is bad, you can do /undo, then /model other-model, and use the up-arrow to re-send your request.

@e271828-
Author

e271828- commented Nov 21, 2024

Thanks for your work on aider!

https://aider.chat/2024/05/22/swe-bench-lite.html#benchmark-methodology

> ...
> If the solution isn’t plausible, the harness launches aider to try again from scratch, alternating between using aider with GPT-4o and Opus.

This sounds different from our method: we keep appending outputs to the context, rather than starting fresh, and we send the entire result chain between models.

In our evals, this works much better than starting from scratch. More questions converge with the extra context than with a clean state. The test output is also important context; many questions do not converge without it.

I took a quick look at https://github.com/Aider-AI/aider-swe-bench/blob/main/harness.py to confirm what you did, but it only has one model defined, so I'm not certain which commit was used for the benchmark.

> Interactively, if the response is bad, you can do /undo, then /model other-model, and use the up-arrow to re-send your request.

Productizing this result for interactive use could look like an auto-review feature: give the second model a chance to make changes with a customized review prompt, or let the user explicitly nominate a test to run in this loop.
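
As a purely hypothetical design sketch (none of the names below exist in aider today), the review step might be driven by a small config object plus a review pass that receives the diff and the failing test output:

```python
# Hypothetical design sketch only -- ReviewConfig, review_pass, and all
# field names here are made up for illustration; they are not aider
# options or APIs.
from dataclasses import dataclass

@dataclass
class ReviewConfig:
    reviewer_model: str = "gpt-4o"   # the "second opinion" model
    test_cmd: str = "pytest -x"      # user-nominated test command
    max_review_rounds: int = 2       # cap on back-and-forth attempts
    review_prompt: str = (
        "The edit below did not pass the nominated test. Review the diff "
        "and the test output, then propose a corrected edit."
    )

def review_pass(cfg: ReviewConfig, diff: str, test_output: str, call_model) -> str:
    """Ask the reviewer model for a revised edit, given the failing context.

    `call_model(model, prompt)` is a placeholder for whatever LLM client
    the tool already uses.
    """
    prompt = (
        f"{cfg.review_prompt}\n\n"
        f"Diff:\n{diff}\n\n"
        f"Test output ({cfg.test_cmd}):\n{test_output}"
    )
    return call_model(cfg.reviewer_model, prompt)
```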

@paul-gauthier added the question label on Nov 21, 2024