Feature Request + eval notes: Cross-Model Validation for Pass@N #2413

Open
e271828- opened this issue Nov 21, 2024 · 2 comments
Labels: question (Further information is requested)

Comments

@e271828-

Issue

In our internal evals, we have seen a substantial improvement from alternating models whenever the oracle (the test) reports a failure. In other words:

  1. Call sonnet with the prompt.
  2. Sonnet produces code that fails the test.
  3. Send (test, test output, sonnet's code) to 4o.
  4. If the test fails again, send (4o's revised code, test, test output) back to sonnet.
  5. Repeat until the test passes.

This converges much more reliably and in fewer steps than using just sonnet in this loop, or using 4o as the initial generator.
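
For concreteness, here is a minimal sketch of the loop, with `call_model()` and `run_tests()` as illustrative placeholders for the LLM client and test harness (they are not aider APIs). The important detail is that the full message chain, including every failing test output, is carried across both models:

```python
# Minimal sketch of the alternating-model repair loop described above.
# `call_model()` and `run_tests()` are placeholders for whatever LLM
# client and test harness you use -- they are not aider APIs.

MODELS = ["claude-3-5-sonnet", "gpt-4o"]  # alternate between these
MAX_ATTEMPTS = 6

def call_model(model: str, messages: list[dict]) -> str:
    """Placeholder: send the chat `messages` to `model`, return candidate code."""
    raise NotImplementedError

def run_tests(code: str) -> tuple[bool, str]:
    """Placeholder: run the nominated test against `code`, return (passed, output)."""
    raise NotImplementedError

def alternating_repair(prompt: str, test: str) -> str | None:
    # One shared context: every candidate and every failing test output is
    # appended here and sent to *both* models, rather than starting fresh.
    messages = [{"role": "user", "content": f"{prompt}\n\nTest:\n{test}"}]

    for attempt in range(MAX_ATTEMPTS):
        model = MODELS[attempt % len(MODELS)]  # sonnet, 4o, sonnet, 4o, ...
        code = call_model(model, messages)
        passed, output = run_tests(code)
        if passed:
            return code

        # Carry the whole chain forward: the failing candidate plus the test
        # output become context for the *other* model on the next attempt.
        messages.append({"role": "assistant", "content": code})
        messages.append({
            "role": "user",
            "content": f"The test still fails.\n\nTest output:\n{output}\n\nPlease fix the code.",
        })

    return None  # did not converge within MAX_ATTEMPTS
```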

Version and model info

No response

@paul-gauthier
Copy link
Collaborator

Thanks for trying aider and filing this issue.

Yes, aider did this for SWE Bench.

https://aider.chat/2024/05/22/swe-bench-lite.html#benchmark-methodology

Interactively, if the response is bad, you can do /undo, then /model other-model, and use the up-arrow to re-send your request.

@e271828-
Author

e271828- commented Nov 21, 2024

Thanks for your work on aider!

https://aider.chat/2024/05/22/swe-bench-lite.html#benchmark-methodology

> ...
> If the solution isn’t plausible, the harness launches aider to try again from scratch, alternating between using aider with GPT-4o and Opus.

This sounds different from our method: we keep appending outputs to the context, rather than starting fresh, and we send the entire result chain between models.

In our evals, this works much better than starting from scratch. More questions converge with the extra context than with a clean state. The test output is also important context; many questions do not converge without it.

I took a quick look at https://github.com/Aider-AI/aider-swe-bench/blob/main/harness.py to confirm what you did, but it only has one model defined, so I'm not certain which commit was used for the benchmark.

> Interactively, if the response is bad, you can do /undo, then /model other-model, and use the up-arrow to re-send your request.

Productizing this result for interactive use could look like an auto-review feature: give the second model a chance to make changes with a customized review prompt, or let the user explicitly nominate a test to run in this loop.
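
As a purely hypothetical design sketch (none of the names below exist in aider today), the review step might be driven by a small config object plus a review pass that receives the diff and the failing test output:

```python
# Hypothetical design sketch only -- ReviewConfig, review_pass, and all
# field names here are made up for illustration; they are not aider
# options or APIs.
from dataclasses import dataclass

@dataclass
class ReviewConfig:
    reviewer_model: str = "gpt-4o"   # the "second opinion" model
    test_cmd: str = "pytest -x"      # user-nominated test command
    max_review_rounds: int = 2       # cap on back-and-forth attempts
    review_prompt: str = (
        "The edit below did not pass the nominated test. Review the diff "
        "and the test output, then propose a corrected edit."
    )

def review_pass(cfg: ReviewConfig, diff: str, test_output: str, call_model) -> str:
    """Ask the reviewer model for a revised edit, given the failing context.

    `call_model(model, prompt)` is a placeholder for whatever LLM client
    the tool already uses.
    """
    prompt = (
        f"{cfg.review_prompt}\n\n"
        f"Diff:\n{diff}\n\n"
        f"Test output ({cfg.test_cmd}):\n{test_output}"
    )
    return call_model(cfg.reviewer_model, prompt)
```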

@paul-gauthier added the question label on Nov 21, 2024