This sounds different from our method: we keep appending outputs to the context rather than starting fresh, and we send the entire result chain between models.
In our evals, this works much better than starting from scratch. More questions converge with the extra context than with a clean state. The test output is also important context; many questions do not converge without it.
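To make the loop concrete, here is a minimal sketch of the idea as described above. It is not our actual eval harness. The names `alternate_until_pass`, `Model`, and `Oracle` are hypothetical, and it assumes each model is a callable that receives the full context and that the oracle returns a pass flag plus its test output.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical interfaces, assumed for illustration only.
Model = Callable[[List[str]], str]          # full context in, candidate out
Oracle = Callable[[str], Tuple[bool, str]]  # candidate in, (passed, test output) out


def alternate_until_pass(
    task: str,
    models: Sequence[Model],
    oracle: Oracle,
    max_steps: int = 6,
) -> Tuple[Optional[str], List[str]]:
    """Rotate through `models` until the oracle passes or steps run out.

    The context is never reset: every candidate and every oracle output is
    appended, so the next model sees the entire result chain.
    """
    context: List[str] = [task]
    for step in range(max_steps):
        model = models[step % len(models)]  # switch model on each attempt
        candidate = model(context)
        context.append(candidate)
        passed, test_output = oracle(candidate)
        context.append(test_output)  # test output is kept as context too
        if passed:
            return candidate, context
    return None, context
```

With two models in `models`, a failed oracle run hands the whole chain, including the failing test output, to the other model on the next step.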
Interactively, if the response is bad, you can run /undo, then /model other-model, and press the up arrow to re-send your request.
Productizing this result for interactive use could look like an auto-review feature, giving the second model a chance to make changes with a customized review prompt, or explicitly nominating a test to run in this loop.
Issue
In our internal evals, we have seen a substantial improvement from alternating models when an oracle fails. In other words:
This converges much more reliably and in fewer steps than using only Sonnet in this loop, or using 4o as the initial generator.
Version and model info
No response