Added new models (OpenAI "0125" versions, Codellama, and more)
Added the ability to evaluate code with the AgentCodeFixer loop (set codefixing_num_rounds > 0)
A different seed is now set automatically for commercial API providers (MistralAI, OpenAI) to avoid their caching mechanism
Re-scored all past submissions with the new methodology
Fixed
Improved code loading and debugging via Julia's code loading mechanism (include_string), which makes it easier to locate the lines that caused an error. Run evaluate(....; verbose=true) to print the offending lines, or pass return_debug=true to get the debug information as a secondary output.
Improved error capture and scoring (e.g., imports of Base modules are now correctly recognized as "safe")
Improved detection of parse errors (i.e., this lowers the score of submissions that previously counted as "executed" only because the parse error went undetected)
Fixed mkdir bug in run_benchmark
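To illustrate why loading code through include_string helps with locating errors, here is a minimal sketch using only Julia's Base. It is not the package's actual implementation: the snippet, the module name Sandbox, and the file name generated_code.jl are all hypothetical. It relies on include_string wrapping failures in a LoadError that records the file name and line number.

```julia
# Hypothetical generated snippet; the third line references an undefined name.
code = """
x = 1
y = 2
z = x + undefined_variable
"""

# Evaluate in a fresh module so the snippet cannot overwrite names in Main.
sandbox = Module(:Sandbox)

# include_string evaluates the code expression by expression and, on failure,
# throws a LoadError carrying the (made-up) file name and the failing line.
err = try
    include_string(sandbox, code, "generated_code.jl")
    nothing
catch e
    e
end

if err isa LoadError
    println("Error at ", err.file, ":", err.line, " (", typeof(err.error), ")")
end
```

The line number in the LoadError is relative to the string passed in, so it maps directly onto the generated code rather than onto the evaluation harness.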
Removed
@timeout macro has been upstreamed to PromptingTools
Case Studies
Quantization effects on Yi34b and Magicoder 7b
Effect of English vs Chinese on performance with Yi34b