
v0.2.0

@svilupp released this on 02 Feb 20:51

Added

  • Added new models (OpenAI "0125" versions, Codellama, and more)
  • Capability to evaluate code with an AgentCodeFixer loop (set codefixing_num_rounds > 0); see the sketch after this list
  • Automatically set a different seed for commercial API providers (MistralAI, OpenAI) to avoid their caching mechanism
  • Re-scored all past submissions with the new methodology
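
A minimal sketch of enabling the new code-fixing loop is below. Only the codefixing_num_rounds keyword and the "0125" model names come from these notes; the module name and the model keyword are assumptions, so check the repository docs for the actual run_benchmark signature.

```julia
# Sketch only: everything except `codefixing_num_rounds` (and the "0125" model
# name) is an assumption; adjust to the real API of this repository.
using JuliaLLMLeaderboard            # assumed module name

run_benchmark(;
    model = ["gpt-3.5-turbo-0125"],  # assumed keyword; one of the newly added "0125" models
    codefixing_num_rounds = 3,       # >0 turns on the AgentCodeFixer self-repair loop
)
```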

Fixed

  • Improved code loading and debugging via Julia's code loading mechanism (include_string), which makes it easier to locate the lines that caused the errors (run evaluate(...; verbose=true) to see which lines caused the errors, or set return_debug=true to return the debug information as a secondary output); see the sketch after this list.
  • Improved error capture and scoring (e.g., imports of Base modules are now correctly recognized as "safe")
  • Improved detection of parse errors (i.e., reduces the scores of submissions that "executed" only because I didn't detect the parsing error earlier)
  • Fixed mkdir bug in run_benchmark
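
For reference, here is a minimal sketch of the debugging flow. The evaluate call is shown only with the keywords mentioned above (its positional argument and return values are assumptions), and the include_string snippet simply illustrates the Base mechanism that produces usable line numbers.

```julia
# Assumed usage of the keywords mentioned above; the positional argument and the
# return shape are guesses, so consult the package docs:
#   evaluate(generated_code; verbose = true)       # print which lines errored
#   evaluate(generated_code; return_debug = true)  # also return the debug info

# The Base mechanism behind the better error locations: include_string with a
# pseudo-filename makes stacktraces point at line numbers inside the LLM output.
code_str = """
function f(x)
    return x + undefined_var   # line 2 of the generated snippet
end
f(1)
"""

sandbox = Module()                   # throwaway module to isolate the evaluation
try
    Base.include_string(sandbox, code_str, "llm_generated_code.jl")
catch err
    # The stacktrace now references "llm_generated_code.jl:2" rather than an opaque eval
    showerror(stderr, err, catch_backtrace())
    println(stderr)
end
```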

Removed

  • The @timeout macro has been upstreamed to PromptingTools

Case Studies

  • Quantization effects on Yi34b and Magicoder 7b
  • Effect of English vs Chinese on performance with Yi34b