Roadmap for v0.5.0 #79

zimmski · 2024-04-26T13:12:22Z

v0.5.0 is mainly meant to introduce more variety. The main goals are:

Introduce more logical cases, to make sure that "better models" have a bigger difference in score.
Introduce more providers so we can test models that have been requested and react faster to new releases.

Tasks:

Management
- add https://github.com/symflower/eval-dev-quality/milestone/2
- TODO https://github.com/symflower/eval-dev-quality/pulls?q=is%3Aopen+is%3Apr+milestone%3Av0.5.0
Documentation
- Less fuzz and fluff (see Thomas' feedback)
- Bring in the current blog post, its information, especially the blog post image to showcase the evaluation
- Readme extension: The nice thing about generating tests is that it is easy to automatically check if the result is correct. Needs to compile and provide 100% coverage. But one can only write such tests if they understand the source, so implicitly we are evaluating the language understanding of the LLM.
Evaluation
- Measure processing time of queries (Measure processing time of model responses #106; Measure Model response time #105)
- Automate multiple runs for more deterministic results (Multiple runs in a single evaluation #109; Multiple Runs #108)
- Empty responses should be marked as error responses, to indicate that they are on the same level (Empty model responses should be handled as errors #97; Empty responses should not be tested but should fail #92)
- When an LLM successfully solves "plain" and then Go fails but Java works, it is blocked for ALL repositories. In that case it should only be blocked for Go and not Java. (Do not cancel successive runs if previous runs had problems #129)
- Clean up generated files (Reset repository per task #148; Repository not reset for multiple tasks #147; Use Git to avoid copying the repository on each model run #114; Use empty Git config in temporary repositories #146; The git repository change requires the GPG password #145)
- Java
  - Log Maven commands because they can be faulty (remember that the "surefire" plugin needs a fixed version because GitHub's is tooooo old). With that it is easier to debug symflower v36847
- Even powerful models such as GPT4 and Llama3 might return EOF or an error (https://github.com/symflower/eval-dev-quality/tree/105%2B108/evaluation-2024-05-14-09%3A18%3A41), so add retry logic (Give models a retry on error #123; Allow to retry a model when it errors #125)
Support execution of more models
- Ollama support (Integrate Ollama #91; Ollama tool installation #95; Support Ollama provider #96; Fixed Ollama version #117; Ollama version check and update if version is outdated #118; Prepare for Ollama provider #115; Ollama provider #27)
- Allow arbitrary URLs for API provider (Generic OpenAI API provider #111; Generic OpenAI API provider #112)
Reporting and Metrics
- Additional CSVs to sum up overall results and per-language results (Additional CSVs to sum up metrics for all models overall and per language #94; Add additional CSV files that sum up: overall, per-language #83)
- fix, Y axis ticks should be readable (Fix svg Y axis ticks #73)
- fix, Deterministic order of rows in CSV exporting (Sort map by model before creating the CSV output #99; Non deterministic test output leads to flaky CI Jobs #98)
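
For the deterministic CSV order, a minimal sketch of the idea named in #99 (sort by model before writing): Go randomizes map iteration order, so the keys are collected and sorted first. The function name `ScoresCSV` and the two-column layout are assumptions for illustration, not the project's actual CSV format.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// ScoresCSV renders scores per model as CSV with rows sorted by model name.
// The name and the columns are hypothetical.
func ScoresCSV(scores map[string]int) (string, error) {
	// Map iteration order is random in Go, so sort the keys explicitly.
	models := make([]string, 0, len(scores))
	for model := range scores {
		models = append(models, model)
	}
	sort.Strings(models)

	var out strings.Builder
	writer := csv.NewWriter(&out)
	if err := writer.Write([]string{"model", "score"}); err != nil {
		return "", err
	}
	for _, model := range models {
		if err := writer.Write([]string{model, strconv.Itoa(scores[model])}); err != nil {
			return "", err
		}
	}
	writer.Flush()

	return out.String(), writer.Error()
}

func main() {
	out, err := ScoresCSV(map[string]int{"openrouter/b": 2, "openrouter/a": 1})
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Print(out)
}
```

With a fixed row order, repeated runs on the same data produce byte-identical files, which is what keeps CI comparisons from becoming flaky.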
Multi OS support of eval
- Support MacOS (Support MacOS #102)
- Support Windows (Support Windows #103; Unable to run benchmark tasks on windows due to incorrect directory creation syntax #101; fix, Let Windows CI error on failures, and inject API tokens only once per provider #104)
Tools
- Introduce unique ID for addressing tools OS-independently (Introduce an "ID" method to the tool interface #122)
- Extend symflower test with a deeper execution coverage export
  - Go
    - Extract to file
    - Cover lines of tests that have exceptions (fixed with symflower v36800; Require at least symflower v36800 #144)
  - Java
    - Extract to file
    - Cover lines of tests that have exceptions (fixed with symflower v36800; Require at least symflower v36800 #144)
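
A rough sketch of what a unique tool ID could buy: tools are addressed by a stable ID, while the binary name is derived per operating system. The interface shape, `namedTool`, `binaryNameFor` and `Register` are all hypothetical; only the idea of an "ID" method on the tool interface comes from #122.

```go
package main

import (
	"fmt"
	"runtime"
)

// Tool is a hypothetical tool interface with an OS-independent identifier.
type Tool interface {
	// ID returns the unique, OS-independent ID of the tool.
	ID() string
	// BinaryName returns the name of the binary on the current OS.
	BinaryName() string
}

// namedTool is a hypothetical tool whose binary is named after its ID.
type namedTool struct{ id string }

func (t namedTool) ID() string         { return t.id }
func (t namedTool) BinaryName() string { return binaryNameFor(t.id, runtime.GOOS) }

// binaryNameFor derives the binary name for the given operating system.
func binaryNameFor(id string, goos string) string {
	if goos == "windows" {
		return id + ".exe"
	}

	return id
}

// Register adds the tool to the registry under its ID and rejects duplicate IDs.
func Register(registry map[string]Tool, tool Tool) error {
	if _, ok := registry[tool.ID()]; ok {
		return fmt.Errorf("tool %q is already registered", tool.ID())
	}
	registry[tool.ID()] = tool

	return nil
}

func main() {
	registry := map[string]Tool{}
	for _, tool := range []Tool{namedTool{id: "symflower"}, namedTool{id: "ollama"}} {
		if err := Register(registry, tool); err != nil {
			fmt.Println(err)
			return
		}
	}

	tool := registry["ollama"]
	fmt.Println(tool.ID(), tool.BinaryName())
}
```

Callers then look tools up by ID only and never need to know about platform-specific file names.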
Tasks and cases
- Introduce more cases with logic in "light" repository
  - Go (More "write tests" task cases with more complex logic #124)
  - Java (More Java task cases for test generation #134; More "write tests" task cases with more complex logic #124)
Release
- Do a full evaluation with the new version
- Tag version
- Blog post
- Adapt README
- Announce and eat cake


zimmski · 2024-04-26T13:12:35Z

CC @bauersimon

bauersimon · 2024-04-29T12:02:04Z

Add an app-name to the requests so people know we are the eval. https://openrouter.ai/docs#quick-start shows that other openapi-packages implement custom headers, but the one Go package we are using does not implement that. So do a PR to contribute.

seems like a PR does not make sense

bauersimon · 2024-05-27T08:47:33Z

Blogpost idea: misleading comments... how much does it take to confuse the most powerful AI? (credit to @ahumenberger)

ahumenberger · 2024-05-28T05:29:54Z

> Blogpost idea: misleading comments... how much does it take to confuse the most powerful AI? (credit to @ahumenberger)

Maybe not only comments. What about obfuscated code, e.g. function and variable names that are just random strings?

zimmski · 2024-06-06T09:04:40Z

Take a look at https://x.com/dottxtai/status/1798443290913853770

bauersimon · 2024-06-11T06:41:02Z

Looking through logs... Java consistently has more code than Go for the same tasks, which yields more coverage. So a model that solves all Java tasks but no Go tasks is automatically ranked higher than the opposite.


Munsio · 2024-07-30T12:03:57Z

Closed with #297


zimmski added the enhancement New feature or request label Apr 26, 2024

zimmski added this to the v0.5.0 milestone Apr 26, 2024

zimmski self-assigned this Apr 26, 2024

zimmski mentioned this issue Apr 26, 2024

Roadmap for v0.4.0 #35

Closed

30 tasks

zimmski added roadmap Collection of issues for a release and removed enhancement New feature or request labels Jun 17, 2024

zimmski mentioned this issue Jun 17, 2024

Roadmap for v0.6.0 #195

Closed

bauersimon added a commit that referenced this issue Jul 30, 2024

Update README to 0.5.0 blog post

077fa92

Part of #79

bauersimon mentioned this issue Jul 30, 2024

Update README to 0.5.0 blog post #297

Merged

Munsio closed this as completed Jul 30, 2024

bauersimon mentioned this issue Jul 31, 2024

Roadmap for v1.0.0 #301

Open

90 tasks

Munsio pushed a commit that referenced this issue Aug 28, 2024

Update README to 0.5.0 blog post

e90486d

Part of #79

bauersimon mentioned this issue Jan 15, 2025

Roadmap for v1.1.0 #382

Open

27 tasks

zimmski mentioned this issue Jan 15, 2025

Features, bugs, research, ... that we are not actively working on #387

Open
