
Improved Score Calculation with LLMs #141

Open · 8 tasks
GODrums opened this issue Nov 9, 2024 · 0 comments

GODrums (Collaborator) commented Nov 9, 2024

Objective

We have noticed that our current score calculation for the leaderboard doesn't appropriately take the invested reviewing effort into account. For example, manual tests on Artemis servers take quite a long time, yet they yield the same score as, or less than, a simple code review with a few nit-picks.

A fair effort evaluation often requires decent Natural Language Processing (NLP) and a well-trained LLM. Since estimates with a reasonable degree of accuracy are sufficient for our score-calculation use case, we aim to approximate this evaluation through prompt engineering on general-purpose LLMs such as GPT-4o.

General idea

  • First: classification step - decide whether the review was manual testing, a code review, etc.
  • Next: LLM complexity assessment - context-based scoring. Provide e.g. the PR as context in the prompt and ask the LLM for an evaluation (see the sketch after this list).
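
A minimal sketch of what these two steps could look like, assuming the OpenAI Python SDK and GPT-4o; the prompt wording, function names, and the 1-10 effort scale are illustrative placeholders, not the final design:

```python
# Sketch of the two-step idea (assumptions: OpenAI Python SDK >= 1.0,
# model "gpt-4o", prompts and scale chosen for illustration only).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

REVIEW_KINDS = ["CODE_REVIEW", "MANUAL_TEST", "OTHER"]


def _ask(system: str, user: str) -> str:
    """Send a single system/user prompt pair and return the model's text answer."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return resp.choices[0].message.content.strip()


def classify_review(review_text: str) -> str:
    """Step 1: decide what kind of review this was."""
    answer = _ask(
        "Classify the following pull request review. "
        f"Answer with exactly one of: {', '.join(REVIEW_KINDS)}.",
        review_text,
    )
    return answer if answer in REVIEW_KINDS else "OTHER"


def assess_effort(review_text: str, pr_context: str) -> int:
    """Step 2: context-based scoring, with the PR provided as prompt context."""
    answer = _ask(
        "Given the pull request below, rate the reviewing effort of the review "
        "on a scale from 1 (trivial) to 10 (very involved). "
        "Answer with a single integer.",
        f"Pull request:\n{pr_context}\n\nReview:\n{review_text}",
    )
    try:
        return max(1, min(10, int(answer)))
    except ValueError:
        return 1  # fall back to the lowest effort if the answer isn't numeric
```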

Tasks

  • Prompt engineering
    • Endpoint for Score-Evaluation per LLM request (see the sketch after this list)
    • Reviews as prompt
    • Provide prompt context for the LLM
  • LLM in score calculation
    • Request evaluation from the LLM endpoint
    • Store LLM score/response in the review model
    • Switch from the old scoring algorithm to the new one in production
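
As a rough illustration of the "Endpoint for Score-Evaluation per LLM request" task, a hypothetical FastAPI endpoint could wrap the two steps sketched above; the route, request/response fields, and the `score_llm` module name are assumptions, not the project's actual API:

```python
# Hypothetical score-evaluation endpoint; route and field names are placeholders.
from fastapi import FastAPI
from pydantic import BaseModel

from score_llm import assess_effort, classify_review  # the sketch above, assumed to live in score_llm.py

app = FastAPI()


class EvaluationRequest(BaseModel):
    review_text: str
    pr_context: str  # e.g. PR title, description, and a diff summary


class EvaluationResponse(BaseModel):
    review_kind: str   # e.g. CODE_REVIEW, MANUAL_TEST, OTHER
    effort_score: int  # 1-10, as returned by the LLM
    raw_response: str  # compact representation, kept so it can be persisted on the review model


@app.post("/evaluation/score", response_model=EvaluationResponse)
def evaluate_review(req: EvaluationRequest) -> EvaluationResponse:
    kind = classify_review(req.review_text)                   # step 1: classification
    score = assess_effort(req.review_text, req.pr_context)    # step 2: complexity assessment
    return EvaluationResponse(
        review_kind=kind,
        effort_score=score,
        raw_response=f"{kind}:{score}",
    )
```

The returned score and response would then be stored on the review model and fed into the new scoring algorithm, which can replace the old one in production once it proves reliable.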