Thoughts on QA design #7
Precise evaluation is a bit challenging unless we use an LLM-based classifier (which is fine, but can limit who can run the benchmark); a sketch of that idea is below.
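As a concrete illustration of the classifier idea: grade a free-form answer against a reference with a judge model. This is a minimal sketch; `ask_llm` is a hypothetical stand-in for whatever judge-model API would be used, and the prompt wording is illustrative only.

```python
# Minimal LLM-as-judge sketch. `ask_llm` is a hypothetical callable
# (prompt -> response string); the prompt is illustrative, not a spec.
JUDGE_PROMPT = """\
Reference answer: {reference}
Candidate answer: {candidate}
Does the candidate convey the same meaning as the reference? Reply YES or NO."""

def llm_judge(candidate: str, reference: str, ask_llm) -> bool:
    reply = ask_llm(JUDGE_PROMPT.format(reference=reference, candidate=candidate))
    return reply.strip().upper().startswith("YES")
```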
Interesting, but I feel it's a bit hard to extract... not everything has a design pattern.
I feel it would be best to perform tasks that cannot be done by static analyzers (see also #3 (comment))
I assume this is the function search task?
I think this is very interesting! Given a function, we can ask which exceptions it can possibly throw, in multi-choice form (using human annotation, though); a hypothetical item is sketched below.
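For illustration, such an item might look like the following. The schema (field names, option labels, the example function) is a hypothetical sketch, not a decided format:

```python
# Hypothetical multi-choice item asking which exception a function can raise.
# All field names and choices here are illustrative only.
exception_question = {
    "function": (
        "def load_config(path):\n"
        "    with open(path) as f:\n"
        "        return json.load(f)"
    ),
    "question": "Which exception can load_config raise on an existing but malformed file?",
    "choices": ["A) KeyError", "B) json.JSONDecodeError", "C) RecursionError", "D) StopIteration"],
    "answer": "B",  # json.load raises json.JSONDecodeError on malformed JSON
}
```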
Interesting, though I feel this is hard to do automatically -- but how about asking about time and space complexity?
I feel the definition is a bit vague here... Global note: the tasks should involve cross-context as much as possible.
Do we want to make every question a multi-choice problem, so that all questions can be evaluated through exact match? The potential issue is that we would then not be evaluating whether the LLM can generate a valid answer on its own.
I think for "exception" and "complexity" we don't need multi-choice -- just ask the model to follow a certain output format via few-shot examples, and we can parse the result (see the sketch below). That way, we do evaluate whether the LLM can generate a valid answer.
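A minimal sketch of that parsing idea, assuming the few-shot prompt instructs the model to end its response with a line like `Answer: O(n log n)` or `Answer: ValueError` (the format and regex are assumptions, not a decided spec):

```python
import re

# Grab the final "Answer: ..." line from a model response; returns None
# when the model broke the requested format.
ANSWER_RE = re.compile(r"^Answer:\s*(.+?)\s*$", re.MULTILINE)

def parse_answer(response: str):
    matches = ANSWER_RE.findall(response)
    return matches[-1] if matches else None

assert parse_answer("The loop runs n times...\nAnswer: O(n)") == "O(n)"
```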
We can have multi-choice for open-ended questions via manual annotation, I think. As such, we have 4 types of tasks, and we can have 2~4 questions for each depending on how challenging they are to curate.
@ganler I've thought of this a bit. I feel understanding a repository may involve two stages: (1) understanding the repository itself, as its developers do, and (2) understanding how to use it, as its users do.
I guess for RepoQA our focus is (1), though (1) somewhat subsumes part of (2) because developers of a repo should also know how to use it. Focusing on (1), here are several general question types I can think of:
I haven't considered the answer format. These are kinda open-ended questions, and the responses will also be open-ended. Two options: 1) multi-choice only, evaluated using accuracy; 2) natural-language output, evaluated through imprecise metrics (F1 / BLEU / ROUGE) like NLP QA datasets do (a token-level F1 sketch follows below). Also, I believe this process requires either purely human inspection or human-GPT-4 collaboration. I can't think of an automated process, which is also not necessary if we just have several hundred questions.
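For option 2, token-level F1 in the SQuAD style is probably the simplest of those imprecise metrics. A minimal sketch; whitespace tokenization is a simplifying assumption:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """SQuAD-style token-overlap F1 between two free-form answers."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```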
One minor question that I often spend some time figuring out is the version of the language used in the repo. Sometimes it is directly specified in the README, but not always; sometimes it is specified in some build scripts, or you have to work it out yourself (one heuristic is sketched below). Similarly, the versions of dependent artifacts can bother developers as well. Not sure whether that can be made into a RepoQA question.
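As one example of the kind of lookup a developer does here, a minimal sketch that checks `pyproject.toml` for a `requires-python` constraint. This is only one heuristic: real repos may instead declare the version in `setup.py`, `setup.cfg`, CI configs, or only in the README (and `tomllib` itself requires Python 3.11+):

```python
import tomllib  # standard library since Python 3.11
from pathlib import Path

def python_version_constraint(repo_root: str):
    """Return the PEP 621 `requires-python` string, or None if absent."""
    pyproject = Path(repo_root) / "pyproject.toml"
    if not pyproject.is_file():
        return None
    with pyproject.open("rb") as f:
        data = tomllib.load(f)
    return data.get("project", {}).get("requires-python")
```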
Random examples:
How do we justify the design?