Thoughts on QA design #7
Precise evaluation is a bit challenging unless we use an LLM-based classifier (which is fine, but can limit who can run the benchmark); a sketch of that idea is below.
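As a concrete illustration of the classifier idea: grade a free-form answer against a reference with a judge model. This is a minimal sketch; `ask_llm` is a hypothetical stand-in for whatever judge-model API would be used, and the prompt wording is illustrative only.

```python
# Minimal LLM-as-judge sketch. `ask_llm` is a hypothetical callable
# (prompt -> response string); the prompt is illustrative, not a spec.
JUDGE_PROMPT = """\
Reference answer: {reference}
Candidate answer: {candidate}
Does the candidate convey the same meaning as the reference? Reply YES or NO."""

def llm_judge(candidate: str, reference: str, ask_llm) -> bool:
    reply = ask_llm(JUDGE_PROMPT.format(reference=reference, candidate=candidate))
    return reply.strip().upper().startswith("YES")
```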
Interesting, but I feel it's a bit hard to extract... not everything has a design pattern.
I feel it would be best to perform tasks that cannot be done by static analyzers (see also #3 (comment))
I assume this is the function search task?
I think this is very interesting! Given a function, we can ask which exceptions it can possibly throw, in multi-choice form (using human annotation, though); a hypothetical item is sketched below.
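For illustration, such an item might look like the following. The schema (field names, option labels, the example function) is a hypothetical sketch, not a decided format:

```python
# Hypothetical multi-choice item asking which exception a function can raise.
# All field names and choices here are illustrative only.
exception_question = {
    "function": (
        "def load_config(path):\n"
        "    with open(path) as f:\n"
        "        return json.load(f)"
    ),
    "question": "Which exception can load_config raise on an existing but malformed file?",
    "choices": ["A) KeyError", "B) json.JSONDecodeError", "C) RecursionError", "D) StopIteration"],
    "answer": "B",  # json.load raises json.JSONDecodeError on malformed JSON
}
```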
Interesting, though I feel this is hard to do automatically -- but how about asking about time and space complexity?
I feel the definition is a bit vague here... Global note: the tasks should involve cross-context as much as possible.
Do we want to make every question a multi-choice problem, so that all questions can be evaluated through exact match? The potential issue is that we would then not be evaluating whether the LLM can generate a valid answer on its own.
I think for "exception" and "complexity" we don't need multi-choice -- just ask the model to follow a certain output format via few-shot examples, and we can parse the result (see the sketch below). That way, we do evaluate whether the LLM can generate a valid answer.
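A minimal sketch of that parsing idea, assuming the few-shot prompt instructs the model to end its response with a line like `Answer: O(n log n)` or `Answer: ValueError` (the format and regex are assumptions, not a decided spec):

```python
import re

# Grab the final "Answer: ..." line from a model response; returns None
# when the model broke the requested format.
ANSWER_RE = re.compile(r"^Answer:\s*(.+?)\s*$", re.MULTILINE)

def parse_answer(response: str):
    matches = ANSWER_RE.findall(response)
    return matches[-1] if matches else None

assert parse_answer("The loop runs n times...\nAnswer: O(n)") == "O(n)"
```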
We can have multi-choice for open-ended questions via manual annotation, I think. As such, we have 4 types of tasks, and we can have 2~4 questions for each depending on how challenging they are to curate.
@ganler I've thought of this a bit. I feel understanding a repository may involve two stages: (1) understanding the repository itself, as its developers do, and (2) understanding how to use it, as its users do.
I guess for RepoQA our focus is (1), though (1) somewhat subsumes part of (2) because developers of a repo should also know how to use it. Focusing on (1), here are several general question types I can think of:
I haven't considered the answer format. These are kinda open-ended questions, and the responses will also be open-ended. Two options: 1) multi-choice only, evaluated using accuracy; 2) natural-language output, evaluated through imprecise metrics (F1 / BLEU / ROUGE) like NLP QA datasets do (a token-level F1 sketch follows below). Also, I believe this process requires either purely human inspection or human-GPT-4 collaboration. I can't think of an automated process, which is also not necessary if we just have several hundred questions.
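For option 2, token-level F1 in the SQuAD style is probably the simplest of those imprecise metrics. A minimal sketch; whitespace tokenization is a simplifying assumption:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """SQuAD-style token-overlap F1 between two free-form answers."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```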
One minor question that I often spend some time figuring out is the version of the language used in the repo. Sometimes it is directly specified in the README, but not always; sometimes it is specified in some build scripts, or you have to work it out yourself (one heuristic is sketched below). Similarly, the versions of dependent artifacts can bother developers as well. Not sure whether that can be made into a RepoQA question.
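As one example of the kind of lookup a developer does here, a minimal sketch that checks `pyproject.toml` for a `requires-python` constraint. This is only one heuristic: real repos may instead declare the version in `setup.py`, `setup.cfg`, CI configs, or only in the README (and `tomllib` itself requires Python 3.11+):

```python
import tomllib  # standard library since Python 3.11
from pathlib import Path

def python_version_constraint(repo_root: str):
    """Return the PEP 621 `requires-python` string, or None if absent."""
    pyproject = Path(repo_root) / "pyproject.toml"
    if not pyproject.is_file():
        return None
    with pyproject.open("rb") as f:
        data = tomllib.load(f)
    return data.get("project", {}).get("requires-python")
```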
Random examples:
How do we justify the design?