Thoughts on QA design #7

Open
UniverseFly opened this issue Mar 9, 2024 · 6 comments

@UniverseFly

Random examples:

  1. Functionality explanation

What's the purpose of `func` in `file`?

  2. Design patterns

What is the design pattern used for `file`? Respond with one of "OOP", "FP", etc.

  3. Dependencies

What libraries are used for the module `module.py`?

  4. Functionality synthesis

Which function/class in this repo is responsible for xxx functionality?

  5. Error handling

How does the method `method` handle the case where xxx error...

  6. Algorithms

Describe the algorithm used in `func`.

  7. Testing

Which test checks for xxx case? What is xxx test for?

  8. General problem solving

What strategy is used to tackle the xxx challenge?

How do we justify the design?

@ganler
Member

ganler commented Mar 9, 2024

Functionality explanation or repository explanation

Precise evaluation is a bit challenging unless we use an LLM-based classifier (which is fine but can limit the usage)

Design patterns

Interesting but I feel it's a bit hard to extract tho... like not everything has a design pattern.

Dependencies

I feel it would be best to perform tasks that cannot be done by static analyzers (see also #3 (comment))
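
To illustrate why a plain "list the dependencies" question adds little, here is a minimal sketch showing that a static analyzer can already answer it. It uses only Python's standard `ast` module; the helper name `imported_modules` is hypothetical and not part of RepoQA.

```python
import ast


def imported_modules(source):
    """Return the top-level names of modules imported by `source`.

    Hypothetical helper for illustration: relative imports (`from . import x`)
    are skipped because they point inside the repository itself.
    """
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                modules.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.level == 0:
                modules.add(node.module.split(".")[0])
    return modules


example = (
    "import os\n"
    "import numpy as np\n"
    "from torch.nn import Linear\n"
    "from . import utils\n"
)
print(sorted(imported_modules(example)))
```

Questions such as "which library is used to calculate the BLEU score?" need reasoning about usage rather than a list of imports, so they are harder to answer this way.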

Functionality synthesis

I assume this is the function search task?

Error handling

I think this is very interesting! Given a function, we can ask which exceptions it can possibly throw, in multi-choice form (using human annotation tho).

Algorithms

Interesting, tho I feel this is hard to do automatically -- but how about we ask about the time and space complexity?

7 & 8

I feel the definition is a bit vague here....

Global note: the task should involve cross-context as much as possible

@ganler ganler added the design label Mar 9, 2024
@UniverseFly
Author

Do we want to make every question a multi-choice problem, so that all Qs can be evaluated through exact match? The potential issue is that we would then not be evaluating whether the LLM can generate a valid answer.
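
A minimal sketch of what exact-match scoring over multi-choice questions could look like. The `MultiChoiceQuestion` record and the single-letter answer labels are assumptions for illustration, not a format the project has settled on.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MultiChoiceQuestion:
    # Hypothetical record layout; field names are illustrative only.
    prompt: str
    choices: List[str]
    answer: str  # label of the correct choice, e.g. "B"


def exact_match_accuracy(questions, predictions):
    """Fraction of predictions whose label equals the reference label."""
    if len(questions) != len(predictions):
        raise ValueError("questions and predictions must have the same length")
    if not questions:
        return 0.0
    correct = sum(
        q.answer == p.strip().upper() for q, p in zip(questions, predictions)
    )
    return correct / len(questions)


questions = [
    MultiChoiceQuestion("Which test framework is used?", ["unittest", "pytest"], "B"),
    MultiChoiceQuestion("Which paradigm dominates?", ["OOP", "FP", "Neither"], "C"),
]
print(exact_match_accuracy(questions, [" b ", "A"]))
```

Scoring is trivial here, which is the appeal. The cost is the one raised above: the model only selects among given answers and never has to produce one.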

@ganler
Member

ganler commented Mar 9, 2024

I think for "exception" and "complexity" we don't need multi-choice -- just ask the model to follow a certain format via few-shot examples and we can parse the output. That way, we still evaluate whether the LLM can generate a valid answer.
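
As a rough sketch of the parsing idea, suppose the few-shot examples teach the model to answer with lines such as `Exceptions: ValueError, KeyError` and `Time: O(n log n)`. That line format and the function names below are assumptions made for illustration; the thread does not fix a format.

```python
import re


def parse_exceptions(response):
    """Return the set of exception names on the 'Exceptions:' line, or None."""
    match = re.search(r"^Exceptions:\s*(.*)$", response, flags=re.MULTILINE)
    if match is None:
        return None  # the model did not follow the expected format
    return {name.strip() for name in match.group(1).split(",") if name.strip()}


def parse_complexity(response, kind="Time"):
    """Return the big-O expression on the '<kind>:' line with spaces removed."""
    pattern = rf"^{re.escape(kind)}:\s*(O\(.*\))\s*$"
    match = re.search(pattern, response, flags=re.MULTILINE)
    if match is None:
        return None
    # Crude normalization so "O(n log n)" and "O(n  log n)" compare equal.
    return re.sub(r"\s+", "", match.group(1))


response = "Exceptions: ValueError, KeyError\nTime: O(n log n)\nSpace: O(1)"
print(parse_exceptions(response))
print(parse_complexity(response, "Time"))
print(parse_complexity(response, "Space"))
```

A response that ignores the format parses to `None` and can be counted as wrong, so format-following becomes part of what is measured.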

@ganler
Member

ganler commented Mar 9, 2024

We can have multi-choice for open-ended questions, curated via manual annotation, I think.

As such, we have 4 types of tasks and we can have 2~4 for each, depending on how challenging it is to curate.

@UniverseFly UniverseFly self-assigned this Mar 9, 2024
@UniverseFly
Author

UniverseFly commented Mar 10, 2024

For each repository, we want to (systematically) support a few types of tasks by asking "What would a developer want to know when they first come to a new repository?"

@ganler I've thought of this a bit. I feel understanding a repository may involve two stages:

  1. understand how its code is built (from repo developers' viewpoints)
  2. know how to use the repo (from repo users' viewpoints)

Taking transformers as an example, it can be:

  1. how does Trainer uniformly handle decoder-only and encoder-decoder models? What trick does it use?
  2. how to initialize a Trainer to train a GPT model?

I guess for RepoQA our focus is (1), though (1) somewhat subsumes part of (2) because developers of a repo should also know how to use it.

Focusing on (1), here are several general question types I can think of:

  1. Purpose understanding
     • What does the project do?
     • What problem does it solve..
  2. Basic usage understanding
     • What are the main entry points of the repository, or is it just a library without any entries?
  3. Dependencies (optional, not simply listing the dependencies)
     • What library/framework is the project based on? (e.g., transformers is based on torch, tensorflow, jax, etc., with torch being the one it depends on most)
     • Which library is used to calculate BLEU score?
  4. Architecture
     • (optional) What design strategy is used extensively, OOP/FP/etc.?
     • What is the class hierarchy for AutoModelForCausalLM?
     • How does the library unify torch and tensorflow inputs?
     • Which abstract class is defined for Trainer?
  5. Functionalities (overlaps a bit with func search and "know how to use the repo"?)
     • What universal method is used for generating outputs for a model?
     • Which method should the user override to define a custom training loss?
     • Why is the code snippet "..." commented?
     • Error handling & Algorithms..
  6. Language features
     • What's the purpose of the list comprehension code: ' ... ' in class C?
     • Why is unsafe used for fn xxx?
  7. Testing
     • Which edge cases are considered when testing fun?
     • What test framework is used, e.g., pytest, unittest?

I haven't considered the answer format. These are kinda open-ended questions and the responses will also be open-ended. Two options: 1) just multi-choice, evaluated using accuracy; 2) NL output, evaluated through imprecise metrics (F1 scores / BLEU / ROUGE), like what NLP QA datasets do.
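
For a sense of what option 2 involves, here is a minimal sketch of a token-level F1 score in the style commonly used by NLP QA benchmarks. The whitespace tokenization and lowercasing are simplifying assumptions; real evaluation scripts usually also strip punctuation and articles.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-overlap F1 between a predicted answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("uses pytest", "pytest"))
```

The metric rewards shared words rather than shared meaning, which is why it counts as imprecise: a correct paraphrase can score low and a wrong answer that reuses the right vocabulary can score high.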

Also, I believe this process requires either exclusively human inspection or human-GPT-4 collaboration. I can't think of an automated process, which is also not necessary if we just have several hundred questions.

@claudeyj
Collaborator

One minor question that I often spend some time figuring out is the version of the language used in the repo. Sometimes it is directly specified in the README, but not always. Sometimes it is specified in build scripts, or you may need to work it out yourself. Similarly, the versions of dependent artifacts can bother developers as well. Not sure whether that can be made into a RepoQA question.
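
For Python repositories, the easy case can be sketched as below: the version constraint is declared in a build file, either as `requires-python` in `pyproject.toml` or as `python_requires` in `setup.py`. The helper name is hypothetical, and the regex approach is a simplification that ignores unquoted values (for example in `setup.cfg`) and versions that are only implied by the syntax the code uses.

```python
import re
from typing import Optional

# Patterns for the two common quoted declarations.
_VERSION_PATTERNS = [
    re.compile(r"requires-python\s*=\s*[\"']([^\"']+)[\"']"),  # pyproject.toml
    re.compile(r"python_requires\s*=\s*[\"']([^\"']+)[\"']"),  # setup.py
]


def declared_python_version(build_file_text: str) -> Optional[str]:
    """Return the declared Python version constraint, or None if absent."""
    for pattern in _VERSION_PATTERNS:
        match = pattern.search(build_file_text)
        if match:
            return match.group(1)
    return None


print(declared_python_version('[project]\nrequires-python = ">=3.8"\n'))
print(declared_python_version("setup(name='x', python_requires='>=3.7,<4')"))
```

When this returns `None`, the version has to be inferred from other evidence, which is the harder case described above.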

3 participants