repo: entrypoint for working repositories
repoqa: source code for the RepoQA evaluation library
scripts: scripts for maintaining the repository and other utilities
  dev: scripts for CI/CD and repository maintenance
  curate: code for dataset curation
    dep_analysis: dependency analysis for different programming languages
  cherrypick: cherry-picked repositories for evaluation
  demos: demos to quickly use some utility functions such as requesting LLMs
See scripts/cherrypick/README.md for more information.
Tip
Output: Extend scripts/cherrypick/lists.json for a programming language.
python scripts/curate/dataset_ensemble_clone.py
Tip
Output: repoqa-{datetime}.json, with a "content" field (path to content) added for each repo.
Check scripts/curate/dep_analysis for more information.
python scripts/curate/dep_analysis/{language}.py  # e.g., python.py
Tip
Output: {language}.json (e.g., python.json) containing a list of items of the form {"repo": ..., "commit_sha": ..., "dependency": ...}, where "dependency" maps each file path to the paths it imports.
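To make the output shape concrete, here is a minimal sketch that parses a file of this form and counts the imported paths per repository. Only the three field names ("repo", "commit_sha", "dependency") come from the description above; the sample record, its repo name and paths, and the helper name count_imports are hypothetical.

```python
import json

# Hypothetical sample mirroring the documented item shape; the repo name,
# commit SHA, and file paths are made up for illustration.
sample = """
[
  {
    "repo": "example/project",
    "commit_sha": "0123abc",
    "dependency": {
      "pkg/a.py": ["pkg/b.py", "pkg/c.py"],
      "pkg/b.py": []
    }
  }
]
"""


def count_imports(items):
    """Return {repo: total number of imported paths across all of its files}."""
    return {
        item["repo"]: sum(len(deps) for deps in item["dependency"].values())
        for item in items
    }


items = json.loads(sample)
summary = count_imports(items)
print(summary)
```

For a real run, the sample string would be replaced by the contents of the generated {language}.json.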
Note
The {language}.json file should be uploaded as a release.
To fetch the release, go to scripts/curate/dep_analysis/data and run:
gh release download dependency --pattern "*.json" --clobber
python scripts/curate/merge_dep.py --dataset-path repoqa-{datetime}.json
Tip
Input: Download the dependency files into scripts/curate/dep_analysis/data.
Output: Update repoqa-{datetime}.json by adding a "dependency" field for each repository.
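The merge step can be pictured roughly as follows. This is an illustrative sketch, not the logic of merge_dep.py: it assumes each dataset entry is a dict with a "repo" key and that records are matched by repository name, neither of which is confirmed here. The function name merge_dependency and all sample values are hypothetical.

```python
def merge_dependency(repos, dep_items):
    """Attach a "dependency" field to each repo entry, matched by repo name.

    Assumed shapes (not confirmed by the README):
      repos:     list of dicts, each with at least a "repo" key
      dep_items: list of {"repo", "commit_sha", "dependency"} dicts
    """
    by_repo = {item["repo"]: item["dependency"] for item in dep_items}
    for entry in repos:
        # Entries without a matching dependency record are left untouched.
        if entry["repo"] in by_repo:
            entry["dependency"] = by_repo[entry["repo"]]
    return repos


# Hypothetical data for illustration only.
repos = [{"repo": "example/project"}, {"repo": "example/other"}]
dep_items = [
    {
        "repo": "example/project",
        "commit_sha": "0123abc",
        "dependency": {"pkg/a.py": ["pkg/b.py"]},
    },
]
merged = merge_dependency(repos, dep_items)
```

Matching on the repository name alone is the simplest choice; a stricter variant could also compare "commit_sha" if the dataset entries record one.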
# collect functions (in-place)
python scripts/curate/function_analysis.py --dataset-path repoqa-{datetime}.json
# select needles (in-place)
python scripts/curate/needle_selection.py --dataset-path repoqa-{datetime}.json
Tip
Output: The file at --dataset-path is updated in place by adding a "functions" field (path to a list of function information) for each repo.
python scripts/curate/needle_annotation.py --dataset-path repoqa-{datetime}.json
Tip
You need to set the OPENAI_API_KEY environment variable to run GPT-4. You can enable --use-batch-api to save some costs.
Output: --output-desc-path is a separate JSON file specifying the function annotations with their sources.
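Since the annotation step fails without a key, a quick pre-flight check can save a wasted run. The sketch below is a standalone helper, not part of the RepoQA scripts; the function name has_openai_key is hypothetical.

```python
import os


def has_openai_key(env=None):
    """Return True if OPENAI_API_KEY is set to a non-empty value."""
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())


if not has_openai_key():
    print("OPENAI_API_KEY is not set; export it before running needle_annotation.py")
```

The optional env argument exists only so the check can be exercised with a plain dict instead of the real environment.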
python scripts/curate/needle_annotation.py --dataset-path repoqa-{datetime}.json --annotation-path {output-desc-path}.jsonl
Tip
Output: The file at --dataset-path is updated in place by adding a "description" field for each needle function.
pip install pre-commit
pre-commit install
pip install -r requirements.txt
pip install -r scripts/curate/requirements.txt
# Go to the root path of RepoQA
export PYTHONPATH=$PYTHONPATH:$(pwd)