Code for ESEC/FSE 2023 demonstration paper, A Language Model trained on Java Methods with Train/Test Deduplication
This repository contains all the code and detailed instructions to rebuild the Jam models in our HuggingFace Automatic Program Comprehension Lab hub. You can either go through the entire process from scratch, including tokenization of the raw source code data, or just fine-tune the models that we provide using the dataset that we release as tokenized bins. We also provide scripts for deduplication of any future test sets.
Need help? Contact us via our Discord Server
We thank Andrej Karpathy and Daniel Grittner for their work providing the NanoGPT and NanoGPT-LoRA code. This repository forks from Daniel Grittner's NanoGPT-LoRA repository, which is itself a fork of the original NanoGPT by Andrej Karpathy.
- To-do list
- Pre-trained Model Checkpoints
- Fine-tuning
- Deduplication toolkit
- Inference
- Dataset
- Entire process
- Re-Training
To set up your local environment, run the following command. We recommend using a virtual environment for running the experiments.
pip install -r requirements.txt
- If you only want to finetune one of our pre-trained models, please see Pre-trained Model Checkpoints, Fine-tuning, and Inference. We additionally recommend using the Deduplication toolkit before running inference on your own test set.
- If you only want to deduplicate your dataset, please see the Deduplication toolkit section (subsection 6.3).
- If you want to re-train a model using our processed and tokenized dataset, please see Re-Training.
- If you want to train from scratch by reprocessing the dataset, please see Entire process and Re-Training.
We release the models that we pre-trained.
Model | Description | Link |
---|---|---|
jam | This model is trained on jm52m only and trained for one epoch, which is ∼300,000 iterations. | link |
jam-so | This model is trained on so13m only and trained for one epoch, which is ∼300,000 iterations. | link |
jam-sojm | This model is trained on so13m and then jm52m for one epoch each after resetting the learning rate and decay. | link |
You can use the script that we provide to download the models that we trained, for fine-tuning or other applications.
python3 download.py --repo_id=apcl/jam --repo_type=model
Please replace the --repo_id tag with apcl/jam_so or apcl/jam_sojm, depending on the model you wish to fine-tune.
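If you prefer to call the Hugging Face Hub API yourself rather than use download.py, the same download can be done with huggingface_hub. This is a minimal sketch under the assumption that download.py is a thin wrapper around snapshot_download; the local_dir path is just an example.

```python
from huggingface_hub import snapshot_download

# Download a Jam checkpoint straight from the Hugging Face Hub.
# We assume download.py is essentially a thin wrapper around snapshot_download;
# swap repo_id for apcl/jam_so or apcl/jam_sojm as needed.
local_path = snapshot_download(
    repo_id="apcl/jam",      # or "apcl/jam_so" / "apcl/jam_sojm"
    repo_type="model",
    local_dir="./jam",       # hypothetical target directory
)
print(f"checkpoint files downloaded to {local_path}")
```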
These steps will show you how to fine-tune for the code summarization application from our paper. You can hack these scripts to do whatever you need for your own task.
Dataset | Description | Link |
---|---|---|
funcom-java-long | A dataset for source code summarization by Bansal et al., made available pre-publication, which we use to fine-tune and test our model for source code summarization. This dataset is also annotated as "q90" in our scripts. | link |
Please cite the use of the funcom-java-long dataset as follows:
@article{bansal2023human,
title={Towards modeling human attention from eye movements for neural source code summarization},
author={Bansal, Aakash and Sharif, Bonita and McMillan, Collin},
journal={Proceedings of ACM Human-Computer Interaction, Vol. 7},
year={2023}
}
To fine-tune on the funcom-java-long dataset, run the following command.
python3 train.py config/finetune_funcom.py
Note that you need to change out_dir in finetune_funcom.py to the same directory as your --local_dir.
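The files in config/ are plain Python override files in the NanoGPT style, so this change amounts to editing one assignment. An illustrative excerpt (only out_dir is taken from the real config; the path is a placeholder):

```python
# config/finetune_funcom.py (illustrative excerpt)
# out_dir must point at the directory holding the downloaded checkpoint,
# i.e. the same path you passed to download.py as --local_dir.
out_dir = './jam'   # placeholder; set this to your --local_dir
```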
We release our test set as a .tar.gz file in the apcl/funcom-java-long repository. You can simply run the following command to download and extract the test set for inference.
python3 download_extract_file.py
--repo_id: the ID of the repository from which you want to download files
--local_dir: the directory where you want to put the files
--filename: the name of the file that you want to download
We have already set the default parameters to the values required for downloading the test set. If you just want to download and extract the test set, you only need to run the command above.
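If you prefer to script this step yourself, the same download-and-extract behaviour can be reproduced with huggingface_hub and the standard library. This is only a sketch: the archive file name below is a placeholder, since download_extract_file.py already defaults to the real one.

```python
import tarfile
from huggingface_hub import hf_hub_download

# Fetch the test-set archive from the dataset repo and unpack it.
# "test.tar.gz" is a placeholder; check apcl/funcom-java-long for the real
# archive name (download_extract_file.py defaults to it).
archive = hf_hub_download(
    repo_id="apcl/funcom-java-long",
    repo_type="dataset",
    filename="test.tar.gz",            # placeholder file name
    local_dir="./data/funcom_test",
)
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall(path="./data/funcom_test")
```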
To deduplicate your test data against the Java methods training set, use the following command.
python3 data/jam_jm52m/dedup_fctest.py --test_filename=tdats.test --lsh_dir=fc_lsh_parts --threshold=0.5 --dedup_outfile=dedup_testfids.txt --fundats_file=fundats-j1.pkl
--test_filename: file name of your test file
--lsh_dir: directory for the LSH files
--threshold: controls the similarity level; 0.7 is a good threshold for the 52 million Java methods
--dedup_outfile: output file listing each function ID together with the IDs of its duplicate functions
--fundats_file: a pickle file containing a dictionary of raw function code, with key = function ID and value = raw code
--partstart: split the deduplication across several processes to speed it up; minimum value 0
--partend: split the deduplication across several processes to speed it up; maximum value 50
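The --lsh_dir and --threshold options reflect the MinHash-based locality-sensitive hashing that the deduplication builds on. Below is a minimal sketch of that idea using the datasketch library; it illustrates the technique rather than reproducing dedup_fctest.py, and the whitespace shingling and example snippets are only for demonstration.

```python
from datasketch import MinHash, MinHashLSH

def minhash_of(code: str, num_perm: int = 128) -> MinHash:
    # Hash a shingled function body into a MinHash signature.
    m = MinHash(num_perm=num_perm)
    for token in code.split():          # simple whitespace shingling for illustration
        m.update(token.encode("utf-8"))
    return m

# Index the training functions (fundats maps function id -> raw code).
lsh = MinHashLSH(threshold=0.5, num_perm=128)   # threshold mirrors --threshold
fundats = {1: "public int add(int a, int b) { return a + b; }"}  # toy example
for fid, code in fundats.items():
    lsh.insert(str(fid), minhash_of(code))

# Any test function whose query returns matches is flagged as a near-duplicate.
test_code = "public int add(int x, int y) { return x + y; }"
duplicates = lsh.query(minhash_of(test_code))
print("near-duplicate training ids:", duplicates)
```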
To deduplicate your test data against the Stackoverflow posts, use the following command.
python3 data/jam_so13m/dedup_stackoverflow.py --stackoverflow_text_id_filename=stackoverflow_txtfiles.pkl --fundats_file=fundats-j1.pkl --stackoverflow_text_filename=jam_so13m.pkl --dedup_outfile=dedup_testfids.txt --threshold=0.5 --test_filename=tdats.test
--stackoverflow_text_id_filename: a pickle file containing a list of Stackoverflow file names
--fundats_file: a pickle file containing a dictionary of raw function code, with key = function ID and value = raw code
--stackoverflow_text_filename: a pickle file containing a dictionary of Stackoverflow posts, with key = post ID and value = post text
--dedup_outfile: output file listing each function ID together with the IDs of its duplicate functions
--threshold: controls the similarity level
--test_filename: file name of your test file
--lsh_outdir: directory for the LSH files
--partstart: split the deduplication across several processes to speed it up; minimum value 0
--partend: split the deduplication across several processes to speed it up; maximum value 100
After you download and deduplicate the test set, you can simply run the command below for inference.
python sample_funcom.py --out_dir=outdir
--out_dir: directory of the model that you want to use for inference
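Because the repository forks NanoGPT, sample_funcom.py presumably follows the standard NanoGPT sampling loop: load ckpt.pt from out_dir, rebuild the GPT, and generate from an encoded prompt. The sketch below shows that generic pattern only; the module imports, checkpoint keys, and prompt are assumptions, not a copy of the actual script.

```python
import torch
import tiktoken
from model import GPT, GPTConfig   # NanoGPT modules assumed to be importable

# Load the checkpoint saved by train.py (NanoGPT-style keys assumed).
ckpt = torch.load("outdir/ckpt.pt", map_location="cpu")
model = GPT(GPTConfig(**ckpt["model_args"]))
state_dict = {k.removeprefix("_orig_mod."): v for k, v in ckpt["model"].items()}
model.load_state_dict(state_dict)
model.eval()

# Encode a prompt with the GPT-2 BPE tokenizer and sample a continuation.
enc = tiktoken.get_encoding("gpt2")
prompt = "public int add(int a, int b) { return a + b; }"  # hypothetical prompt
idx = torch.tensor([enc.encode(prompt)], dtype=torch.long)
with torch.no_grad():
    out = model.generate(idx, max_new_tokens=50, temperature=0.8, top_k=200)
print(enc.decode(out[0].tolist()))
```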
We release two datasets that we use to pre-train our models. You can use the scripts that we provide to download these datasets automatically.
Dataset | Description | Link |
---|---|---|
jm52m | jm52m is a dataset we created containing 52m Java methods from 52k Java projects. The source code originated from the Merobase and Sourcerer data releases, supplemented by our own prior work in LeClair et al. It contains code uploaded to code repositories between 2008 and 2018. We extracted every Java method from every file and project, then removed empty methods, methods from corrupt files, and methods with parsing errors. | link |
so13m | so13m is a dataset containing 13m discussion threads from StackOverflow. The data originates from the StackExchange data dump covering January 2014 through December 2022. The threads cover a multitude of topics, providing natural language and (often) accompanying code in the domain of software engineering. Its inclusion could help downstream tasks that depend on generating or understanding natural language. | link |
To download the required datasets automatically, you can run the following command.
python3 download.py --repo_id=apcl/jm52m --local_dir=./data/yourdir --repo_type=dataset
This will download all the files in the repository. If you only want to download specific files, you can simply run the following command instead. Note: the above command downloads the entire dataset, including the deduplication files (over 200 GB in total). If you just want the raw data, use the --filename parameter as in the next command. Specific files you might want for the raw data are fundats-j1.json.gz and jm52m.sql.gz. Or you may wish to re-train your own Jam models using our processed and tokenized data in train.bin and val.bin.
Our raw data includes full traceability from the methods to the files and projects where those methods originate. The code for each method is in fundats-j1 as a Python dictionary. You may download either the JSON or the pickle version. The key for each method is an ID number. That ID number is the id field of the functionalunits table in the SQL dump. Other fields should be self-explanatory.
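For example, once you have downloaded fundats-j1.pkl, looking up the raw source of a method by its functionalunits id is a plain dictionary access (the path below depends on where you put the file):

```python
import pickle

# fundats-j1.pkl maps function id -> raw Java source of that method.
with open("./data/jam_jm52m/fundats-j1.pkl", "rb") as f:
    fundats = pickle.load(f)

some_id = next(iter(fundats))        # any valid id; also the `id` field in functionalunits
print(some_id, fundats[some_id])     # raw Java source of that method
```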
python3 download.py --repo_id=apcl/jm52m --filename=file.pkl --local_dir=./data/yourdir --repo_type=dataset
--repo_id: either apcl/jm52m or apcl/so13m; apcl/jm52m is the 52 million Java methods and apcl/so13m is the 13 million Stackoverflow posts
--filename: the name of the file that you want to download
--local_dir: the directory where you want to put the files
--repo_type: the type of repository that you are downloading from; set to dataset when downloading files from a dataset repository
Again, you only need train.bin and val.bin if you want to train your Jam models from scratch on our already-tokenized data instead of going through the entire process; see the Re-Training section for details. If you want to go through the entire process instead, see the Entire process section.
To go through the entire process, you will need an extra step: generating the bin files on your own and then using those files to train your models.
To generate the bin files for the 52 million funcom Java methods, run the following command.
python3 data/jam_jm52m/prepare_fc_raw.py --num-proc=4 --q90testfids-file=q90testfids.pkl --fundats-file=fundats-j1.pkl
--num-proc: number of workers in the .map() call
--q90testfids-file: file of funcom Java methods test-set IDs
--fundats-file: name of the Java methods raw code file; a dictionary file with key = function ID and value = raw code
You will need to download q90testfids.pkl (the IDs of the Java methods in the test set) and fundats-j1.pkl (the raw code of the Java methods). You can download both files from the apcl/jm52m repository; refer to the Dataset section to see how to download them with the script that we release.
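The prepare scripts presumably follow the usual NanoGPT data pipeline: tokenize each document with the GPT-2 BPE, concatenate the token ids, and write them out as a uint16 array that train.py can memory-map. A hedged sketch of that pattern (illustrative only, not the exact prepare_fc_raw.py code):

```python
import numpy as np
import tiktoken

# Tokenize documents with GPT-2 BPE, append an end-of-text separator per
# document, and dump the concatenated ids to disk as uint16.
enc = tiktoken.get_encoding("gpt2")
docs = ["public int add(int a, int b) { return a + b; }"]  # stand-in for the real corpus

ids = []
for doc in docs:
    ids.extend(enc.encode_ordinary(doc))
    ids.append(enc.eot_token)              # document separator

arr = np.array(ids, dtype=np.uint16)       # GPT-2 vocab (50257 tokens) fits in uint16
arr.tofile("train.bin")                    # val.bin is built the same way from a held-out split
```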
You can run the following command to generate the bin files for the 13 million Stackoverflow posts.
python3 data/jam_so13m/prepare_stackoverflow.py --num-proc=4 --stackoverflow_filename=jam_so13m.pkl
--stackoverflow_filename: name of the file for the Stackoverflow data; a dictionary file with key = post ID and value = post text
After the script is done, you will have both train.bin and val.bin in either the data/jam_jm52m or data/jam_so13m directory. Be sure to move them to the same directory as train.py.
After generating the bin files, you can refer to step 2 of the Re-Training section to train your models.
You will need both train.bin and val.bin to train your models. The bin files can be downloaded with the following command.
python3 download.py --repo_id=apcl/jm52m --filename={train.bin | val.bin} --local_dir=./data/yourdir --repo_type=dataset
Note that you will need to put these two files into the same directory as train.py.
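train.py consumes those files by memory-mapping them and slicing random token windows (the usual NanoGPT approach), which is why they have to sit where the script expects them. Roughly, with illustrative values rather than the exact training code:

```python
import numpy as np
import torch

block_size, batch_size = 256, 4   # illustrative values, not the paper's hyperparameters

# Memory-map the token ids and sample random (input, target) windows.
data = np.memmap("train.bin", dtype=np.uint16, mode="r")
ix = torch.randint(len(data) - block_size, (batch_size,))
x = torch.stack([torch.from_numpy(data[i:i + block_size].astype(np.int64)) for i in ix])
y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + block_size].astype(np.int64)) for i in ix])
# x holds the input tokens, y the next-token targets shifted by one position.
```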
If you want to train your own models from scratch and you only have one GPU, use the following command to train the model.
python3 train.py config/{train_funcom_raw | train_stackoverflow}.py
train_funcom_raw: training with the 52 million funcom Java methods
train_stackoverflow: training with the 13 million Stackoverflow posts from scratch
If you have multiple GPUs, use the following command to train the model.
torchrun --rdzv-backend=c10d --rdzv-endpoint=localhost:0 --nnodes=1 --nproc-per-node=1 train.py config/train_funcom_raw.py --out_dir=jam350m_jm
You may want to refer to this document to change the port number for rdzv-endpoint if you run multiple instances on the same machine. Otherwise, two different training instances will end up updating the same model weights.
If you use this work in an academic paper, please cite the following:
@inproceedings{su2023language,
title={A Language Model of Java Methods with Train/Test Deduplication},
author={Chia-Yi Su and Aakash Bansal and Vijayanta Jain and Sepideh Ghanavati and Collin McMillan},
month={December},
year={2023},
booktitle={Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering},
location = {San Francisco, CA, USA},
series = {ESEC/FSE 2023}
}
PDF available here: https://arxiv.org/abs/2305.08286
We recommend a GPU of the NVIDIA Ampere architecture or newer, because the bfloat16 format is essential for efficient computation with our scripts. For older GPUs, the float32 format can be used; however, VRAM requirements are higher and computation is slower in that format. We used NVIDIA A5000 GPUs, though an A4000 should be sufficient.
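If you are unsure whether your GPU supports bfloat16, PyTorch can check at runtime; below is a small sketch of choosing the dtype accordingly (NanoGPT-style training scripts typically expose a dtype setting in their configs).

```python
import torch

# Ampere (and newer) GPUs support bfloat16; older cards fall back to float32.
if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    dtype = "bfloat16"   # efficient path recommended above
else:
    dtype = "float32"    # works on older GPUs, but slower and needs more VRAM
print(f"using dtype: {dtype}")
```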