This repository contains code for creating a dataset of NLP++ code examples (the final dataset can be found here), as well as code to fine-tune CodeLlama (7B- and 13B-parameter models) for autocompletion. For more information on CodeLlama, see the paper and check out the models on HF.
Due to the relative paucity of NLP++ code (compared to, for example, Python or Java), compiling a dataset of sufficient size (~500B tokens) to pretrain an LLM from scratch is likely not possible, even before considering the computational requirements. The most promising option for creating a code completion model for NLP++ is therefore to adapt an existing model. To this end, there are a few challenges to address:
The recently released CodeLlama is chosen as the foundation model due to its reported state-of-the-art performance and its availability/accessibility. The 7B- and 13B-parameter base versions (i.e., not adapted to Python code or instruction tasks) are used in particular because of their lower computational cost compared to the larger versions (34B+).
Due to the uniqueness of NLP++ code and its (assumed) absence from CodeLlama's training data, there is likely a significant gap between the source and target distributions. To attempt to bridge this gap, the foundation model is fine-tuned with a causal LM objective on a dataset of 1500+ NLP++ code examples compiled from GitHub and the help pages of the Visual Text website. Whether this is the optimal way to leverage the existing data, and whether the distribution gap is surmountable at all with this dataset, remain open questions.
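As a rough sketch of what this causal LM setup looks like with the HF libraries (the data path, file extension, block size, and checkpoint below are placeholders, not the exact values used by this repo's training script), the scraped NLP++ files can be tokenized and packed into fixed-length blocks whose labels are simply a copy of the inputs:

```python
# Rough sketch of causal LM preprocessing, assuming the scraped NLP++ examples
# are plain-text files under data/nlppp/ (path, extension, and checkpoint are
# placeholders). The grouping logic follows the standard HF run_clm recipe.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")

# Treat each scraped NLP++ file as one document.
raw = load_dataset("text", data_files={"train": "data/nlppp/*.nlp"}, sample_by="document")

block_size = 1024  # matches the input size mentioned below


def tokenize(batch):
    return tokenizer(batch["text"])


def group_texts(examples):
    # Concatenate all tokenized examples, then split into fixed-length blocks.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For a causal LM objective, the labels are a copy of the inputs;
    # the model shifts them internally when computing the loss.
    result["labels"] = result["input_ids"].copy()
    return result


tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])
lm_dataset = tokenized.map(group_texts, batched=True)
```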
Although the 7B- and 13B-parameter models are the smallest of the CodeLlama releases, they are still too large to train on a single 40 GB A100 (with bf16, an input size of 1024, a batch size of 1, the Adam optimizer, and gradient checkpointing) without some computational optimizations. Instead of implementing a more efficient adaptation method (for a good overview of these, see here), both models are fine-tuned in a multi-node, multi-GPU setup using the HF Trainer with DeepSpeed's Zero Redundancy Optimizer (you can find the implementation and papers here), with optimizer and parameter offloading.
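Continuing the sketch above, the corresponding Trainer setup looks roughly like the following. The output directory, learning rate, and epoch count are placeholders; the batch size, bf16, gradient checkpointing, and DeepSpeed config path mirror the setup described in this section, but the actual hyperparameters live in run_clm.py and are not reproduced here.

```python
# Sketch of wiring the HF Trainer to DeepSpeed ZeRO-3 with CPU offloading.
# Values marked "placeholder" are illustrative, not copied from run_clm.py.
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf")

training_args = TrainingArguments(
    output_dir="outputs/codellama-7b-nlppp",               # placeholder
    per_device_train_batch_size=1,
    gradient_checkpointing=True,
    bf16=True,
    learning_rate=2e-5,                                     # placeholder
    num_train_epochs=3,                                     # placeholder
    deepspeed="deepspeed_configs/llama_z3_offload.json",    # ZeRO-3 + offloading
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset["train"],  # from the preprocessing sketch above
)
trainer.train()
```

Launched with torchrun across the nodes (as run_multinode.sh does), the Trainer's DeepSpeed integration shards parameters, gradients, and optimizer states across ranks and offloads them to CPU according to the config, which is what makes full fine-tuning of these models fit in the available GPU memory.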
The final model weights can be found here. For examples of using these for code completion, take a look at the HF Code Autocompletion extension and Continue.
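For a quick local test of the fine-tuned weights outside of an editor extension, generation with transformers looks roughly like this (the checkpoint path and prompt below are placeholders):

```python
# Minimal inference sketch: load a fine-tuned checkpoint and complete a
# partial NLP++ snippet. The checkpoint path and prompt are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "path/to/codellama-7b-nlppp"  # placeholder for the released weights
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "@NODES _ROOT\n\n@RULES\n"  # placeholder partial NLP++ pass to complete
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```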
run_multinode.sh: PBS script that launches distributed training on a PBS cluster with torchrun, e.g. qsub run_multinode.sh.
run_clm.py: Training script, adapted from the HF script here.
utils/: Contains utilities for scraping and cleaning data.
scripts/: Misc data and testing scripts.
deepspeed_configs/: Various DeepSpeed configs. llama_z3_offload.json is the config that was ultimately used for fine-tuning (a sketch of the typical contents of such a config follows this list).
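For reference, a ZeRO stage-3 config with optimizer and parameter offloading typically contains fields like the following, shown here as a Python dict (which the HF Trainer also accepts directly via its deepspeed argument). The exact contents and values of llama_z3_offload.json may differ.

```python
# Sketch of a ZeRO-3 DeepSpeed config with CPU offloading of optimizer state
# and parameters. Field values are typical defaults, not copied from
# llama_z3_offload.json; "auto" lets the HF integration fill them in from
# the TrainingArguments.
ds_config = {
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}
```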