This repository contains code for creating a dataset of NLP++ code examples (the final dataset can be found here), as well as code to fine-tune CodeLlama (7B- and 13B-parameter models) for autocompletion. For more information on CodeLlama, see the paper and check out the models on HF.
Due to the relative paucity of NLP++ code (compared to, for example, Python or Java), compiling a dataset of sufficient size (~500B tokens) to pretrain an LLM from scratch is likely not possible, even before considering the computational requirements. The most promising option for creating a code completion model for NLP++ is therefore to adapt an existing model. To this end, there are a few challenges to address:
The recently released CodeLlama is chosen as the foundation model due to its reported state-of-the-art performance and its availability/accessibility. The 7B- and 13B-parameter base versions (i.e., not adapted to Python code or instruction tasks) are used in particular because of their lower computational cost compared to the larger versions (34B+).
Due to the uniqueness of NLP++ code and its (assumed) absence from CodeLlama's training data, there is likely a significant gap between the source and target distributions. To attempt to bridge this gap, the foundation model is fine-tuned with a causal LM objective on a dataset of 1500+ NLP++ code examples compiled from GitHub and the help pages of the Visual Text website. Whether this is the optimal way to leverage the existing data, and whether the distribution gap is surmountable at all with this dataset, remain open questions.
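As a rough sketch of what this causal LM setup looks like with the HF libraries (the data path, file extension, block size, and checkpoint below are placeholders, not the exact values used by this repo's training script), the scraped NLP++ files can be tokenized and packed into fixed-length blocks whose labels are simply a copy of the inputs:

```python
# Rough sketch of causal LM preprocessing, assuming the scraped NLP++ examples
# are plain-text files under data/nlppp/ (path, extension, and checkpoint are
# placeholders). The grouping logic follows the standard HF run_clm recipe.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")

# Treat each scraped NLP++ file as one document.
raw = load_dataset("text", data_files={"train": "data/nlppp/*.nlp"}, sample_by="document")

block_size = 1024  # matches the input size mentioned below


def tokenize(batch):
    return tokenizer(batch["text"])


def group_texts(examples):
    # Concatenate all tokenized examples, then split into fixed-length blocks.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For a causal LM objective, the labels are a copy of the inputs;
    # the model shifts them internally when computing the loss.
    result["labels"] = result["input_ids"].copy()
    return result


tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])
lm_dataset = tokenized.map(group_texts, batched=True)
```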
Although the 7B- and 13B-parameter models are the smallest of the CodeLlama releases, they are still too large to train on a single 40 GB A100 (with bf16, an input size of 1024, a batch size of 1, the Adam optimizer, and gradient checkpointing) without some computational optimizations. Instead of implementing a more efficient adaptation method (for a good overview of these, see here), both models are fine-tuned in a multi-node, multi-GPU setup using the HF Trainer with DeepSpeed's Zero Redundancy Optimizer (you can find the implementation and papers here), with optimizer and parameter offloading.
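Continuing the sketch above, the corresponding Trainer setup looks roughly like the following. The output directory, learning rate, and epoch count are placeholders; the batch size, bf16, gradient checkpointing, and DeepSpeed config path mirror the setup described in this section, but the actual hyperparameters live in run_clm.py and are not reproduced here.

```python
# Sketch of wiring the HF Trainer to DeepSpeed ZeRO-3 with CPU offloading.
# Values marked "placeholder" are illustrative, not copied from run_clm.py.
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf")

training_args = TrainingArguments(
    output_dir="outputs/codellama-7b-nlppp",               # placeholder
    per_device_train_batch_size=1,
    gradient_checkpointing=True,
    bf16=True,
    learning_rate=2e-5,                                     # placeholder
    num_train_epochs=3,                                     # placeholder
    deepspeed="deepspeed_configs/llama_z3_offload.json",    # ZeRO-3 + offloading
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset["train"],  # from the preprocessing sketch above
)
trainer.train()
```

Launched with torchrun across the nodes (as run_multinode.sh does), the Trainer's DeepSpeed integration shards parameters, gradients, and optimizer states across ranks and offloads them to CPU according to the config, which is what makes full fine-tuning of these models fit in the available GPU memory.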
The final model weights can be found here. For examples of using these for code completion, take a look at the HF Code Autocompletion extension and Continue.
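For a quick local test of the fine-tuned weights outside of an editor extension, generation with transformers looks roughly like this (the checkpoint path and prompt below are placeholders):

```python
# Minimal inference sketch: load a fine-tuned checkpoint and complete a
# partial NLP++ snippet. The checkpoint path and prompt are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "path/to/codellama-7b-nlppp"  # placeholder for the released weights
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "@NODES _ROOT\n\n@RULES\n"  # placeholder partial NLP++ pass to complete
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```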
run_multinode.sh: PBS script that launches distributed training on a PBS cluster with torchrun, e.g. qsub run_multinode.sh.
run_clm.py: Training script, adapted from the HF script here.
utils/: Contains utilities for scraping and cleaning data.
scripts/: Misc data and testing scripts.
deepspeed_configs/: Various DeepSpeed configs. llama_z3_offload.json is the config that was ultimately used for fine-tuning (a sketch of the typical contents of such a config follows this list).
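For reference, a ZeRO stage-3 config with optimizer and parameter offloading typically contains fields like the following, shown here as a Python dict (which the HF Trainer also accepts directly via its deepspeed argument). The exact contents and values of llama_z3_offload.json may differ.

```python
# Sketch of a ZeRO-3 DeepSpeed config with CPU offloading of optimizer state
# and parameters. Field values are typical defaults, not copied from
# llama_z3_offload.json; "auto" lets the HF integration fill them in from
# the TrainingArguments.
ds_config = {
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}
```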