code-autocomplete is a code completion plugin for Python.
code-autocomplete can automatically complete lines and blocks of code with GPT2.
Guide
- GPT2-based code completion
- Code completion for Python; other languages are coming soon
- Line and block code completion
- Train (fine-tune GPT2) and predict with your own data
HuggingFace Demo: https://huggingface.co/spaces/shibing624/code-autocomplete
pip3 install torch # conda install pytorch
pip3 install -U code-autocomplete
or
git clone https://github.com/shibing624/code-autocomplete.git
cd code-autocomplete
python3 setup.py install
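After installation, a quick sanity check that the package can be imported (using the GPT2Coder class shown in the usage examples below):
python3 -c "from autocomplete.gpt2_coder import GPT2Coder; print('ok')"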
Models uploaded to HF's model hub:
- DistilGPT2-python: shibing624/code-autocomplete-distilgpt2-python (fine-tuned distilgpt2, model size: 319MB)
- GPT2-python: shibing624/code-autocomplete-gpt2-base (fine-tuned gpt2, model size: 487MB)
example: base_demo.py
from autocomplete.gpt2_coder import GPT2Coder
m = GPT2Coder("shibing624/code-autocomplete-gpt2-base")
print(m.generate('import torch.nn as')[0])
To use the distilgpt2 fine-tuned code autocomplete model, run the following code:
example: distilgpt2_demo.py
import sys
sys.path.append('..')
from autocomplete.gpt2_coder import GPT2Coder
m = GPT2Coder("shibing624/code-autocomplete-distilgpt2-python")
print(m.generate('import torch.nn as')[0])
output:
import torch.nn as nn
import torch.nn.functional as F
example: use_transformers_gpt2.py
Please use the GPT2-related classes (GPT2Tokenizer, GPT2LMHeadModel) to load this model:
import os
from transformers import GPT2Tokenizer, GPT2LMHeadModel
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"
tokenizer = GPT2Tokenizer.from_pretrained("shibing624/code-autocomplete-gpt2-base")
model = GPT2LMHeadModel.from_pretrained("shibing624/code-autocomplete-gpt2-base")
prompts = [
    "import numpy as np",
    "import torch.nn as",
    'parser.add_argument("--num_train_epochs",',
    "def set_seed(",
    "def factorial",
]
for prompt in prompts:
    input_ids = tokenizer(prompt, return_tensors='pt').input_ids
    outputs = model.generate(input_ids=input_ids,
                             max_length=64 + len(input_ids[0]),
                             temperature=1.0,
                             top_k=50,
                             top_p=0.95,
                             repetition_penalty=1.0,
                             do_sample=True,
                             num_return_sequences=1,
                             length_penalty=2.0,
                             early_stopping=True,
                             pad_token_id=tokenizer.eos_token_id,
                             eos_token_id=tokenizer.eos_token_id,
                             )
    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print("Input :", prompt)
    print("Output:", decoded)
    print("=" * 20)
output:
import numpy as np
====================
import torch.nn as nn
import torchvision.transforms as transforms
====================
parser.add_argument("--num_train_epochs", type=int, default=50, help="Number of training epochs.")
parser.add_argument("--batch_size", type=int, default=32, help="Batch size of validation/test data.")
====================
def set_seed(self):
====================
def factorial(n: int) -> int:
This allows you to customize dataset building. Below is an example of the building process.
Let's use Python code from Awesome-pytorch-list:
- We want the model to help auto-complete code at a general level, and the code from The Algorithms suits this need.
- The code in this project is well written (high-quality code).
dataset tree:
examples/download/python
├── train.txt
├── valid.txt
└── test.txt
There are three ways to build the dataset (a short sketch for writing out the files follows this list):
- Use the huggingface/datasets library to load the dataset from https://huggingface.co/datasets/shibing624/source_code
pip3 install datasets
from datasets import load_dataset
dataset = load_dataset("shibing624/source_code", "python") # python or java or cpp
print(dataset)
print(dataset['test'][0:10])
output:
DatasetDict({
train: Dataset({
features: ['text'],
num_rows: 5215412
})
validation: Dataset({
features: ['text'],
num_rows: 10000
})
test: Dataset({
features: ['text'],
num_rows: 10000
})
})
{'text': [
" {'max_epochs': [1, 2]},\n",
' refit=False,\n', ' cv=3,\n',
" scoring='roc_auc',\n", ' )\n',
' search.fit(*data)\n',
'',
' def test_module_output_not_1d(self, net_cls, data):\n',
' from skorch.toy import make_classifier\n',
' module = make_classifier(\n'
]}
- Download the dataset from the cloud
Name | Source | Download | Size |
---|---|---|---|
Python+Java+CPP source code | Awesome-pytorch-list (5.22 million lines) | github_source_code.zip | 105M |
Download the dataset, unzip it, and put it under examples/.
- Get source code from scratch and build dataset
cd examples
python prepare_data.py --num_repos 260
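Whichever of the three routes you take, the training script expects plain-text files laid out as in the tree above. The following is a minimal sketch, not part of the project, of writing the HuggingFace dataset splits into that layout; the output directory and the split-to-file mapping are assumptions based on the tree shown earlier:
import os
from datasets import load_dataset

# Assumed output directory, matching the dataset tree shown above
out_dir = "examples/download/python"
os.makedirs(out_dir, exist_ok=True)

dataset = load_dataset("shibing624/source_code", "python")
# Assumed mapping from dataset splits to the file names in the tree
split_files = {"train": "train.txt", "validation": "valid.txt", "test": "test.txt"}
for split, filename in split_files.items():
    with open(os.path.join(out_dir, filename), "w", encoding="utf-8") as f:
        for record in dataset[split]:
            text = record["text"]
            # Most records already end with "\n" (see the sample output above)
            f.write(text if text.endswith("\n") else text + "\n")
Note that the train split has over 5 million rows, so writing it out this way can take a while; for a quick experiment you can slice each split first.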
example: train_gpt2.py
cd examples
python train_gpt2.py --do_train --do_predict --num_epochs 15 --model_dir outputs-fine-tuned --model_name gpt2
PS: the resulting fine-tuned model is GPT2-python (shibing624/code-autocomplete-gpt2-base); I spent about 24 hours on a V100 GPU to fine-tune it.
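The project's training entry point is examples/train_gpt2.py (shown above). If you prefer to fine-tune with the plain transformers Trainer API instead, the sketch below shows one way to do it on the train.txt/valid.txt files; it is a simplified illustration rather than the project's own script, and the file paths, context length, and batch size are assumptions:
from datasets import load_dataset
from transformers import (DataCollatorForLanguageModeling, GPT2LMHeadModel,
                          GPT2TokenizerFast, Trainer, TrainingArguments)

# Assumed file locations, matching the dataset tree shown earlier
data_files = {"train": "examples/download/python/train.txt",
              "validation": "examples/download/python/valid.txt"}
raw = load_dataset("text", data_files=data_files)
raw = raw.filter(lambda x: x["text"].strip() != "")  # drop empty lines

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT2 has no pad token by default
model = GPT2LMHeadModel.from_pretrained("gpt2")

def tokenize(batch):
    # 128 tokens is an assumed context length for this sketch
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="outputs-fine-tuned",    # same directory name as in the command above
    num_train_epochs=15,                # matches --num_epochs 15
    per_device_train_batch_size=8,      # assumed batch size
)

trainer = Trainer(model=model, args=args, data_collator=collator,
                  train_dataset=tokenized["train"],
                  eval_dataset=tokenized["validation"])
trainer.train()
trainer.save_model("outputs-fine-tuned")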
Start the FastAPI server:
example: server.py
cd examples
python server.py
Open the URL: http://0.0.0.0:8001/docs
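Once the server is up, you can call it over HTTP. The snippet below is only a guess at the request shape: the endpoint path and parameter name are hypothetical, so check server.py or the /docs page for the actual API.
import requests

# Hypothetical endpoint path and query parameter; see http://0.0.0.0:8001/docs for the real API
resp = requests.get("http://0.0.0.0:8001/code_completion", params={"q": "import torch.nn as"})
print(resp.json())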
- Issues (suggestions):
- Email me: xuming: [email protected]
- WeChat me: add my WeChat ID xuming624, with the note "name-company-NLP", to join the NLP discussion group.
If you use code-autocomplete in your research, please cite it in the following format:
APA:
Xu, M. code-autocomplete: Code AutoComplete with GPT2 model (Version 0.0.4) [Computer software]. https://github.com/shibing624/code-autocomplete
BibTeX:
@software{Xu_code-autocomplete_Code_AutoComplete,
author = {Xu, Ming},
title = {code-autocomplete: Code AutoComplete with GPT2 model},
url = {https://github.com/shibing624/code-autocomplete},
version = {0.0.4}
}
The license is the Apache License 2.0, which allows free commercial use. Please include a link to code-autocomplete and the license in your product documentation.
The project code is still rough; if you have improvements, you are welcome to submit them back to this project. Before submitting, please note the following two points:
- Add corresponding unit tests in tests
- Run all unit tests with python setup.py test and make sure they all pass
After that, you can submit a PR.