inference run

waleko committed Sep 13, 2023
1 parent 72a6fba commit dc1a529
Showing 9 changed files with 242,929 additions and 19,083 deletions.
60,511 changes: 60,511 additions & 0 deletions data/JetBrains_kotlin_1000.finetuned_pred.csv
60,511 changes: 60,511 additions & 0 deletions data/JetBrains_kotlin_1000.hf_pred.csv
24,091 changes: 24,091 additions & 0 deletions data/microsoft_vscode_1000.finetuned_pred.csv
24,091 changes: 24,091 additions & 0 deletions data/microsoft_vscode_1000.hf_pred.csv
9,640 changes: 4,820 additions & 4,820 deletions data/msg-test.finetuned_pred.csv
28,320 changes: 14,160 additions & 14,160 deletions data/msg-test.hf_pred.csv
27,317 changes: 27,317 additions & 0 deletions data/transloadit_uppy_1000.finetuned_pred.csv
27,317 changes: 27,317 additions & 0 deletions data/transloadit_uppy_1000.hf_pred.csv

Large diffs are not rendered by default.

214 changes: 111 additions & 103 deletions notebooks/2_inference.ipynb
@@ -1,20 +1,21 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "b65e7acd52a4668e",
"metadata": {
"collapsed": false
},
"source": [
"# CodeReviewer Model Inference\n",
"\n",
"Let's generate code reviews using `microsoft/codereviewer` model {cite}`li2022codereviewer`."
],
"metadata": {
"collapsed": false
},
"id": "b65e7acd52a4668e"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 11,
"id": "initial_id",
"metadata": {
"collapsed": true
@@ -34,63 +35,41 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"source": [
"## 1 Tokenizers and Datasets\n",
"\n",
"P.S. Incredible thanks to the authors of {cite}`p4vv37_codebert_2023` for providing the code for working with the tokenizer and the dataset. "
],
"id": "4a4776ea7be212fc",
"metadata": {
"collapsed": false
},
"id": "4a4776ea7be212fc"
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"filename = \"../data/msg-test.csv\""
],
"metadata": {
"collapsed": false
},
"id": "cce0ce4281d436df"
"## 1 Tokenizers and Datasets\n",
"\n",
"P.S. Enormous thanks to the authors of {cite}`p4vv37_codebert_2023` for providing open-source for working with the tokenizer and the dataset. "
]
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"df = pd.read_csv(filename)\n",
"df['msg'].fillna('', inplace=True)\n",
"df['src_file'].fillna('', inplace=True)\n",
"df.head()"
],
"execution_count": 12,
"id": "ad4d16d13804be69",
"metadata": {
"collapsed": false
},
"id": "593e84ddf70822c"
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# download tokenizer from huggingface\n",
"tokenizer = AutoTokenizer.from_pretrained(\"microsoft/codereviewer\")\n",
"\n",
"# add required special tokens to the tokenizer\n",
"tokenizer = utils.process_tokenizer(tokenizer)"
],
"metadata": {
"collapsed": false
},
"id": "ad4d16d13804be69"
]
},
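
For context, the tokenizer preparation in the cell above can be sketched as follows. This is a minimal illustration, assuming the CodeReviewer input format uses special tokens such as <add>, <del>, <keep> and <msg>; the notebook's actual utils.process_tokenizer helper may differ.

    # Illustrative sketch only -- the notebook's utils.process_tokenizer may differ.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("microsoft/codereviewer")

    # Assumed special tokens marking diff lines and the review-message slot.
    tokenizer.add_special_tokens(
        {"additional_special_tokens": ["<add>", "<del>", "<keep>", "<msg>"]}
    )
    # If any of these were genuinely new to the vocabulary, the model's embedding
    # matrix would also need resizing: model.resize_token_embeddings(len(tokenizer)).
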
{
"cell_type": "code",
"execution_count": null,
"execution_count": 13,
"id": "cb003d6d8f578da1",
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"class ReviewsDataset(Dataset):\n",
@@ -104,71 +83,74 @@
" \n",
" def __getitem__(self,idx):\n",
" return self.x[idx], self.y[idx]"
],
"metadata": {
"collapsed": false
},
"id": "cb003d6d8f578da1"
]
},
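
The body of ReviewsDataset is partly collapsed in the diff view. As a rough idea of the encoding it performs (an assumption, not the notebook's exact code), each patch would be tokenized to a fixed source length, for example:

    # Hypothetical encoding helper; field names and the 512-token limit are assumptions.
    def encode_patch(tokenizer, patch: str, max_source_length: int = 512):
        enc = tokenizer(
            patch,
            max_length=max_source_length,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )
        return enc["input_ids"].squeeze(0)
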
{
"attachments": {},
"cell_type": "markdown",
"source": [
"## 2 Load data\n",
"Here we load the data and create a dataloader for each project."
],
"id": "b01969568fe53c90",
"metadata": {
"collapsed": false
},
"id": "b01969568fe53c90"
"source": [
"## 2 Load data\n",
"Here we load the data and create a dataloader for each project."
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 14,
"id": "d06f51b2150c61c4",
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"filenames = ['../data/msg-test.csv', 'JetBrains_kotlin_1000.csv', 'microsoft_vscode_1000.csv', 'transloadit_uppy_1000.csv']\n",
"filenames = ['../data/msg-test.csv', '../data/JetBrains_kotlin_1000.csv', '../data/microsoft_vscode_1000.csv', '../data/transloadit_uppy_1000.csv']\n",
"\n",
"datasets = []\n",
"dataloaders = []\n",
"for filename in filenames:\n",
" df = pd.read_csv(filename)\n",
" dataset = ReviewsDataset(df, tokenizer)\n",
" datasets.append(dataset)\n",
" dataloader = DataLoader(dataset, batch_size=4, shuffle=False) # batch_size=6 for 8GB GPU\n",
" dataloader = DataLoader(dataset, batch_size=16, shuffle=False) # batch_size=6 for 8GB GPU\n",
" dataloaders.append(dataloader)"
],
"metadata": {
"collapsed": false
},
"id": "d06f51b2150c61c4"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "1381eaca0f99dfc",
"metadata": {
"collapsed": false
},
"source": [
"## 3 Predict\n",
"\n",
"Now we can generate code reviews for each project. We will use two models:\n",
"- Pre-trained model from HuggingFace provided by the authors of {cite}`li2022codereviewer`\n",
"- Fine-tuned model on the CodeReviewer dataset"
],
"metadata": {
"collapsed": false
},
"id": "1381eaca0f99dfc"
]
},
{
"attachments": {},
"cell_type": "markdown",
"source": [
"### Predict function"
],
"id": "ef4c3e665f4be306",
"metadata": {
"collapsed": false
},
"id": "ef4c3e665f4be306"
"source": [
"### Predict function"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 15,
"id": "7a5b97449733bbc6",
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def predict(model, dataloader, device='cuda'):\n",
@@ -194,28 +176,40 @@
" ), 1, preds_np)\n",
" result += list(preds_decoded)\n",
" return result"
],
"metadata": {
"collapsed": false
},
"id": "7a5b97449733bbc6"
]
},
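
The middle of predict() is collapsed in the diff. The core of such a loop is typically a batched model.generate call followed by decoding; a minimal sketch, where the beam size and maximum generation length are assumptions rather than the notebook's exact settings:

    import torch
    from tqdm import tqdm

    def predict_sketch(model, dataloader, tokenizer, device="cuda"):
        """Rough equivalent of the notebook's predict(); generation settings are assumed."""
        model.to(device)
        model.eval()
        result = []
        with torch.no_grad():
            for inputs, _targets in tqdm(dataloader):
                preds = model.generate(
                    inputs.to(device),
                    max_length=128,      # assumed maximum review length
                    num_beams=5,         # assumed beam size
                    early_stopping=True,
                )
                result += tokenizer.batch_decode(preds, skip_special_tokens=True)
        return result
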
{
"attachments": {},
"cell_type": "markdown",
"id": "d84900c15fa4ffc3",
"metadata": {
"collapsed": false
},
"source": [
"### HuggingFace pre-trained checkpoint\n",
"\n",
"The model is available on the HuggingFace model hub: https://huggingface.co/microsoft/codereviewer"
],
"metadata": {
"collapsed": false
},
"id": "d84900c15fa4ffc3"
]
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"execution_count": 16,
"id": "c508661efcdcad40",
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 636/636 [10:03<00:00, 1.05it/s]\n",
"100%|██████████| 63/63 [02:42<00:00, 2.58s/it]\n",
"100%|██████████| 63/63 [01:46<00:00, 1.70s/it]\n",
"100%|██████████| 63/63 [01:19<00:00, 1.26s/it]\n"
]
}
],
"source": [
"# download the pretrained model from huggingface\n",
"hf_model = AutoModelForSeq2SeqLM.from_pretrained(\"microsoft/codereviewer\")\n",
@@ -225,37 +219,55 @@
" df_pred = pd.DataFrame({'code': dataset.code, 'target': dataset.y, 'prediction': preds})\n",
" df_pred.to_csv(Path(filename).with_suffix('.hf_pred.csv'))\n",
" df_pred.head()"
],
"metadata": {
"collapsed": false
},
"id": "c508661efcdcad40"
]
},
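
The CSVs written by this cell are the *.hf_pred.csv files added in this commit; they can be re-loaded later for inspection without re-running inference, for example:

    import pandas as pd

    # Columns match the DataFrame written above: 'code', 'target', 'prediction'.
    df_pred = pd.read_csv("../data/JetBrains_kotlin_1000.hf_pred.csv", index_col=0)
    df_pred[["target", "prediction"]].head()
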
{
"attachments": {},
"cell_type": "markdown",
"id": "e8e932357e193796",
"metadata": {
"collapsed": false
},
"source": [
"### Fine-tuned CodeReviewer\n",
"\n",
"I fine-tuned the model on the CodeReviewer dataset on the `msg` task using the [instructions](https://github.com/microsoft/CodeBERT/tree/master/CodeReviewer#3-finetuneinference) from the authors of {cite}`li2022codereviewer`.\n",
"\n",
"For the fine-tuning I used the following parameters:\n",
"- `batch_size=6`\n",
"- `batch_size=16`\n",
"- `learning_rate=3e-4`\n",
"- `max_source_length=512`\n",
"\n",
"The execution took about 12 hours on a single NVIDIA GeForce A100 GPU. The model was fine-tuned for 3 epochs.\n",
"\n",
"I have made the checkpoint available on the HuggingFace model hub: https://huggingface.co/waleko/codereviewer-finetuned-msg"
],
"metadata": {
"collapsed": false
},
"id": "e8e932357e193796"
]
},
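
The fine-tuning itself was done with the CodeReviewer repository's scripts linked above, not in this notebook. Purely as a hypothetical sketch of an equivalent setup with the Hugging Face Trainer API, using the hyperparameters listed (dataset preparation omitted):

    # Hypothetical sketch only -- the actual run used the CodeReviewer repo's fine-tuning scripts.
    from transformers import (
        AutoModelForSeq2SeqLM,
        Seq2SeqTrainer,
        Seq2SeqTrainingArguments,
    )

    model = AutoModelForSeq2SeqLM.from_pretrained("microsoft/codereviewer")

    args = Seq2SeqTrainingArguments(
        output_dir="codereviewer-finetuned-msg",
        per_device_train_batch_size=16,  # batch_size=16, as listed above
        learning_rate=3e-4,              # as listed above
        num_train_epochs=3,              # 3 epochs, as stated above
        predict_with_generate=True,
    )

    # train_dataset / data_collator would have to supply tokenized inputs
    # (max_source_length=512) and review messages as labels; omitted here.
    trainer = Seq2SeqTrainer(model=model, args=args)
    # trainer.train()
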
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"execution_count": 17,
"id": "851255e54c49484a",
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Downloading (…)lve/main/config.json: 100%|██████████| 2.13k/2.13k [00:00<00:00, 2.06MB/s]\n",
"Downloading pytorch_model.bin: 100%|██████████| 892M/892M [00:22<00:00, 40.2MB/s] \n",
"Some weights of the model checkpoint at waleko/codereviewer-finetuned-msg were not used when initializing T5ForConditionalGeneration: ['cls_head.weight', 'cls_head.bias']\n",
"- This IS expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
"- This IS NOT expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n",
"Downloading (…)neration_config.json: 100%|██████████| 168/168 [00:00<00:00, 40.1kB/s]\n",
"100%|██████████| 636/636 [14:22<00:00, 1.36s/it]\n",
"100%|██████████| 63/63 [01:31<00:00, 1.45s/it]\n",
"100%|██████████| 63/63 [01:24<00:00, 1.35s/it]\n",
"100%|██████████| 63/63 [01:21<00:00, 1.29s/it]\n"
]
}
],
"source": [
"# download the fine-tuned model\n",
"ft_model = AutoModelForSeq2SeqLM.from_pretrained(\"waleko/codereviewer-finetuned-msg\")\n",
@@ -265,11 +277,7 @@
" df_pred = pd.DataFrame({'code': dataset.code, 'target': dataset.y, 'prediction': preds})\n",
" df_pred.to_csv(Path(filename).with_suffix('.finetuned_pred.csv'))\n",
" df_pred.head()"
],
"metadata": {
"collapsed": false
},
"id": "851255e54c49484a"
]
}
],
"metadata": {
@@ -281,14 +289,14 @@
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
"pygments_lexer": "ipython3",
"version": "3.8.16"
}
},
"nbformat": 4,
