inference run

waleko committed Sep 13, 2023
1 parent 72a6fba commit dc1a529
Showing 9 changed files with 242,929 additions and 19,083 deletions.
60,511 changes: 60,511 additions & 0 deletions data/JetBrains_kotlin_1000.finetuned_pred.csv
60,511 changes: 60,511 additions & 0 deletions data/JetBrains_kotlin_1000.hf_pred.csv
24,091 changes: 24,091 additions & 0 deletions data/microsoft_vscode_1000.finetuned_pred.csv
24,091 changes: 24,091 additions & 0 deletions data/microsoft_vscode_1000.hf_pred.csv
9,640 changes: 4,820 additions & 4,820 deletions data/msg-test.finetuned_pred.csv
28,320 changes: 14,160 additions & 14,160 deletions data/msg-test.hf_pred.csv
27,317 changes: 27,317 additions & 0 deletions data/transloadit_uppy_1000.finetuned_pred.csv
27,317 changes: 27,317 additions & 0 deletions data/transloadit_uppy_1000.hf_pred.csv

Large diffs are not rendered by default.

214 changes: 111 additions & 103 deletions notebooks/2_inference.ipynb
@@ -1,20 +1,21 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "b65e7acd52a4668e",
"metadata": {
"collapsed": false
},
"source": [
"# CodeReviewer Model Inference\n",
"\n",
"Let's generate code reviews using `microsoft/codereviewer` model {cite}`li2022codereviewer`."
],
"metadata": {
"collapsed": false
},
"id": "b65e7acd52a4668e"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 11,
"id": "initial_id",
"metadata": {
"collapsed": true
@@ -34,63 +35,41 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"source": [
"## 1 Tokenizers and Datasets\n",
"\n",
"P.S. Incredible thanks to the authors of {cite}`p4vv37_codebert_2023` for providing the code for working with the tokenizer and the dataset. "
],
"id": "4a4776ea7be212fc",
"metadata": {
"collapsed": false
},
"id": "4a4776ea7be212fc"
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"filename = \"../data/msg-test.csv\""
],
"metadata": {
"collapsed": false
},
"id": "cce0ce4281d436df"
"## 1 Tokenizers and Datasets\n",
"\n",
"P.S. Enormous thanks to the authors of {cite}`p4vv37_codebert_2023` for providing open-source for working with the tokenizer and the dataset. "
]
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"df = pd.read_csv(filename)\n",
"df['msg'].fillna('', inplace=True)\n",
"df['src_file'].fillna('', inplace=True)\n",
"df.head()"
],
"execution_count": 12,
"id": "ad4d16d13804be69",
"metadata": {
"collapsed": false
},
"id": "593e84ddf70822c"
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# download tokenizer from huggingface\n",
"tokenizer = AutoTokenizer.from_pretrained(\"microsoft/codereviewer\")\n",
"\n",
"# add required special tokens to the tokenizer\n",
"tokenizer = utils.process_tokenizer(tokenizer)"
],
"metadata": {
"collapsed": false
},
"id": "ad4d16d13804be69"
]
},
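
For context, the tokenizer preparation in the cell above can be sketched as follows. This is a minimal illustration, assuming the CodeReviewer input format uses special tokens such as <add>, <del>, <keep> and <msg>; the notebook's actual utils.process_tokenizer helper may differ.

    # Illustrative sketch only -- the notebook's utils.process_tokenizer may differ.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("microsoft/codereviewer")

    # Assumed special tokens marking diff lines and the review-message slot.
    tokenizer.add_special_tokens(
        {"additional_special_tokens": ["<add>", "<del>", "<keep>", "<msg>"]}
    )
    # If any of these were genuinely new to the vocabulary, the model's embedding
    # matrix would also need resizing: model.resize_token_embeddings(len(tokenizer)).
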
{
"cell_type": "code",
"execution_count": null,
"execution_count": 13,
"id": "cb003d6d8f578da1",
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"class ReviewsDataset(Dataset):\n",
@@ -104,71 +83,74 @@
" \n",
" def __getitem__(self,idx):\n",
" return self.x[idx], self.y[idx]"
],
"metadata": {
"collapsed": false
},
"id": "cb003d6d8f578da1"
]
},
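
The body of ReviewsDataset is partly collapsed in the diff view. As a rough idea of the encoding it performs (an assumption, not the notebook's exact code), each patch would be tokenized to a fixed source length, for example:

    # Hypothetical encoding helper; field names and the 512-token limit are assumptions.
    def encode_patch(tokenizer, patch: str, max_source_length: int = 512):
        enc = tokenizer(
            patch,
            max_length=max_source_length,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )
        return enc["input_ids"].squeeze(0)
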
{
"attachments": {},
"cell_type": "markdown",
"source": [
"## 2 Load data\n",
"Here we load the data and create a dataloader for each project."
],
"id": "b01969568fe53c90",
"metadata": {
"collapsed": false
},
"id": "b01969568fe53c90"
"source": [
"## 2 Load data\n",
"Here we load the data and create a dataloader for each project."
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 14,
"id": "d06f51b2150c61c4",
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"filenames = ['../data/msg-test.csv', 'JetBrains_kotlin_1000.csv', 'microsoft_vscode_1000.csv', 'transloadit_uppy_1000.csv']\n",
"filenames = ['../data/msg-test.csv', '../data/JetBrains_kotlin_1000.csv', '../data/microsoft_vscode_1000.csv', '../data/transloadit_uppy_1000.csv']\n",
"\n",
"datasets = []\n",
"dataloaders = []\n",
"for filename in filenames:\n",
" df = pd.read_csv(filename)\n",
" dataset = ReviewsDataset(df, tokenizer)\n",
" datasets.append(dataset)\n",
" dataloader = DataLoader(dataset, batch_size=4, shuffle=False) # batch_size=6 for 8GB GPU\n",
" dataloader = DataLoader(dataset, batch_size=16, shuffle=False) # batch_size=6 for 8GB GPU\n",
" dataloaders.append(dataloader)"
],
"metadata": {
"collapsed": false
},
"id": "d06f51b2150c61c4"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "1381eaca0f99dfc",
"metadata": {
"collapsed": false
},
"source": [
"## 3 Predict\n",
"\n",
"Now we can generate code reviews for each project. We will use two models:\n",
"- Pre-trained model from HuggingFace provided by the authors of {cite}`li2022codereviewer`\n",
"- Fine-tuned model on the CodeReviewer dataset"
],
"metadata": {
"collapsed": false
},
"id": "1381eaca0f99dfc"
]
},
{
"attachments": {},
"cell_type": "markdown",
"source": [
"### Predict function"
],
"id": "ef4c3e665f4be306",
"metadata": {
"collapsed": false
},
"id": "ef4c3e665f4be306"
"source": [
"### Predict function"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 15,
"id": "7a5b97449733bbc6",
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def predict(model, dataloader, device='cuda'):\n",
@@ -194,28 +176,40 @@
" ), 1, preds_np)\n",
" result += list(preds_decoded)\n",
" return result"
],
"metadata": {
"collapsed": false
},
"id": "7a5b97449733bbc6"
]
},
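
The middle of predict() is collapsed in the diff. The core of such a loop is typically a batched model.generate call followed by decoding; a minimal sketch, where the beam size and maximum generation length are assumptions rather than the notebook's exact settings:

    import torch
    from tqdm import tqdm

    def predict_sketch(model, dataloader, tokenizer, device="cuda"):
        """Rough equivalent of the notebook's predict(); generation settings are assumed."""
        model.to(device)
        model.eval()
        result = []
        with torch.no_grad():
            for inputs, _targets in tqdm(dataloader):
                preds = model.generate(
                    inputs.to(device),
                    max_length=128,      # assumed maximum review length
                    num_beams=5,         # assumed beam size
                    early_stopping=True,
                )
                result += tokenizer.batch_decode(preds, skip_special_tokens=True)
        return result
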
{
"attachments": {},
"cell_type": "markdown",
"id": "d84900c15fa4ffc3",
"metadata": {
"collapsed": false
},
"source": [
"### HuggingFace pre-trained checkpoint\n",
"\n",
"The model is available on the HuggingFace model hub: https://huggingface.co/microsoft/codereviewer"
],
"metadata": {
"collapsed": false
},
"id": "d84900c15fa4ffc3"
]
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"execution_count": 16,
"id": "c508661efcdcad40",
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 636/636 [10:03<00:00, 1.05it/s]\n",
"100%|██████████| 63/63 [02:42<00:00, 2.58s/it]\n",
"100%|██████████| 63/63 [01:46<00:00, 1.70s/it]\n",
"100%|██████████| 63/63 [01:19<00:00, 1.26s/it]\n"
]
}
],
"source": [
"# download the pretrained model from huggingface\n",
"hf_model = AutoModelForSeq2SeqLM.from_pretrained(\"microsoft/codereviewer\")\n",
@@ -225,37 +219,55 @@
" df_pred = pd.DataFrame({'code': dataset.code, 'target': dataset.y, 'prediction': preds})\n",
" df_pred.to_csv(Path(filename).with_suffix('.hf_pred.csv'))\n",
" df_pred.head()"
],
"metadata": {
"collapsed": false
},
"id": "c508661efcdcad40"
]
},
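
The CSVs written by this cell are the *.hf_pred.csv files added in this commit; they can be re-loaded later for inspection without re-running inference, for example:

    import pandas as pd

    # Columns match the DataFrame written above: 'code', 'target', 'prediction'.
    df_pred = pd.read_csv("../data/JetBrains_kotlin_1000.hf_pred.csv", index_col=0)
    df_pred[["target", "prediction"]].head()
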
{
"attachments": {},
"cell_type": "markdown",
"id": "e8e932357e193796",
"metadata": {
"collapsed": false
},
"source": [
"### Fine-tuned CodeReviewer\n",
"\n",
"I fine-tuned the model on the CodeReviewer dataset on the `msg` task using the [instructions](https://github.com/microsoft/CodeBERT/tree/master/CodeReviewer#3-finetuneinference) from the authors of {cite}`li2022codereviewer`.\n",
"\n",
"For the fine-tuning I used the following parameters:\n",
"- `batch_size=6`\n",
"- `batch_size=16`\n",
"- `learning_rate=3e-4`\n",
"- `max_source_length=512`\n",
"\n",
"The execution took about 12 hours on a single NVIDIA GeForce A100 GPU. The model was fine-tuned for 3 epochs.\n",
"\n",
"I have made the checkpoint available on the HuggingFace model hub: https://huggingface.co/waleko/codereviewer-finetuned-msg"
],
"metadata": {
"collapsed": false
},
"id": "e8e932357e193796"
]
},
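
The fine-tuning itself was done with the CodeReviewer repository's scripts linked above, not in this notebook. Purely as a hypothetical sketch of an equivalent setup with the Hugging Face Trainer API, using the hyperparameters listed (dataset preparation omitted):

    # Hypothetical sketch only -- the actual run used the CodeReviewer repo's fine-tuning scripts.
    from transformers import (
        AutoModelForSeq2SeqLM,
        Seq2SeqTrainer,
        Seq2SeqTrainingArguments,
    )

    model = AutoModelForSeq2SeqLM.from_pretrained("microsoft/codereviewer")

    args = Seq2SeqTrainingArguments(
        output_dir="codereviewer-finetuned-msg",
        per_device_train_batch_size=16,  # batch_size=16, as listed above
        learning_rate=3e-4,              # as listed above
        num_train_epochs=3,              # 3 epochs, as stated above
        predict_with_generate=True,
    )

    # train_dataset / data_collator would have to supply tokenized inputs
    # (max_source_length=512) and review messages as labels; omitted here.
    trainer = Seq2SeqTrainer(model=model, args=args)
    # trainer.train()
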
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"execution_count": 17,
"id": "851255e54c49484a",
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Downloading (…)lve/main/config.json: 100%|██████████| 2.13k/2.13k [00:00<00:00, 2.06MB/s]\n",
"Downloading pytorch_model.bin: 100%|██████████| 892M/892M [00:22<00:00, 40.2MB/s] \n",
"Some weights of the model checkpoint at waleko/codereviewer-finetuned-msg were not used when initializing T5ForConditionalGeneration: ['cls_head.weight', 'cls_head.bias']\n",
"- This IS expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
"- This IS NOT expected if you are initializing T5ForConditionalGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n",
"Downloading (…)neration_config.json: 100%|██████████| 168/168 [00:00<00:00, 40.1kB/s]\n",
"100%|██████████| 636/636 [14:22<00:00, 1.36s/it]\n",
"100%|██████████| 63/63 [01:31<00:00, 1.45s/it]\n",
"100%|██████████| 63/63 [01:24<00:00, 1.35s/it]\n",
"100%|██████████| 63/63 [01:21<00:00, 1.29s/it]\n"
]
}
],
"source": [
"# download the fine-tuned model\n",
"ft_model = AutoModelForSeq2SeqLM.from_pretrained(\"waleko/codereviewer-finetuned-msg\")\n",
@@ -265,11 +277,7 @@
" df_pred = pd.DataFrame({'code': dataset.code, 'target': dataset.y, 'prediction': preds})\n",
" df_pred.to_csv(Path(filename).with_suffix('.finetuned_pred.csv'))\n",
" df_pred.head()"
],
"metadata": {
"collapsed": false
},
"id": "851255e54c49484a"
]
}
],
"metadata": {
@@ -281,14 +289,14 @@
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
"pygments_lexer": "ipython3",
"version": "3.8.16"
}
},
"nbformat": 4,
