diff --git a/notebooks/1_collect_reviews.ipynb b/notebooks/1_collect_reviews.ipynb index ee0f08e..96ba53e 100644 --- a/notebooks/1_collect_reviews.ipynb +++ b/notebooks/1_collect_reviews.ipynb @@ -204,6 +204,16 @@ } }, "id": "869d9413b6ad101" + }, + { + "cell_type": "markdown", + "source": [ + "Additionally, we will use the test data from {cite}`li2022codereviewer`, taken from their [dataset on Zenodo](https://zenodo.org/record/6900648/preview/Comment_Generation.zip). This dataset is available at `data/msg-test.csv`." + ], + "metadata": { + "collapsed": false + }, + "id": "d3ff1fa6cb65fce7" } ], "metadata": { diff --git a/notebooks/2_inference.ipynb b/notebooks/2_inference.ipynb index 1e9774b..dad5968 100644 --- a/notebooks/2_inference.ipynb +++ b/notebooks/2_inference.ipynb @@ -5,7 +5,7 @@ "source": [ "# CodeReviewer Model Inference\n", "\n", - "Let's generate code reviews using `microsoft/codereviewer` model {cite}`li2022codereviewer`" + "Let's generate code reviews using the `microsoft/codereviewer` model {cite}`li2022codereviewer`." ], "metadata": { "collapsed": false @@ -36,7 +36,9 @@ { "cell_type": "markdown", "source": [ - "## 1 Load data " + "## 1 Tokenizers and Datasets\n", + "\n", + "P.S. Many thanks to the authors of {cite}`p4vv37_codebert_2023` for providing the code for working with the tokenizer and the dataset. 
" ], "metadata": { "collapsed": false @@ -48,7 +50,7 @@ "execution_count": null, "outputs": [], "source": [ - "filename = \"../data/JetBrains_kotlin_100.csv\"" + "filename = \"../data/msg-test.csv\"" ], "metadata": { "collapsed": false @@ -61,6 +63,8 @@ "outputs": [], "source": [ "df = pd.read_csv(filename)\n", + "df['msg'] = df['msg'].fillna('')\n", + "df['src_file'] = df['src_file'].fillna('')\n", "df.head()" ], "metadata": { @@ -93,31 +97,74 @@ " def __init__(self, df: pd.DataFrame, tokenizer):\n", " self.y = df[\"human_review\"]\n", " self.code = df[\"diff_hunk\"]\n", - " self.x = torch.tensor(df.apply(lambda row: utils.encode_diff(tokenizer, row[\"diff_hunk\"], row[\"msg\"], row[\"src_file\"]), axis=1), dtype=torch.long).cpu()\n", + " self.x = torch.tensor(df.apply(lambda row: utils.encode_diff(tokenizer, row[\"diff_hunk\"], '', ''), axis=1), dtype=torch.long).cpu()\n", " \n", " def __len__(self):\n", " return len(self.y)\n", " \n", " def __getitem__(self,idx):\n", - " return self.x[idx], self.y[idx]\n" + " return self.x[idx], self.y[idx]" ], "metadata": { "collapsed": false }, "id": "cb003d6d8f578da1" }, + { + "cell_type": "markdown", + "source": [ + "## 2 Load data\n", + "Here we load the data and create a dataloader for each project." 
+ ], + "metadata": { + "collapsed": false + }, + "id": "b01969568fe53c90" + }, { "cell_type": "code", "execution_count": null, "outputs": [], "source": [ - "dataset = ReviewsDataset(df, tokenizer)\n", - "dataloader = DataLoader(dataset, batch_size=4)" + "filenames = ['../data/msg-test.csv', 'JetBrains_kotlin_1000.csv', 'microsoft_vscode_1000.csv', 'transloadit_uppy_1000.csv']\n", + "\n", + "datasets = []\n", + "dataloaders = []\n", + "for filename in filenames:\n", + " df = pd.read_csv(filename)\n", + " dataset = ReviewsDataset(df, tokenizer)\n", + " datasets.append(dataset)\n", + " dataloader = DataLoader(dataset, batch_size=4, shuffle=False) # batch_size=6 for 8GB GPU\n", + " dataloaders.append(dataloader)" + ], + "metadata": { + "collapsed": false + }, + "id": "d06f51b2150c61c4" + }, + { + "cell_type": "markdown", + "source": [ + "## 3 Predict\n", + "\n", + "Now we can generate code reviews for each project. We will use two models:\n", + "- The pre-trained model from HuggingFace provided by the authors of {cite}`li2022codereviewer`\n", + "- A model fine-tuned on the CodeReviewer dataset" + ], + "metadata": { + "collapsed": false + }, + "id": "1381eaca0f99dfc" + }, + { + "cell_type": "markdown", + "source": [ + "### Predict function" ], "metadata": { "collapsed": false }, - "id": "e39dbbb045d46e1a" + "id": "ef4c3e665f4be306" }, { "cell_type": "code", @@ -151,22 +198,14 @@ "metadata": { "collapsed": false }, - "id": "d06f51b2150c61c4" - }, - { - "cell_type": "markdown", - "source": [ - "## 2 Predict" - ], - "metadata": { - "collapsed": false - }, - "id": "1381eaca0f99dfc" + "id": "7a5b97449733bbc6" }, { "cell_type": "markdown", "source": [ - "### HuggingFace pre-trained checkpoint" + "### HuggingFace pre-trained checkpoint\n", + "\n", + "The model is available on the HuggingFace model hub: https://huggingface.co/microsoft/codereviewer" ], "metadata": { "collapsed": false @@ -179,7 +218,13 @@ "outputs": [], "source": [ "# download the pretrained model from 
huggingface\n", - "hf_model = AutoModelForSeq2SeqLM.from_pretrained(\"microsoft/codereviewer\")" + "hf_model = AutoModelForSeq2SeqLM.from_pretrained(\"microsoft/codereviewer\")\n", + "\n", + "for filename, dataset, dataloader in zip(filenames, datasets, dataloaders):\n", + " preds = predict(hf_model, dataloader)\n", + " df_pred = pd.DataFrame({'code': dataset.code, 'target': dataset.y, 'prediction': preds})\n", + " df_pred.to_csv(Path(filename).with_suffix('.hf_pred.csv'))\n", + " df_pred.head()" ], "metadata": { "collapsed": false @@ -187,62 +232,44 @@ "id": "c508661efcdcad40" }, { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "preds = predict(hf_model, dataloader)" - ], - "metadata": { - "collapsed": false - }, - "id": "34cc9076a5720e17" - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], - "source": [ - "df_pred = pd.DataFrame({'code': dataset.code, 'target': dataset.y, 'prediction': preds})" - ], - "metadata": { - "collapsed": false - }, - "id": "4c8e9b7782d9e4ff" - }, - { - "cell_type": "code", - "execution_count": null, - "outputs": [], + "cell_type": "markdown", "source": [ - "df_pred.head()" + "### Fine-tuned CodeReviewer\n", + "\n", + "I fine-tuned the model on the CodeReviewer dataset for the `msg` task using the [instructions](https://github.com/microsoft/CodeBERT/tree/master/CodeReviewer#3-finetuneinference) from the authors of {cite}`li2022codereviewer`.\n", + "\n", + "For the fine-tuning I used the following parameters:\n", + "- `batch_size=6`\n", + "- `learning_rate=3e-4`\n", + "- `max_source_length=512`\n", + "\n", + "The execution took about 12 hours on a single NVIDIA A100 GPU. 
The model was fine-tuned for 3 epochs.\n", + "\n", + "I have made the checkpoint available on the HuggingFace model hub: https://huggingface.co/waleko/codereviewer-finetuned-msg" ], "metadata": { "collapsed": false }, - "id": "30f151dcec1d5c33" + "id": "e8e932357e193796" }, { "cell_type": "code", "execution_count": null, "outputs": [], "source": [ - "df_pred.to_csv(Path(filename).with_suffix('.hf_pred.csv'))" - ], - "metadata": { - "collapsed": false - }, - "id": "a72786a5b8953655" - }, - { - "cell_type": "markdown", - "source": [ - "### Fine-tuned CodeReviewer" + "# download the fine-tuned model\n", + "ft_model = AutoModelForSeq2SeqLM.from_pretrained(\"waleko/codereviewer-finetuned-msg\")\n", + "\n", + "for filename, dataset, dataloader in zip(filenames, datasets, dataloaders):\n", + " preds = predict(ft_model, dataloader)\n", + " df_pred = pd.DataFrame({'code': dataset.code, 'target': dataset.y, 'prediction': preds})\n", + " df_pred.to_csv(Path(filename).with_suffix('.finetuned_pred.csv'))\n", + " df_pred.head()" ], "metadata": { "collapsed": false }, - "id": "e8e932357e193796" + "id": "851255e54c49484a" } ], "metadata": { diff --git a/references.bib b/references.bib index 1a1de1e..bf785d9 100644 --- a/references.bib +++ b/references.bib @@ -17,3 +17,9 @@ @inproceedings{post-2018-call pages = "186--191", } +@misc{p4vv37_codebert_2023, + title = {{CodeBERT} {CodeReviewer} - a {Hugging} {Face} {Space} by p4vv37}, + url = {https://huggingface.co/spaces/p4vv37/CodeBERT_CodeReviewer}, + abstract = {An interface for running “Microsoft CodeBERT CodeReviewer: Pre-Training for Automating Code Review Activities.” (microsoft/codereviewer) on GitHub commits}, + urldate = {2023-09-13}, +}