
Commit

…ormance into main
waleko committed Sep 13, 2023
2 parents 609f437 + c1cce38 commit 72a6fba
Showing 3 changed files with 105 additions and 62 deletions.
10 changes: 10 additions & 0 deletions notebooks/1_collect_reviews.ipynb
@@ -204,6 +204,16 @@
}
},
"id": "869d9413b6ad101"
},
{
"cell_type": "markdown",
"source": [
"Additionally, we will be using the test data from {cite}`li2022codereviewer` and their [dataset on zenodo](https://zenodo.org/record/6900648/preview/Comment_Generation.zip). This dataset is available at `data/msg-test.csv`."
],
"metadata": {
"collapsed": false
},
"id": "d3ff1fa6cb65fce7"
}
],
"metadata": {
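The first notebook pulls its test split from the Zenodo archive referenced above. As a rough sketch only (the direct-download URL and the conversion to `data/msg-test.csv` are assumptions, neither is shown in this commit), fetching the archive might look like:

```python
# Sketch: download and unpack the Comment_Generation archive from Zenodo.
# The files/ URL is an assumed direct-download variant of the preview link
# above; the commit itself only states the result lives at data/msg-test.csv.
import urllib.request
import zipfile

url = "https://zenodo.org/record/6900648/files/Comment_Generation.zip"
urllib.request.urlretrieve(url, "Comment_Generation.zip")

with zipfile.ZipFile("Comment_Generation.zip") as zf:
    zf.extractall("data/")
```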
151 changes: 89 additions & 62 deletions notebooks/2_inference.ipynb
@@ -5,7 +5,7 @@
"source": [
"# CodeReviewer Model Inference\n",
"\n",
"Let's generate code reviews using `microsoft/codereviewer` model {cite}`li2022codereviewer`"
"Let's generate code reviews using `microsoft/codereviewer` model {cite}`li2022codereviewer`."
],
"metadata": {
"collapsed": false
@@ -36,7 +36,9 @@
{
"cell_type": "markdown",
"source": [
"## 1 Load data "
"## 1 Tokenizers and Datasets\n",
"\n",
"P.S. Incredible thanks to the authors of {cite}`p4vv37_codebert_2023` for providing the code for working with the tokenizer and the dataset. "
],
"metadata": {
"collapsed": false
@@ -48,7 +50,7 @@
"execution_count": null,
"outputs": [],
"source": [
"filename = \"../data/JetBrains_kotlin_100.csv\""
"filename = \"../data/msg-test.csv\""
],
"metadata": {
"collapsed": false
@@ -61,6 +63,8 @@
"outputs": [],
"source": [
"df = pd.read_csv(filename)\n",
"df['msg'].fillna('', inplace=True)\n",
"df['src_file'].fillna('', inplace=True)\n",
"df.head()"
],
"metadata": {
@@ -93,31 +97,74 @@
" def __init__(self, df: pd.DataFrame, tokenizer):\n",
" self.y = df[\"human_review\"]\n",
" self.code = df[\"diff_hunk\"]\n",
" self.x = torch.tensor(df.apply(lambda row: utils.encode_diff(tokenizer, row[\"diff_hunk\"], row[\"msg\"], row[\"src_file\"]), axis=1), dtype=torch.long).cpu()\n",
" self.x = torch.tensor(df.apply(lambda row: utils.encode_diff(tokenizer, row[\"diff_hunk\"], '', ''), axis=1), dtype=torch.long).cpu()\n",
" \n",
" def __len__(self):\n",
" return len(self.y)\n",
" \n",
" def __getitem__(self,idx):\n",
" return self.x[idx], self.y[idx]\n"
" return self.x[idx], self.y[idx]"
],
"metadata": {
"collapsed": false
},
"id": "cb003d6d8f578da1"
},
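`utils.encode_diff` is imported from the {cite}`p4vv37_codebert_2023` code and not shown in this diff. A hypothetical sketch of what such a helper does, based on how CodeReviewer inputs are usually built (the exact marker handling and padding are assumptions):

```python
# Hypothetical sketch of utils.encode_diff -- not the actual implementation.
# Marks each diff line as added/deleted/kept, prefixes the message, then
# truncates and pads to a fixed length so rows can be stacked into a tensor.
def encode_diff(tokenizer, diff: str, msg: str, source: str, max_len: int = 512):
    marked = []
    for line in diff.split("\n"):
        if line.startswith("+"):
            marked.append("<add>" + line[1:])
        elif line.startswith("-"):
            marked.append("<del>" + line[1:])
        else:
            marked.append("<keep>" + line)
    text = "<msg>" + msg + "".join(marked)
    ids = tokenizer.encode(text, truncation=True, max_length=max_len)
    return ids + [tokenizer.pad_token_id] * (max_len - len(ids))
```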
{
"cell_type": "markdown",
"source": [
"## 2 Load data\n",
"Here we load the data and create a dataloader for each project."
],
"metadata": {
"collapsed": false
},
"id": "b01969568fe53c90"
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"dataset = ReviewsDataset(df, tokenizer)\n",
"dataloader = DataLoader(dataset, batch_size=4)"
"filenames = ['../data/msg-test.csv', 'JetBrains_kotlin_1000.csv', 'microsoft_vscode_1000.csv', 'transloadit_uppy_1000.csv']\n",
"\n",
"datasets = []\n",
"dataloaders = []\n",
"for filename in filenames:\n",
" df = pd.read_csv(filename)\n",
" dataset = ReviewsDataset(df, tokenizer)\n",
" datasets.append(dataset)\n",
" dataloader = DataLoader(dataset, batch_size=4, shuffle=False) # batch_size=6 for 8GB GPU\n",
" dataloaders.append(dataloader)"
],
"metadata": {
"collapsed": false
},
"id": "d06f51b2150c61c4"
},
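A quick sanity check on one batch (shapes follow from the code above: four fixed-length rows of token ids per batch; the 512 length is an assumption based on the fine-tuning settings below):

```python
# Peek at the first batch of the first project's dataloader.
x, y = next(iter(dataloaders[0]))
print(x.shape)  # e.g. torch.Size([4, 512]): 4 encoded diff hunks
print(y[:2])    # the matching human-written reviews
```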
{
"cell_type": "markdown",
"source": [
"## 3 Predict\n",
"\n",
"Now we can generate code reviews for each project. We will use two models:\n",
"- Pre-trained model from HuggingFace provided by the authors of {cite}`li2022codereviewer`\n",
"- Fine-tuned model on the CodeReviewer dataset"
],
"metadata": {
"collapsed": false
},
"id": "1381eaca0f99dfc"
},
{
"cell_type": "markdown",
"source": [
"### Predict function"
],
"metadata": {
"collapsed": false
},
"id": "e39dbbb045d46e1a"
"id": "ef4c3e665f4be306"
},
{
"cell_type": "code",
@@ -151,22 +198,14 @@
"metadata": {
"collapsed": false
},
"id": "d06f51b2150c61c4"
},
{
"cell_type": "markdown",
"source": [
"## 2 Predict"
],
"metadata": {
"collapsed": false
},
"id": "1381eaca0f99dfc"
"id": "7a5b97449733bbc6"
},
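The body of `predict` is collapsed in this view. A hedged sketch of what a batched generate-and-decode loop for a seq2seq model like this might look like (device handling, beam size, and length cap are assumptions):

```python
import torch

def predict(model, dataloader):
    # Sketch only -- the real cell is collapsed above. Generates a review for
    # each encoded diff and decodes it with the tokenizer loaded earlier.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    model.eval()
    preds = []
    with torch.no_grad():
        for x, _ in dataloader:
            out = model.generate(
                x.to(device),
                max_length=128,  # assumed cap on generated review length
                num_beams=5,     # assumed beam size
                early_stopping=True,
            )
            preds.extend(tokenizer.batch_decode(out, skip_special_tokens=True))
    return preds
```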
{
"cell_type": "markdown",
"source": [
"### HuggingFace pre-trained checkpoint"
"### HuggingFace pre-trained checkpoint\n",
"\n",
"The model is available on the HuggingFace model hub: https://huggingface.co/microsoft/codereviewer"
],
"metadata": {
"collapsed": false
@@ -179,70 +218,58 @@
"outputs": [],
"source": [
"# download the pretrained model from huggingface\n",
"hf_model = AutoModelForSeq2SeqLM.from_pretrained(\"microsoft/codereviewer\")"
"hf_model = AutoModelForSeq2SeqLM.from_pretrained(\"microsoft/codereviewer\")\n",
"\n",
"for filename, dataset, dataloader in zip(filenames, datasets, dataloaders):\n",
" preds = predict(hf_model, dataloader)\n",
" df_pred = pd.DataFrame({'code': dataset.code, 'target': dataset.y, 'prediction': preds})\n",
" df_pred.to_csv(Path(filename).with_suffix('.hf_pred.csv'))\n",
" df_pred.head()"
],
"metadata": {
"collapsed": false
},
"id": "c508661efcdcad40"
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"preds = predict(hf_model, dataloader)"
],
"metadata": {
"collapsed": false
},
"id": "34cc9076a5720e17"
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"df_pred = pd.DataFrame({'code': dataset.code, 'target': dataset.y, 'prediction': preds})"
],
"metadata": {
"collapsed": false
},
"id": "4c8e9b7782d9e4ff"
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"cell_type": "markdown",
"source": [
"df_pred.head()"
"### Fine-tuned CodeReviewer\n",
"\n",
"I fine-tuned the model on the CodeReviewer dataset on the `msg` task using the [instructions](https://github.com/microsoft/CodeBERT/tree/master/CodeReviewer#3-finetuneinference) from the authors of {cite}`li2022codereviewer`.\n",
"\n",
"For the fine-tuning I used the following parameters:\n",
"- `batch_size=6`\n",
"- `learning_rate=3e-4`\n",
"- `max_source_length=512`\n",
"\n",
"The execution took about 12 hours on a single NVIDIA GeForce A100 GPU. The model was fine-tuned for 3 epochs.\n",
"\n",
"I have made the checkpoint available on the HuggingFace model hub: https://huggingface.co/waleko/codereviewer-finetuned-msg"
],
"metadata": {
"collapsed": false
},
"id": "30f151dcec1d5c33"
"id": "e8e932357e193796"
},
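The actual fine-tuning used the scripts from the CodeReviewer repository linked above; the loop below is an illustrative sketch only, wiring up the stated hyperparameters (the target-length cap of 128 is an assumption):

```python
# Illustrative sketch of one fine-tuning step with the hyperparameters above
# (batch_size=6, learning_rate=3e-4, max_source_length=512). Not the actual
# training script from the CodeReviewer repository.
import torch
from torch.optim import AdamW
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("microsoft/codereviewer")
tokenizer = AutoTokenizer.from_pretrained("microsoft/codereviewer")
optimizer = AdamW(model.parameters(), lr=3e-4)

def training_step(batch_diffs: list[str], batch_reviews: list[str]) -> float:
    inputs = tokenizer(batch_diffs, max_length=512, truncation=True,
                       padding=True, return_tensors="pt")
    labels = tokenizer(batch_reviews, max_length=128, truncation=True,
                       padding=True, return_tensors="pt").input_ids
    labels[labels == tokenizer.pad_token_id] = -100  # mask padding in the loss
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```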
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"df_pred.to_csv(Path(filename).with_suffix('.hf_pred.csv'))"
],
"metadata": {
"collapsed": false
},
"id": "a72786a5b8953655"
},
{
"cell_type": "markdown",
"source": [
"### Fine-tuned CodeReviewer"
"# download the fine-tuned model\n",
"ft_model = AutoModelForSeq2SeqLM.from_pretrained(\"waleko/codereviewer-finetuned-msg\")\n",
"\n",
"for filename, dataset, dataloader in zip(filenames, datasets, dataloaders):\n",
" preds = predict(ft_model, dataloader)\n",
" df_pred = pd.DataFrame({'code': dataset.code, 'target': dataset.y, 'prediction': preds})\n",
" df_pred.to_csv(Path(filename).with_suffix('.finetuned_pred.csv'))\n",
" df_pred.head()"
],
"metadata": {
"collapsed": false
},
"id": "e8e932357e193796"
"id": "851255e54c49484a"
}
],
"metadata": {
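The two prediction files written per project invite a direct comparison against the human reviews. As a follow-up sketch (not part of this commit), both checkpoints could be scored with sacreBLEU {cite}`post-2018-call`:

```python
# Sketch: score both models' predictions against the human references.
# Assumes sacrebleu is installed and the CSVs are the ones written above.
import pandas as pd
from sacrebleu import corpus_bleu

for pred_file in ["../data/msg-test.hf_pred.csv",
                  "../data/msg-test.finetuned_pred.csv"]:
    df_pred = pd.read_csv(pred_file)
    hyps = df_pred["prediction"].fillna("").astype(str).tolist()
    refs = [df_pred["target"].fillna("").astype(str).tolist()]
    print(pred_file, f"BLEU = {corpus_bleu(hyps, refs).score:.2f}")
```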
6 changes: 6 additions & 0 deletions references.bib
@@ -17,3 +17,9 @@ @inproceedings{post-2018-call
pages = "186--191",
}

@misc{p4vv37_codebert_2023,
title = {{CodeBERT} {CodeReviewer} - a {Hugging} {Face} {Space} by p4vv37},
url = {https://huggingface.co/spaces/p4vv37/CodeBERT_CodeReviewer},
abstract = {An interface for running “Microsoft CodeBERT CodeReviewer: Pre-Training for Automating Code Review Activities.” (microsoft/codereviewer) on GitHub commits},
urldate = {2023-09-13},
}
