
--------

## Installation

```bash
pip3 install -r requirements.txt
```

## Dataset

#### Training dataset
The training dataset is based on `saier/unarxive_citrec` [hf](https://huggingface.co/datasets/saier/unarxive_citrec).

*Details*:
- Train size: 9082
- Valid size: 702
- Test size: 568

All samples are between `128` and `512` characters long (TODO: characters -> tokens).
More details in `notebooks/data/dataset_download.ipynb`.
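
A minimal sketch of the download-and-filter step, assuming the standard `datasets` loader and a `text` column (the actual column names and filtering logic live in the notebook above):

```python
from datasets import load_dataset

# Load the citation-recommendation subset of unarXive from the Hugging Face Hub.
dataset = load_dataset("saier/unarxive_citrec")

# Keep only samples whose text falls into the 128-512 character range.
# The "text" column name is an assumption; see dataset_download.ipynb for the real one.
def in_length_range(example, lo=128, hi=512):
    return lo <= len(example["text"]) <= hi

dataset = dataset.filter(in_length_range)
print({split: len(dataset[split]) for split in dataset})
```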

After collecting the dataset, we translated the samples from English to Russian using the OpenAI API. Details in `notebooks/data/dataset_translate.ipynb`.
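
A minimal sketch of the translation step, assuming the chat completions API and the `gpt-3.5-turbo` model (the actual prompt, model, and batching are defined in the notebook):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def translate_to_russian(text: str, model: str = "gpt-3.5-turbo") -> str:
    """Translate a single English sample into Russian via the chat completions API."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Translate the following text from English to Russian."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content

# Example usage on one training sample:
# russian_text = translate_to_russian(dataset["train"][0]["text"])
```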

#### Dataset for model comparison (EvalDataset)
This dataset is based on `turkic_xwmt` (`subset=ru-en`, `split=test`) [hf](https://huggingface.co/datasets/turkic_xwmt).

Dataset size: 1000
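
A minimal sketch of loading the evaluation set, assuming the `ru-en` configuration can be passed directly to the standard `datasets` loader:

```python
from datasets import load_dataset

# Load the ru-en test split and keep the first 1000 samples.
eval_dataset = load_dataset("turkic_xwmt", "ru-en", split="test")
eval_dataset = eval_dataset.select(range(min(1000, len(eval_dataset))))
```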

## Model comparison

The model comparison is based on the BLEU score between each model's translations and the reference translations produced with OpenAI.

*Models*:
- transformer-en-ru: `Helsinki-NLP/opus-mt-en-ru` [hf](https://huggingface.co/Helsinki-NLP/opus-mt-en-ru)
- nllb-1.3B-distilled: `facebook/nllb-200-distilled-1.3B` [hf](https://huggingface.co/facebook/nllb-200-distilled-1.3B)

**Results**:
- transformer-en-ru BLEU: 2.58
- nllb-1.3B-distilled BLEU: 2.55

Even though the difference is not statistically significant, the transformer-en-ru model was chosen because it is faster and smaller.
Details in `src/finetune/eval_bleu.py`.
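
A minimal sketch of the comparison, assuming a `transformers` translation pipeline and sacreBLEU via the `evaluate` library; `sources` and `references` stand in for the English inputs and the OpenAI reference translations:

```python
import evaluate
from transformers import pipeline

bleu = evaluate.load("sacrebleu")

def score_model(model_name, sources, references, max_samples=100, **pipeline_kwargs):
    """Translate the English sources with the given model and score against the references."""
    translator = pipeline("translation", model=model_name, **pipeline_kwargs)
    hypotheses = [out["translation_text"] for out in translator(sources[:max_samples])]
    return bleu.compute(
        predictions=hypotheses,
        references=[[ref] for ref in references[:max_samples]],
    )["score"]

# score_model("Helsinki-NLP/opus-mt-en-ru", sources, references)
# score_model("facebook/nllb-200-distilled-1.3B", sources, references,
#             src_lang="eng_Latn", tgt_lang="rus_Cyrl")
```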

## Model finetuning

Simple seq2seq fine-tuning of the transformer-en-ru model.
Details in `notebooks/finetune/finetune.ipynb`.
The fine-tuned model is available on [hf](https://huggingface.co/under-tree/transformer-en-ru).
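
A minimal sketch of the fine-tuning setup with `Seq2SeqTrainer`; the column names, hyperparameters, and the `dataset` variable are assumptions, and the notebook defines the actual ones:

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "Helsinki-NLP/opus-mt-en-ru"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def preprocess(batch):
    # The "en"/"ru" column names are assumptions; the notebook defines the real ones.
    model_inputs = tokenizer(batch["en"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["ru"], max_length=512, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True)  # `dataset` is the translated training set

training_args = Seq2SeqTrainingArguments(
    output_dir="transformer-en-ru-finetuned",
    learning_rate=2e-5,                 # hyperparameters here are illustrative only
    per_device_train_batch_size=16,
    num_train_epochs=3,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
trainer.train()
```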

**Fine-tuned model results:**
- eval_loss: 0.656
- eval_bleu: 67.197 (suspiciously high)






<p><small>Project based on the <a target="_blank" href="https://drivendata.github.io/cookiecutter-data-science/">cookiecutter data science project template</a>. #cookiecutterdatascience</small></p>