
--------

## Installation

```bash
pip3 install -r requirements.txt
```

## Dataset

#### Training dataset
The training dataset is based on `saier/unarxive_citrec` [hf](https://huggingface.co/datasets/saier/unarxive_citrec).

*Details*:
- Train size: 9082
- Valid size: 702
- Test size: 568

All samples are between `128` and `512` characters long (TODO: characters -> tokens).
More details in `notebooks/data/dataset_download.ipynb`.
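
A minimal sketch of the download-and-filter step, assuming the standard `datasets` loader and a `text` column (the actual column names and filtering logic live in the notebook above):

```python
from datasets import load_dataset

# Load the citation-recommendation subset of unarXive from the Hugging Face Hub.
dataset = load_dataset("saier/unarxive_citrec")

# Keep only samples whose text falls into the 128-512 character range.
# The "text" column name is an assumption; see dataset_download.ipynb for the real one.
def in_length_range(example, lo=128, hi=512):
    return lo <= len(example["text"]) <= hi

dataset = dataset.filter(in_length_range)
print({split: len(dataset[split]) for split in dataset})
```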

After collecting the dataset, we translated the samples from English to Russian using the OpenAI API. Details in `notebooks/data/dataset_translate.ipynb`.
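
A minimal sketch of the translation step, assuming the chat completions API and the `gpt-3.5-turbo` model (the actual prompt, model, and batching are defined in the notebook):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def translate_to_russian(text: str, model: str = "gpt-3.5-turbo") -> str:
    """Translate a single English sample into Russian via the chat completions API."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Translate the following text from English to Russian."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content

# Example usage on one training sample:
# russian_text = translate_to_russian(dataset["train"][0]["text"])
```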

#### Dataset for model comparison (EvalDataset)
This dataset is based on `turkic_xwmt` (`subset=ru-en`, `split=test`) [hf](https://huggingface.co/datasets/turkic_xwmt).

Dataset size: 1000
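
A minimal sketch of loading the evaluation set, assuming the `ru-en` configuration can be passed directly to the standard `datasets` loader:

```python
from datasets import load_dataset

# Load the ru-en test split and keep the first 1000 samples.
eval_dataset = load_dataset("turkic_xwmt", "ru-en", split="test")
eval_dataset = eval_dataset.select(range(min(1000, len(eval_dataset))))
```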

## Model comparison

The model comparison is based on the BLEU score between each model's translations and the reference translations produced with OpenAI.

*Models*:
- transformer-en-ru: `Helsinki-NLP/opus-mt-en-ru` [hf](https://huggingface.co/Helsinki-NLP/opus-mt-en-ru)
- nllb-1.3B-distilled: `facebook/nllb-200-distilled-1.3B` [hf](https://huggingface.co/facebook/nllb-200-distilled-1.3B)

**Results**:
- transformer-en-ru BLEU: 2.58
- nllb-1.3B-distilled BLEU: 2.55

Even though the difference is not statistically significant, the transformer-en-ru model was chosen because it is faster and smaller.
Details in `src/finetune/eval_bleu.py`.
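
A minimal sketch of the comparison, assuming a `transformers` translation pipeline and sacreBLEU via the `evaluate` library; `sources` and `references` stand in for the English inputs and the OpenAI reference translations:

```python
import evaluate
from transformers import pipeline

bleu = evaluate.load("sacrebleu")

def score_model(model_name, sources, references, max_samples=100, **pipeline_kwargs):
    """Translate the English sources with the given model and score against the references."""
    translator = pipeline("translation", model=model_name, **pipeline_kwargs)
    hypotheses = [out["translation_text"] for out in translator(sources[:max_samples])]
    return bleu.compute(
        predictions=hypotheses,
        references=[[ref] for ref in references[:max_samples]],
    )["score"]

# score_model("Helsinki-NLP/opus-mt-en-ru", sources, references)
# score_model("facebook/nllb-200-distilled-1.3B", sources, references,
#             src_lang="eng_Latn", tgt_lang="rus_Cyrl")
```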

## Model finetuning

Simple seq2seq fine-tuning of the transformer-en-ru model.
Details in `notebooks/finetune/finetune.ipynb`.
The fine-tuned model is available on [hf](https://huggingface.co/under-tree/transformer-en-ru).
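
A minimal sketch of the fine-tuning setup with `Seq2SeqTrainer`; the column names, hyperparameters, and the `dataset` variable are assumptions, and the notebook defines the actual ones:

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "Helsinki-NLP/opus-mt-en-ru"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def preprocess(batch):
    # The "en"/"ru" column names are assumptions; the notebook defines the real ones.
    model_inputs = tokenizer(batch["en"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["ru"], max_length=512, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True)  # `dataset` is the translated training set

training_args = Seq2SeqTrainingArguments(
    output_dir="transformer-en-ru-finetuned",
    learning_rate=2e-5,                 # hyperparameters here are illustrative only
    per_device_train_batch_size=16,
    num_train_epochs=3,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
trainer.train()
```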

**Fine-tuned model results:**
- eval_loss: 0.656
- eval_bleu: 67.197 (suspiciously high)






<p><small>Project based on the <a target="_blank" href="https://drivendata.github.io/cookiecutter-data-science/">cookiecutter data science project template</a>. #cookiecutterdatascience</small></p>