#### Training dataset (TrainDataset)
The training dataset is based on `saier/unarxive_citrec` [hf](https://huggingface.co/datasets/saier/unarxive_citrec).

*Details*:
```yaml
Train size: 9082
Valid size: 702
Test size: 568
```
All samples are between `128` and `512` characters long (TO-DO: characters -> tokens).\
More details in `notebooks/data/dataset_download.ipynb`.
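For reference, a minimal sketch of the download-and-filter step, assuming the `datasets` library and that the text column is named `text` (the exact code lives in `notebooks/data/dataset_download.ipynb`):

```python
# Minimal sketch of the dataset preparation; the column name "text" is an assumption.
from datasets import load_dataset

dataset = load_dataset("saier/unarxive_citrec")

# Keep only samples between 128 and 512 characters long.
dataset = dataset.filter(lambda s: 128 <= len(s["text"]) <= 512)

print({split: len(ds) for split, ds in dataset.items()})
```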

After collecting the dataset, we carefully translated the samples from English to Russian using the OpenAI API.\
Details in `notebooks/data/dataset_translate.ipynb`
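A hedged sketch of the translation step (the actual prompt and model choice are in the notebook; the model name below is an assumption):

```python
# Sketch of per-sample translation via the OpenAI API (openai>=1.0).
# Model name and prompt are assumptions, not the project's actual settings.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def translate_to_russian(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Translate the user's text from English to Russian."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content
```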

#### Dataset for model comparison (EvalDataset)
This dataset is based on `turkic_xwmt`, `subset=ru-en`, `split=test` [hf](https://huggingface.co/datasets/turkic_xwmt).
Dataset size: 1000
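Loading it with `datasets` might look like this (a sketch; the actual code may differ):

```python
from datasets import load_dataset

# subset=ru-en, split=test, as described above
eval_dataset = load_dataset("turkic_xwmt", "ru-en", split="test")
print(len(eval_dataset))  # 1000
```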

Model comparison is based on the BLEU score between each model's translations and reference translations produced with OpenAI.

**Models**:\
transformer-en-ru: `Helsinki-NLP/opus-mt-en-ru` [hf](https://huggingface.co/Helsinki-NLP/opus-mt-en-ru)\
nllb-1.3B-distilled: `facebook/nllb-200-distilled-1.3B` [hf](https://huggingface.co/facebook/nllb-200-distilled-1.3B)


**Results**:
```yaml
transformer-en-ru BLEU: 2.58
nllb-1.3B-distilled BLEU: 2.55
```

Even though the difference is not statistically significant, the transformer-en-ru model was chosen since it is faster and smaller.\
Details in `src/finetune/eval_bleu.py`.
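A condensed sketch of what the evaluation script does, assuming `sacrebleu` and the `transformers` pipeline API (the data wiring and example sentences are illustrative):

```python
# Compare a candidate model against OpenAI reference translations with BLEU.
from sacrebleu import corpus_bleu
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-ru")

sources = ["Neural networks approximate functions."]       # English inputs
references = [["Нейронные сети аппроксимируют функции."]]  # one reference stream (OpenAI translations)

hypotheses = [out["translation_text"] for out in translator(sources)]
print(corpus_bleu(hypotheses, references).score)
```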

## Model finetuning

Simple seq2seq finetuning of the transformer-en-ru model.\
Details in `notebooks/finetune/finetune.ipynb`.\
Model on [hf](https://huggingface.co/under-tree/transformer-en-ru).
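A condensed sketch of the finetuning setup, assuming the standard `Seq2SeqTrainer` recipe; hyperparameters and column names are assumptions, not the notebook's actual values:

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

checkpoint = "Helsinki-NLP/opus-mt-en-ru"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

def preprocess(batch):
    # Column names "en"/"ru" are assumptions about the translated dataset.
    inputs = tokenizer(batch["en"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["ru"], max_length=512, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

args = Seq2SeqTrainingArguments(
    output_dir="transformer-en-ru",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    predict_with_generate=True,
)

# Wiring (requires the tokenized dataset from the steps above):
# tokenized = dataset.map(preprocess, batched=True)
# trainer = Seq2SeqTrainer(
#     model=model,
#     args=args,
#     train_dataset=tokenized["train"],
#     eval_dataset=tokenized["validation"],
#     data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
# )
# trainer.train()
```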

**Fine-tuned model results**:
```yaml
eval_loss: 0.656
eval_bleu: 67.197
```
(BLEU is suspiciously high)
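The published checkpoint can be tried directly from the Hub, e.g.:

```python
from transformers import pipeline

translator = pipeline("translation", model="under-tree/transformer-en-ru")
print(translator("Attention is all you need.")[0]["translation_text"])
```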


