diff --git a/.dvc/config b/.dvc/config
index e4688e6..4434bab 100644
--- a/.dvc/config
+++ b/.dvc/config
@@ -2,4 +2,4 @@
     autostage = true
     remote = storage
 ['remote "storage"']
-    url = gdrive://1XzdLLDSWCRT57Kj9ZqYlfTk_0phVu6Fz
+    url = gdrive://19-PaarPhbUW27F4XXpLXS1SBu0Dvzch1
diff --git a/.gitignore b/.gitignore
index 0c96245..50ec321 100644
--- a/.gitignore
+++ b/.gitignore
@@ -88,3 +88,4 @@ target/
 # Mypy cache
 .mypy_cache/
 /data
+/models
diff --git a/README.md b/README.md
index c735243..6bc970d 100644
--- a/README.md
+++ b/README.md
@@ -9,6 +9,7 @@ Project Organization
     ├── LICENSE
     ├── Makefile           <- Makefile with commands like `make data` or `make train`
     ├── README.md          <- The top-level README for developers using this project.
+    ├── conf               <- Configuration files
     ├── data
     │   ├── external       <- Data from third party sources.
     │   ├── interim        <- Intermediate data that has been transformed.
@@ -54,4 +55,74 @@ Project Organization
 
 --------
 
+## Installation
+
+```bash
+pip3 install -r requirements.txt
+```
+
+## Dataset
+
+#### Training dataset
+The training dataset is based on `saier/unarxive_citrec` [hf](https://huggingface.co/datasets/saier/unarxive_citrec).
+
+*Details*:
+```yaml
+Train size: 9082
+Valid size: 702
+Test size: 568
+```
+
+All samples are between `128` and `512` characters long (TODO: characters -> tokens)\
+More in `notebooks/data/dataset_download.ipynb`
+
+After collecting the dataset, we carefully translated the samples from English to Russian using the OpenAI API.\
+Details in `notebooks/data/dataset_translate.ipynb`
+
+#### Dataset for model comparison (EvalDataset)
+This dataset is based on `turkic_xwmt`, `subset=ru-en`, `split=test` [hf](https://huggingface.co/datasets/turkic_xwmt).
+
+Dataset size: 1000
+
+## Model comparison
+
+Models are compared by the BLEU score of their translations against reference translations produced by OpenAI.
+
+**Models**:\
+transformer-en-ru: `Helsinki-NLP/opus-mt-en-ru` [hf](https://huggingface.co/Helsinki-NLP/opus-mt-en-ru)\
+nnlb-1.3B-distilled: `facebook/nllb-200-distilled-1.3B` [hf](https://huggingface.co/facebook/nllb-200-distilled-1.3B)
+
+**Results**:
+```yaml
+transformer-en-ru BLEU: 2.58
+nnlb-1.3B-distilled BLEU: 2.55
+```
+
+Even though the difference is not statistically significant, the transformer-en-ru model was chosen because it is faster and smaller.\
+Details in `src/finetune/eval_bleu.py`
+
+## Model finetuning
+
+Simple seq2seq fine-tuning of transformer-en-ru.\
+Details in `notebooks/finetune/finetune.ipynb`.\
+Model on [hf](https://huggingface.co/under-tree/transformer-en-ru)
+
+**Fine-tuned model results**:
+```yaml
+eval_loss: 0.656
+eval_bleu: 67.197
+```
+(BLEU is suspiciously high)
+
+## Translation App
+
+**Synonyms Searcher**\
+The simple version is based on a `word2vec`-family model, namely `fasttext` ([link](https://fasttext.cc/docs/en/crawl-vectors.html)). We chose fasttext because it handles out-of-vocabulary words.
+
Project based on the cookiecutter data science project template. #cookiecutterdatascience
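For reference, the BLEU comparison described in the README above can be reproduced from the results CSV. The following is a minimal sketch, assuming the `pandas` and `sacrebleu` packages; the column names follow `conf/dataset/model_eval.yaml`, and the project's own evaluation code lives in `src/finetune/eval_bleu.py`:

```python
# Sketch of the BLEU-based model comparison (assumes sacrebleu + pandas;
# column names follow conf/dataset/model_eval.yaml).
import pandas as pd
import sacrebleu

df = pd.read_csv("data/processed/model_eval_results.csv")

# sacrebleu expects a list of hypotheses and a list of reference *streams*,
# hence the nested list for the single reference column.
references = [df["target"].astype(str).tolist()]

for candidate in ["transformer-en-ru", "nnlb-1.3B-distilled"]:
    bleu = sacrebleu.corpus_bleu(df[candidate].astype(str).tolist(), references)
    print(f"{candidate} BLEU: {bleu.score:.2f}")
```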
diff --git a/conf/.gitignore b/conf/.gitignore new file mode 100644 index 0000000..5b6b072 --- /dev/null +++ b/conf/.gitignore @@ -0,0 +1 @@ +config.yaml diff --git a/conf/config.yaml b/conf/config.yaml new file mode 100644 index 0000000..acfa789 --- /dev/null +++ b/conf/config.yaml @@ -0,0 +1,9 @@ +defaults: + - _self_ + - dataset: null + - model: null + - params: null + - setup: null + + +root: /Users/user010/Desktop/Programming/ML/En2RuTranslator diff --git a/conf/dataset/model_eval.yaml b/conf/dataset/model_eval.yaml new file mode 100644 index 0000000..567dd1c --- /dev/null +++ b/conf/dataset/model_eval.yaml @@ -0,0 +1,7 @@ +path: ${root}/data/processed/model_eval_results.csv + +cols: # cols to be used when calculating BLEU + reference: target + candidates: + - transformer-en-ru + - nnlb-1.3B-distilled \ No newline at end of file diff --git a/conf/dataset/model_eval_raw.yaml b/conf/dataset/model_eval_raw.yaml new file mode 100644 index 0000000..3caec78 --- /dev/null +++ b/conf/dataset/model_eval_raw.yaml @@ -0,0 +1 @@ +path: ${root}/data/processed/model_eval.csv \ No newline at end of file diff --git a/conf/dataset/unarxive.yaml b/conf/dataset/unarxive.yaml new file mode 100644 index 0000000..ddd2bb3 --- /dev/null +++ b/conf/dataset/unarxive.yaml @@ -0,0 +1 @@ +path: "waleko/unarXive-en2ru" \ No newline at end of file diff --git a/conf/model/fasttext_en.yaml b/conf/model/fasttext_en.yaml new file mode 100644 index 0000000..adc6df4 --- /dev/null +++ b/conf/model/fasttext_en.yaml @@ -0,0 +1,2 @@ +path: ${root}/models/embs/cc.en.100.bin +type: fasttext \ No newline at end of file diff --git a/conf/model/fasttext_ru.yaml b/conf/model/fasttext_ru.yaml new file mode 100644 index 0000000..2c816f9 --- /dev/null +++ b/conf/model/fasttext_ru.yaml @@ -0,0 +1,2 @@ +path: ${root}/models/embs/cc.ru.100.bin +type: fasttext \ No newline at end of file diff --git a/conf/model/nnlb_1.3B.yaml b/conf/model/nnlb_1.3B.yaml new file mode 100644 index 0000000..91c8c6d --- /dev/null +++ b/conf/model/nnlb_1.3B.yaml @@ -0,0 +1,2 @@ +name: nnlb-1.3B-distilled +model_and_tokenizer_name: facebook/nllb-200-distilled-1.3B \ No newline at end of file diff --git a/conf/model/opus_distilled_en_ru.yaml b/conf/model/opus_distilled_en_ru.yaml new file mode 100644 index 0000000..8a6e2a8 --- /dev/null +++ b/conf/model/opus_distilled_en_ru.yaml @@ -0,0 +1,4 @@ +name: opus-distilled-en-ru +model_and_tokenizer_name: "under-tree/transformer-en-ru" +output_dir: ${root}/models/${.name}/finetuned +type: seq2seq \ No newline at end of file diff --git a/conf/model/opus_en_ru.yaml b/conf/model/opus_en_ru.yaml new file mode 100644 index 0000000..5f6cdea --- /dev/null +++ b/conf/model/opus_en_ru.yaml @@ -0,0 +1,2 @@ +name: opus-en-ru +model_and_tokenizer_name: Helsinki-NLP/opus-mt-en-ru diff --git a/conf/model/random_attention_extractor.yaml b/conf/model/random_attention_extractor.yaml new file mode 100644 index 0000000..010e927 --- /dev/null +++ b/conf/model/random_attention_extractor.yaml @@ -0,0 +1 @@ +type: "random" \ No newline at end of file diff --git a/conf/notebooks/finetune/candidates_inference.yaml b/conf/notebooks/finetune/candidates_inference.yaml new file mode 100644 index 0000000..1ba38b7 --- /dev/null +++ b/conf/notebooks/finetune/candidates_inference.yaml @@ -0,0 +1,13 @@ +root: ??? 
+ +nnlb_model: + name: nnlb-1.3B-distilled + model_and_tokenizer_name: facebook/nllb-200-distilled-1.3B + +mt_model: + name: transformer-en-ru + model_and_tokenizer_name: Helsinki-NLP/opus-mt-en-ru + +inference_dataset_path: ${root}/data/processed/model_eval.csv +results_path: ${root}/data/processed/model_eval_results.csv + \ No newline at end of file diff --git a/models/.gitkeep b/conf/notebooks/finetune/finetune.yaml similarity index 100% rename from models/.gitkeep rename to conf/notebooks/finetune/finetune.yaml diff --git a/conf/notebooks/finetune/model_eval.yaml b/conf/notebooks/finetune/model_eval.yaml new file mode 100644 index 0000000..4157507 --- /dev/null +++ b/conf/notebooks/finetune/model_eval.yaml @@ -0,0 +1,7 @@ +root: ??? +load_dataset_params: + path: 'turkic_xwmt' + name: 'ru-en' + split: 'test' +save_path: '${root}/data/processed/model_eval.csv' + diff --git a/conf/params/finetune.yaml b/conf/params/finetune.yaml new file mode 100644 index 0000000..ac3e392 --- /dev/null +++ b/conf/params/finetune.yaml @@ -0,0 +1,15 @@ +batch_size: 16 +max_length: 512 +train_args: + evaluation_strategy: epoch + learning_rate: 2e-5 + per_device_train_batch_size: ${..batch_size} + per_device_eval_batch_size: ${..batch_size} + weight_decay: 0.01 + save_total_limit: 3 + num_train_epochs: 4 + predict_with_generate: true + +wandb_args: + report_to: wandb + run_name: finetune \ No newline at end of file diff --git a/conf/setup/all_models_example.yaml b/conf/setup/all_models_example.yaml new file mode 100644 index 0000000..4b72b91 --- /dev/null +++ b/conf/setup/all_models_example.yaml @@ -0,0 +1,6 @@ +# @package _global_ + +defaults: + - /model@model1: opus_en_ru + - /model@model2: opus_distilled_en_ru + - override /dataset: unarxive \ No newline at end of file diff --git a/conf/setup/finetune.yaml b/conf/setup/finetune.yaml new file mode 100644 index 0000000..9f69d00 --- /dev/null +++ b/conf/setup/finetune.yaml @@ -0,0 +1,7 @@ +# @package _global_ + +defaults: + - /model@pretrained: opus_en_ru + - /model@finetuned: opus_distilled_en_ru + - override /dataset: unarxive + - override /params: finetune \ No newline at end of file diff --git a/conf/setup/inference.yaml b/conf/setup/inference.yaml new file mode 100644 index 0000000..627fd2f --- /dev/null +++ b/conf/setup/inference.yaml @@ -0,0 +1,7 @@ +# @package _global_ + +defaults: + - /model@opus_model: opus_en_ru + - /model@nnlb_model: nnlb_1.3B + - /dataset@inference_dataset: model_eval_raw + - /dataset@result_dataset: model_eval \ No newline at end of file diff --git a/conf/setup/prod.yaml b/conf/setup/prod.yaml new file mode 100644 index 0000000..988c68b --- /dev/null +++ b/conf/setup/prod.yaml @@ -0,0 +1,8 @@ +# @package _global_ + +defaults: + - /model@dest_synonym_searcher: fasttext_ru + - /model@src_synonym_searcher: fasttext_en + - /model@translator: opus_distilled_en_ru + - /model@attention_extractor: random_attention_extractor + - override /params: finetune \ No newline at end of file diff --git a/custom_utils/__init__.py b/custom_utils/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/custom_utils/config_handler.py b/custom_utils/config_handler.py new file mode 100644 index 0000000..6e01d21 --- /dev/null +++ b/custom_utils/config_handler.py @@ -0,0 +1,25 @@ +from omegaconf import OmegaConf +import json +import typing as tp +import os +from hydra import initialize_config_dir, compose + +__ROOT_DIR = os.path.abspath(os.path.join(os.path.dirname(__file__), "..")) +__CONFIG_DIR = os.path.join(__ROOT_DIR, "conf") + +def 
read_config(config_dir: str = __CONFIG_DIR, overrides: tp.Optional[tp.List[str]] = None) -> OmegaConf:
+    """
+    :@param config_dir: path to config directory
+    :@param overrides: list of overrides (e.g. ["dataset=model_eval"])
+    :@return: OmegaConf object
+    """
+    config_dir = os.path.abspath(config_dir)
+    with initialize_config_dir(config_dir=config_dir, version_base=None):
+        cfg = compose(config_name="config", overrides=overrides)
+    cfg = OmegaConf.create(OmegaConf.to_yaml(cfg, resolve=True))
+    return cfg
+
+def pprint_config(cfg: OmegaConf) -> None:
+    "Pretty print config"
+    print(json.dumps(OmegaConf.to_container(cfg), indent=2))
diff --git a/data.dvc b/data.dvc
index 1c30a79..e52ba8a 100644
--- a/data.dvc
+++ b/data.dvc
@@ -1,6 +1,6 @@
 outs:
-- md5: 255799c6a8913d73679631d546a9dd88.dir
-  nfiles: 13
+- md5: a04b7051c1e5067e29c68137c321dae1.dir
+  nfiles: 15
   hash: md5
   path: data
-  size: 19936558
+  size: 21297660
diff --git a/models.dvc b/models.dvc
new file mode 100644
index 0000000..cee5780
--- /dev/null
+++ b/models.dvc
@@ -0,0 +1,6 @@
+outs:
+- md5: c32a6fc3220dba5ae7628692d397c852.dir
+  size: 4892930090
+  nfiles: 2
+  hash: md5
+  path: models
diff --git a/notebooks/data/dataset_generation.ipynb b/notebooks/data/dataset_generation.ipynb
deleted file mode 100644
index 5b1124b..0000000
--- a/notebooks/data/dataset_generation.ipynb
+++ /dev/null
@@ -1,136 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "code",
-   "execution_count": 3,
-   "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/plain": [
-       "(False, False)"
-      ]
-     },
-     "execution_count": 3,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
-   "source": [
-    "PROMPT = \"\"\"\\\n",
-    "Ты профессиональный тестировщик больших языковых моделей.\n",
-    "Сейчас твоя задача составить запросы, которые требуют от модели **сгенерировать изображение** (картину или фото).\n",
-    "Эти запросы должны использовать **как явные инструкции, так и намёки**.
Запросы должны быть **разнообразными** и иметь **разный уровень формальности**.\n", - "\n", - "Сгенирируй мне 10 таких запросов.\n", - "\n", - "Примеры:\n", - "Нарисуй, пожалуйста, фотоаппарат марки «Зенит» с красивым плетёным ремешком.\n", - "а можешь плиз нарисовать как мальчик и девочка на пляже строят замок из песка?\n", - "Изобрази мне кота Матроскина, который играет на гитаре.\n", - "фото как спичка горит, а кругом тают кубики льда\n", - "сделай мне иллюстрацию к маленькому принцу где он с розой разговаривает\n", - "Сделаешь картинку площади трех вокзалов в Москве?\n", - "хочу картинку с аниме девочкой\n", - "покажи мне портрет Иосифа Сталина\n", - "\n", - "Твои запросы:\n", - "\"\"\"" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!pip3 install openai python-dotenv" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from dotenv import load_dotenv\n", - "import openai\n", - "import time\n", - "import numpy as np\n", - "import os\n", - "path_to_env = os.path.join('..', '.env')\n", - "load_dotenv()\n", - "\n", - "\n", - "openai.api_key = os.getenv(\"OPENAI_API_KEY\")\n", - "\n", - "class QuestionGenerator:\n", - " def __init__(self, query: str, max_queries: int = 3):\n", - " self.query = query\n", - " self.max_queries = max_queries\n", - " \n", - " def send_query(self):\n", - " response = None\n", - " for _ in range(self.max_queries):\n", - " try:\n", - " response = openai.Completion.create(\n", - " model=\"text-babbage-001\",\n", - " prompt=self.query,\n", - " temperature=0.7,\n", - " max_tokens=100,\n", - " top_p=0.6,\n", - " frequency_penalty=0.5,\n", - " presence_penalty=0.0\n", - " )\n", - " # random sleep seconds \n", - " time.sleep(np.random.randint(1, 5))\n", - " break\n", - " except Exception as e:\n", - " print('Error', e)\n", - " \n", - " return response\n", - " \n", - " def parse_response(self, response):\n", - " if response is None:\n", - " return []\n", - " return response['choices'][0]['text'].strip().lower().split(', ')\n", - " \n", - " def __call__(self):\n", - " response = self.send_query()\n", - " samples = self.get_topics(response)\n", - " return samples" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "qg = QuestionGenerator(PROMPT)\n", - "qg()" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.5" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/notebooks/finetune/candidates_inference.ipynb b/notebooks/finetune/candidates_inference.ipynb new file mode 100644 index 0000000..31e6112 --- /dev/null +++ b/notebooks/finetune/candidates_inference.ipynb @@ -0,0 +1,1584 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The autoreload extension is already loaded. 
To reload it, use:\n", + " %reload_ext autoreload\n" + ] + } + ], + "source": [ + "%load_ext autoreload\n", + "%autoreload 2" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/Users/user010/Desktop/Programming/ML/En2RuTranslator\n" + ] + } + ], + "source": [ + "import os\n", + "import sys\n", + "\n", + "root_dir = os.path.abspath(os.path.join(os.getcwd(), '../..'))\n", + "print(root_dir)\n", + "assert os.path.exists(root_dir), f'Could not find root directory at {root_dir}'\n", + "sys.path.insert(0, root_dir)\n", + "\n", + "from custom_utils.config_handler import read_config, pprint_config" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{\n", + " \"root\": \"/Users/user010/Desktop/Programming/ML/En2RuTranslator\",\n", + " \"opus_model\": {\n", + " \"name\": \"opus-en-ru\",\n", + " \"model_and_tokenizer_name\": \"Helsinki-NLP/opus-mt-en-ru\"\n", + " },\n", + " \"nnlb_model\": {\n", + " \"name\": \"nnlb-1.3B-distilled\",\n", + " \"model_and_tokenizer_name\": \"facebook/nllb-200-distilled-1.3B\"\n", + " },\n", + " \"inference_dataset\": {\n", + " \"path\": \"/Users/user010/Desktop/Programming/ML/En2RuTranslator/data/processed/model_eval.csv\"\n", + " },\n", + " \"result_dataset\": {\n", + " \"path\": \"/Users/user010/Desktop/Programming/ML/En2RuTranslator/data/processed/model_eval_results.csv\",\n", + " \"cols\": {\n", + " \"reference\": \"target\",\n", + " \"candidates\": [\n", + " \"transformer-en-ru\",\n", + " \"nnlb-1.3B-distilled\"\n", + " ]\n", + " }\n", + " }\n", + "}\n" + ] + } + ], + "source": [ + "overrides = [\"setup=inference\"]\n", + "cfg = read_config(overrides=overrides)\n", + "pprint_config(cfg)" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [], + "source": [ + "from transformers import AutoModelForSeq2SeqLM, AutoTokenizer\n", + "sample_texts = [\"Hey, how are you?\", \"My name is John Smith, I live in the United States of America.\",\n", + " \"I love NLP and Transformers!\"]\n", + "\n", + "def get_translations(model: AutoModelForSeq2SeqLM, tokenizer: AutoTokenizer, sample_texts: list,\n", + " special_gen_params: dict = None) -> list:\n", + " special_gen_params = special_gen_params or {}\n", + " print(\"Tokenizing...\")\n", + " inputs = tokenizer(sample_texts, return_tensors=\"pt\", padding=True, truncation=True, max_length=600)\n", + " print(\"Generating...\")\n", + " translated_tokens = model.generate(\n", + " **inputs,\n", + " **special_gen_params,\n", + " max_length=600,\n", + " early_stopping=True\n", + " )\n", + " print(\"Decoding...\")\n", + " translated_texts = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)\n", + " return translated_texts\n", + "\n", + "def print_translations(source: list[str], target: list[str]):\n", + " assert len(source) == len(target), \"Source and target lists must be of same length\"\n", + " for src, tgt in zip(source, target):\n", + " print(f\"Source: {src}\")\n", + " print(f\"Target: {tgt}\")\n", + " print()" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [], + "source": [ + "model_and_tokenizer_name = cfg.nnlb_model.model_and_tokenizer_name\n", + "nnlb_tokenizer = AutoTokenizer.from_pretrained(model_and_tokenizer_name)\n", + "nnlb_model = AutoModelForSeq2SeqLM.from_pretrained(model_and_tokenizer_name)" + ] + }, + { + 
"cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Tokenizing...\n", + "Generating...\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/user010/Desktop/Programming/ML/En2RuTranslator/venv/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:399: UserWarning: `num_beams` is set to 1. However, `early_stopping` is set to `True` -- this flag is only used in beam-based generation modes. You should set `num_beams>1` or unset `early_stopping`.\n", + " warnings.warn(\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Decoding...\n", + "Source: Hey, how are you?\n", + "Target: Привет, как дела?\n", + "\n", + "Source: My name is John Smith, I live in the United States of America.\n", + "Target: Меня зовут Джон Смит, я живу в Соединенных Штатах Америки.\n", + "\n", + "Source: I love NLP and Transformers!\n", + "Target: Я люблю НЛП и Трансформеров!\n", + "\n" + ] + } + ], + "source": [ + "translations = get_translations(nnlb_model, nnlb_tokenizer, sample_texts, \n", + " special_gen_params={\"forced_bos_token_id\": nnlb_tokenizer.lang_code_to_id[\"rus_Cyrl\"]})\n", + "print_translations(sample_texts, translations)" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/user010/Desktop/Programming/ML/En2RuTranslator/venv/lib/python3.11/site-packages/transformers/models/marian/tokenization_marian.py:197: UserWarning: Recommended: pip install sacremoses.\n", + " warnings.warn(\"Recommended: pip install sacremoses.\")\n" + ] + } + ], + "source": [ + "from transformers import AutoTokenizer, AutoModelForSeq2SeqLM\n", + "\n", + "model_and_tokenizer_name = cfg.opus_model.model_and_tokenizer_name\n", + "opus_tokenizer = AutoTokenizer.from_pretrained(model_and_tokenizer_name)\n", + "opus_model = AutoModelForSeq2SeqLM.from_pretrained(model_and_tokenizer_name)" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Tokenizing...\n", + "Generating...\n", + "Decoding...\n", + "Source: Hey, how are you?\n", + "Target: Привет, как дела?\n", + "\n", + "Source: My name is John Smith, I live in the United States of America.\n", + "Target: Меня зовут Джон Смит, я живу в Соединенных Штатах Америки.\n", + "\n", + "Source: I love NLP and Transformers!\n", + "Target: Я люблю NLP и Transformers!\n", + "\n" + ] + } + ], + "source": [ + "translations = get_translations(opus_model, opus_tokenizer, sample_texts)\n", + "print_translations(sample_texts, translations)" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + " | source | \n", + "target | \n", + "
EvalDataset preview (`model_eval.csv`):

|   | source | target |
|---|--------|--------|
| 0 | The intention would also be to infiltrate terr... | Террористов также обучают методам проникновени... |
| 1 | Officials say that as the latest information a... | По последним данным представителей власти , в ... |
| 2 | While the Balakot camp was reactivated by the ... | Джаиш-е-Мухаммад возобновили работу террористи... |
| 3 | The incident in which Pakistan used drones to ... | В качестве яркого примера новой стратегии паки... |
| 4 | Officials tell OneIndia that terror groups wou... | Представители власти рассказали порталу OneInd... |

Results preview with model translations (`model_eval_results.csv`):

|   | source | target | transformer-en-ru | nnlb-1.3B-distilled |
|---|--------|--------|-------------------|---------------------|
| 0 | The intention would also be to infiltrate terr... | Террористов также обучают методам проникновени... | Кроме того, намерение состоит в том, чтобы про... | Имеется в виду также проникновение террористов... |
| 1 | Officials say that as the latest information a... | По последним данным представителей власти , в ... | Официальные лица говорят, что в качестве самой... | Официальные лица говорят, что по последней инф... |
| 2 | While the Balakot camp was reactivated by the ... | Джаиш-е-Мухаммад возобновили работу террористи... | В то время как лагерь в Балакоте был восстанов... | В то время как лагерь Балакота был активирован... |
| 3 | The incident in which Pakistan used drones to ... | В качестве яркого примера новой стратегии паки... | Инцидент, в ходе которого Пакистан использовал... | Инцидент, когда Пакистан использовал беспилотн... |
| 4 | Officials tell OneIndia that terror groups wou... | Представители власти рассказали порталу OneInd... | Официальные лица сообщают одной Индии, что тер... | Официальные лица говорят OneIndia, что террори... |
/home/jovyan/rodion/other/trans/notebooks/finetune/wandb/run-20231117_130518-fgt5hc0j
"
+ ],
+ "text/plain": [
| Epoch | Training Loss | Validation Loss | Bleu      | Gen Len    |
|-------|---------------|-----------------|-----------|------------|
| 1     | 0.941200      | 0.653884        | 67.599700 | 127.287700 |
| 2     | 0.697700      | 0.608778        | 69.201200 | 127.783500 |
| 3     | 0.631200      | 0.589383        | 69.907900 | 127.373200 |
| 4     | 0.587600      | 0.584227        | 70.363300 | 126.859000 |
"
+ ],
+ "text/plain": [
+ "Run history:
eval/bleu                       ▂▅▇██▁
eval/gen_len                    ▅█▅▂▂▁
eval/loss                       █▃▂▁▁█
eval/runtime                    ██▇▇▇▁
eval/samples_per_second         ▁▁▆███
eval/steps_per_second           ▁▁▅▆▆█
train/epoch                     ▁▁▃▄▅▆▇████
train/global_step               ▁▁▃▄▅▆▇████
train/learning_rate             █▆▃▁
train/loss                      █▃▂▁
train/total_flos                ▁
train/train_loss                ▁
train/train_runtime             ▁
train/train_samples_per_second  ▁
train/train_steps_per_second    ▁

Run summary:

eval/bleu                       67.197
eval/gen_len                    126.7394
eval/loss                       0.65617
eval/runtime                    194.0045
eval/samples_per_second         2.928
eval/steps_per_second           0.186
train/epoch                     4.0
train/global_step               2272
train/learning_rate             0.0
train/loss                      0.5876
train/total_flos                2196400063905792.0
train/train_loss                0.69742
train/train_runtime             1218.4183
train/train_samples_per_second  29.816
train/train_steps_per_second    1.865
View job at https://wandb.ai/wide-learning/huggingface/jobs/QXJ0aWZhY3RDb2xsZWN0aW9uOjExNjYyNzQ4OA==/version_details/v0
Synced 5 W&B file(s), 0 media file(s), 11 artifact file(s) and 0 other file(s)"
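The run summarized above matches the hyperparameters in `conf/params/finetune.yaml`. A minimal sketch of the corresponding `Seq2SeqTrainer` setup follows (the actual code lives in `notebooks/finetune/finetune.ipynb`; the `en`/`ru` column names and the `validation` split name are assumptions about the `waleko/unarXive-en2ru` dataset):

```python
# Sketch of the fine-tuning setup; hyperparameters follow conf/params/finetune.yaml.
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

checkpoint = "Helsinki-NLP/opus-mt-en-ru"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
dataset = load_dataset("waleko/unarXive-en2ru")

def preprocess(batch):
    # Column names "en"/"ru" are assumed; check the dataset card for the real schema.
    model_inputs = tokenizer(batch["en"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["ru"], max_length=512, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True)

args = Seq2SeqTrainingArguments(
    output_dir="models/opus-distilled-en-ru/finetuned",  # per conf/model/opus_distilled_en_ru.yaml
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=4,
    predict_with_generate=True,
    report_to="wandb",
    run_name="finetune",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],  # split name assumed
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
trainer.train()
```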
+ ],
+ "text/plain": [
+ "./wandb/run-20231117_130518-fgt5hc0j/logs
"
+ ],
+ "text/plain": [
+ "
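Finally, for the Translation App: `conf/setup/prod.yaml` wires the fasttext embeddings from `conf/model/fasttext_en.yaml` / `fasttext_ru.yaml` into the synonym searchers. A minimal sketch of such a searcher, assuming the `fasttext` Python package; the class and method names here are illustrative, not the project's actual API:

```python
# Sketch of a fasttext-backed synonyms searcher (illustrative names; the model
# path follows conf/model/fasttext_en.yaml).
import fasttext

class SynonymSearcher:
    def __init__(self, model_path: str):
        # e.g. models/embs/cc.en.100.bin
        self.model = fasttext.load_model(model_path)

    def synonyms(self, word: str, k: int = 5) -> list[str]:
        # get_nearest_neighbors returns (score, word) pairs; fasttext's subword
        # n-grams let it embed out-of-vocabulary words as well.
        return [w for _, w in self.model.get_nearest_neighbors(word, k=k)]

searcher = SynonymSearcher("models/embs/cc.en.100.bin")
print(searcher.synonyms("translation"))
```

Subword n-grams are also why fasttext was preferred for this component: a query word that never appeared in the training corpus still gets a usable embedding, so nearest-neighbor lookup degrades gracefully instead of failing outright.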