diff --git a/configs/config.quickstart.yml b/configs/config.quickstart.yml index 38c11f03b..da115da6a 100644 --- a/configs/config.quickstart.yml +++ b/configs/config.quickstart.yml @@ -19,6 +19,9 @@ experiment: best-model: perplexity + opusfilter: + config: default #Otherwise, specify path to opusfilter configuration 'configs/opusfilter/config.opusfilter.yaml' + marian-args: training-student: disp-freq: 10 @@ -30,11 +33,6 @@ marian-args: save-freq: 100 valid-freq: 100 after: 500u - decoding-backward: - mini-batch-words: 2000 - decoding-teacher: - mini-batch-words: 1000 - precision: float16 datasets: train: diff --git a/docs/configs/configuration_files.md b/docs/configs/configuration_files.md index 736b20176..e916f0f10 100644 --- a/docs/configs/configuration_files.md +++ b/docs/configs/configuration_files.md @@ -1,19 +1,20 @@ -# Configuration files +# Configuration Files -Configuration files are in [YAML](https://yaml.org/) format. -At the top level, they have two sections: +The configuration files for OpusDistillery are written in [YAML](https://yaml.org/) format and are divided into two main sections: -* `experiment`: contains all the relevant information for your experiment, except the information on which datasets to use. -* `datasets`: contains the infromation regarding the datasets used for training, development and evaluation. Datasets are explained in [Dataset importers](downloading_and_selecting_data.md). +- **`experiment`**: Contains the general setup and parameters for the experiment, excluding dataset information. +- **`datasets`**: Specifies the datasets used for training, development, and evaluation. Details about datasets can be found in [Dataset Importers](downloading_and_selecting_data.md). -At the beginning of your `experiment` section, you should define the following: +### Experiment Setup -* `dirname`: directory name where everything will be stored. -* `name`: name of the experiment you are running. All generated data and models will be stored in `dirname`/`name` -* `langpairs`: a list of the language pairs you want in your student model, with **two letter codes** +In the `experiment` section, the following key parameters must be defined: -```yaml +- **`dirname`**: The directory where all experiment outputs will be stored. +- **`name`**: The name of the experiment. All generated data and models will be saved under `dirname`/`name`. +- **`langpairs`**: A list of language pairs for the student model, using ISO two-letter language codes. +Example configuration: +```yaml experiment: dirname: test name: fiu-eng @@ -27,42 +28,45 @@ experiment: ### OpusFilter -We have added support for using [OpusFilter](https://github.com/Helsinki-NLP/OpusFilter), a tool for filtering and combining parallel corpora. For data filtering, instead of the default cleaning, you can choose to use opusfilter with a default configuration or with a specific configuration you provide. +OpusDistillery supports [OpusFilter](https://github.com/Helsinki-NLP/OpusFilter), a tool for filtering and combining parallel corpora. Instead of the default cleaning, you can choose to filter data using OpusFilter with either a default configuration or a custom configuration that you provide. 
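+To give a rough idea of what a custom configuration can contain, here is a small, illustrative sketch in OpusFilter's own YAML syntax (the filter names and thresholds are only examples; the exact layout expected by the pipeline is shown in the example configurations linked below):
+
+```yaml
+filters:
+  - LengthFilter:
+      unit: word
+      min_length: 1
+      max_length: 100
+  - LengthRatioFilter:
+      unit: word
+      threshold: 3
+```
+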
-In the configuration file, if you want to use a [default](https://github.com/Helsinki-NLP/OpusDistillery/blob/multi-ftt/configs/pipeline/clean/run-opusfilter.py#13) configuration, you can see how in [this example](https://github.com/Helsinki-NLP/OpusDistillery/blob/multi-ftt/configs/configs/opusfilter/config.fiu-eng.opusfilter.yml#L33). Otherwise, you can specify the path to a specific file with an Opusfilter configuration such as [this one](https://github.com/Helsinki-NLP/OpusDistillery/blob/multi-ftt/configs/configs/opusfilter/config.opusfilter.yml).
+In the configuration file, if you want to use a default configuration, see this [example](https://github.com/Helsinki-NLP/OpusDistillery/blob/multi-ftt/configs/pipeline/clean/run-opusfilter.py#13).
+Otherwise, you can specify the path to a custom OpusFilter configuration file such as [this one](https://github.com/Helsinki-NLP/OpusDistillery/blob/multi-ftt/configs/configs/opusfilter/config.opusfilter.yml).
```yaml
opusfilter:
-  config: default # Otherwise, specify path to opusfilter configuration 'configs/opusfilter/config.opusfilter.yaml'
+  config: default # Or specify the path to an OpusFilter configuration file
```
### Bicleaner AI
-At the moment, this is not working.
+Currently, Bicleaner AI is not operational. Do you want to collaborate? Feel free to work on this [issue](https://github.com/Helsinki-NLP/OpusDistillery/issues/3).
## Teacher models
-You can choose a teacher from OPUS-MT or from Hugging Face (beware, runs on CPU!).
+You can select a teacher model from OPUS-MT or Hugging Face.
### OPUS-MT Teachers
-It is defined by:
+To specify an OPUS-MT teacher, use:
* `opusmt-teacher`
-This can be either of the following: 1. the URL to an OPUS-MT model
+It can be one of the following:
+
+1. A URL to an OPUS-MT model:
```yaml
opusmt-teacher: "https://object.pouta.csc.fi/Tatoeba-MT-models/fiu-eng/opus4m-2020-08-12.zip"
```
-2. the path to an OPUS-MT model,
+2. A path to a local OPUS-MT model:
+
```yaml
opusmt-teacher: "/path/to/opus-mt/model"
```
-3. a list of OPUS-MT models that will be used all together (any combination of the previous two).
+3. A list of OPUS-MT models (any combination of the above):
```yaml
opusmt-teacher:
@@ -71,7 +75,7 @@ This can be either of the following:
```
-4. In the case of multilingual students, you can combine different teachers. In this case, it should be a dictionary, specifying each teacher per language pair.
+4. For multilingual students, specify different teachers for each language pair:
```yaml
opusmt-teacher:
@@ -80,7 +84,7 @@ This can be either of the following:
  en-be: "https://object.pouta.csc.fi/Tatoeba-MT-models/eng-bel/opus+bt-2021-03-07.zip"
```
-5. `best` which will select the best teacher available for each language pair by checking the FLORES200+ scores from the [OPUS-MT dashboard](https://opus.nlpl.eu/dashboard).
+5. Use the `best` option to automatically select the best teacher for each language pair, based on FLORES200+ scores from the [OPUS-MT dashboard](https://opus.nlpl.eu/dashboard/).
```yaml
opusmt-teacher: "best"
```
### Hugging Face Teachers
+You can also use a [Hugging Face](https://huggingface.co/) model as a teacher.
+
+* `modelname`: The model identifier from the Hugging Face hub.
+* `modelclass`: The class of the model being loaded.
+ +```yaml + huggingface: + modelname: "Helsinki-NLP/opus-mt-mul-en" + modelclass: "transformers.AutoModelForSeq2SeqLM" +``` + +You can also configure the decoding options: + +```yaml + huggingface: + modelname: "HPLT/translate-et-en-v1.0-hplt_opus" + modelclass: "transformers.AutoModelForSeq2SeqLM" + config: + top_k: 50 + top_p: 0.90 + temperature: 0.1 + max_new_tokens: 128 +``` + +For models that use language tags, additional parameters are required: + +* `lang_info`: Set to True if language tags are needed. +* `lang_tags`: A mapping of language codes to the tags used by the model. + +```yaml + huggingface: + modelname: "facebook/nllb-200-distilled-600M" + modelclass: "transformers.AutoModelForSeq2SeqLM" + lang_info: True + lang_tags: + en: eng_Latn + et: est_Latn +``` -It is defined like: +Finally, for models requiring a prompt, you can define it like this: ```yaml huggingface: - model: "facebook/nllb-200-distilled-600M" - task: translation #if not in config, assumes "translation by default" + modelname: "google-t5/t5-small" + modelclass: "transformers.AutoModelForSeq2SeqLM" + lang_tags: + en: English + de: German + prompt: "Translate {src_lang} to {tgt_lang}: {source}" ``` -Where model is the identifier from the hub and the task is a sequence-to-sequence task that produces translations with the pipeline implementation. +In this case, the lang_tags mapping will be used in the prompt. -When using a HF model as teacher, there is no scoring and no cross-entropy filtering. +Note: When using a Hugging Face model as a teacher, there is no scoring or cross-entropy filtering. ## Backward models -At the moment, the type of backward models available are only OPUS-MT. +Currently, only OPUS-MT models are available as backward models for scoring translations. -It is defined by: +To specify a backward model, use: -* `opusmt-backward`: the URL or path to an OPUS-MT model to be used as a backward model for scoring translations. As the teacher, it can also be a dictionary specifying a backward model per language pair as well as `best`. +* `opusmt-backward`: The URL or path to an OPUS-MT model. Like the teacher models, this can also be a dictionary for multilingual students or `best`. ```yaml opusmt-backward: @@ -119,27 +165,27 @@ It is defined by: If left empty, the cross-entropy filtering step will be skipped. ## Multilinguality -Specify if the teacher, the backward and the student models are many-to-one to be able to deal properly with language tags. By default, this is `False`. +Specify whether the teacher, backward, and student models are many-to-one to properly handle language tags. By default, this is set to `False`. -* `one2many-teacher`: `True` or `False` (default). If `opusmt-teacher` is "best", then this should be also "best" -* `one2many-backward`: `True` or `False` (default). If `opusmt-backward` is "best", then this should be also "best" +* `one2many-teacher`: `True` or `False` (default). If `opusmt-teacher` is set to `best`, this should also be `best`. +* `one2many-backward`: `True` or `False` (default). If `opusmt-backward` is set to `best`, this should also be `best`. * `one2many-student`: `True` or `False` (default). 
```yaml
-# Specify if the teacher and the student are many2one
+# Specify if the teacher and the student are one2many
one2many-teacher: True
one2many-student: True
```
## Training
### Marian arguments
-These configs override pipeline/train/configs with [Marian settings](https://marian-nmt.github.io/docs/cmd/marian/)
+You can override default pipeline settings with [Marian-specific settings](https://marian-nmt.github.io/docs/cmd/marian/).
-The options are: `training-teacher`, `decoding-teacher`,`training-backward`, `decoding-backward`,`training-student`, `training-student-finetuned`
+You can use the following options: `training-teacher`, `decoding-teacher`, `training-backward`, `decoding-backward`, `training-student`, `training-student-finetuned`.
```yaml
marian-args:
-  #these configs override pipeline/train/configs
+  # These configs override pipeline/train/configs
  training-student:
    dec-depth: 3
    enc-depth: 3
@@ -160,11 +206,12 @@ The options are: `training-teacher`, `decoding-teacher`,`training-backward`, `de
### Opustrainer
-We have also added support for using [OpusTrainer](https://github.com/hplt-project/OpusTrainer), a tool for curriculum training and data augmentation.
+OpusDistillery supports [OpusTrainer](https://github.com/hplt-project/OpusTrainer) for curriculum training and data augmentation.
-In the configuration file, you can specify a path to the OpusTrainer configuration as in [here](https://github.com/Helsinki-NLP/OpusDistillery/blob/multi-ftt/configs/opustrainer/config.fiu-eng.opustrainer.yml#L37). However, this assumes that you already now the final paths of the data as specified in [here](https://github.com/Helsinki-NLP/OpusDistillery/blob/multi-ftt/configs/opustrainer/config.fiu-eng.opustrainer.stages.yml).
+You can specify a path to the OpusTrainer configuration, such as in [this example](https://github.com/Helsinki-NLP/OpusDistillery/blob/multi-ftt/configs/opustrainer/config.fiu-eng.opustrainer.yml#L37).
+This assumes you know the final paths of the data, as defined in [this file](https://github.com/Helsinki-NLP/OpusDistillery/blob/multi-ftt/configs/opustrainer/config.fiu-eng.opustrainer.stages.yml).
-At the moment, this is only implemented for student training.
+Currently, this is implemented only for student training.
```yaml
opustrainer:
@@ -174,7 +221,7 @@ At the moment, this is only implemented for student training.
## Exporting
-The final student model is in the Bergamot format, which makes use of shortlists for training (and these shorlists are trained using alignments). For that reason, we have also implemented the option to only train a student with the tiny architecture without the guided alignment. For that purpose, the user needs to specify "export" in the configuration file like this:
+The final student model is exported in the Bergamot format, which uses lexical shortlists, and those shortlists are trained from word alignments. If you do not need this export, you can instead train a tiny student without guided alignment by setting `export` in the configuration file:
```yaml
export: "no"
@@ -182,8 +229,9 @@ The final student model is in the Bergamot format, which makes use of shortlists
### Other
-* `parallel-max-sentences`: maximum parallel sentences to download from each dataset.
-* `split-length`: the amount of sentences into which you want to split your training data for forward translation.
-* `best-model`: metric to select your best model.
-* `spm-sample-size`: sample size to train spm vocabulary of the student.
-* `student-prefix`: in case you want to train multiple students with exactly the same data, you can add this prefix which will allow you to train multiple students in the same directory structure. Find more about the directory structure [here](../pipeline/dir_structure.md).
+* `parallel-max-sentences`: Maximum parallel sentences to download from each dataset.
+* `split-length`: The number of sentences into which you want to split your training data for forward translation.
+* `best-model`: Metric used to select the best model.
+* `spm-sample-size`: Sample size for training the student’s SPM vocabulary.
+* `spm-vocab-size`: Vocabulary size of the student’s SPM model.
+* `student-prefix`: To train multiple students on exactly the same data, add a prefix to the student name; this allows several students to be trained under the same directory structure. More details on the directory structure can be found [here](../pipeline/dir_structure.md).
diff --git a/docs/configs/downloading_and_selecting_data.md b/docs/configs/downloading_and_selecting_data.md
index eda4e86d4..b94c5276d 100644
--- a/docs/configs/downloading_and_selecting_data.md
+++ b/docs/configs/downloading_and_selecting_data.md
@@ -38,9 +38,8 @@ Make sure to check licenses of the datasets before using them.
## Adding a new importer
-Just add a shell script to [corpus](https://github.com/Helsinki-NLP/OpusDistillery/tree/main/pipeline/data/importers/corpus) or [mono](https://github.com/Helsinki-NLP/OpusDistillery/tree/main/pipeline/data/importers/mono) which is named as `.sh`
-and accepts the same parameters as the other scripts from the same folder.
+Just add a shell script to [corpus](https://github.com/Helsinki-NLP/OpusDistillery/tree/main/pipeline/data/importers/corpus) or [mono](https://github.com/Helsinki-NLP/OpusDistillery/tree/main/pipeline/data/importers/mono) which is named as `.sh` and accepts the same parameters as the other scripts from the same folder.
## Issues
-- Currently, it is not possible to download specific datasets per language pair, right now the tool only downloads the same dataset for all language pairs. If a dataset doesn't exist for a given language pair, it creates dummy files.
-- Currently, there is no support to download monolingual datasets. The use of monolingual data is not implemented and only supports the use of bilingual data at the moment.
+* Currently, it is not possible to download specific datasets per language pair; the tool downloads the same dataset for all language pairs. If a dataset doesn't exist for a given language pair, dummy files are created. Do you want to collaborate? Feel free to work on this [issue](https://github.com/Helsinki-NLP/OpusDistillery/issues/1).
+* There is currently no support for downloading monolingual datasets. The use of monolingual data is not fully implemented; only bilingual data is supported at this time. Do you want to collaborate? Feel free to work on this [issue](https://github.com/Helsinki-NLP/OpusDistillery/issues/2).
\ No newline at end of file
diff --git a/docs/configs/examples.md b/docs/configs/examples.md
index 539ad25a3..7ef9d55f6 100644
--- a/docs/configs/examples.md
+++ b/docs/configs/examples.md
@@ -1,34 +1,33 @@
# Examples of Configuration Files
-Next we provide some configuration examples that will guide you for defining your own.
-
+Below are configuration examples to guide you in defining your own configurations.
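+
+All of the linked configuration files share the same overall layout. As a minimal sketch (the values below are illustrative and borrowed from the quickstart setup; the available dataset importers are described in [Dataset Importers](downloading_and_selecting_data.md)):
+
+```yaml
+experiment:
+  dirname: test
+  name: fiu-eng
+  langpairs:
+    - et-en
+    - fi-en
+    - hu-en
+  opusmt-teacher: "https://object.pouta.csc.fi/Tatoeba-MT-models/fiu-eng/opus4m-2020-08-12.zip"
+  opusmt-backward: "https://object.pouta.csc.fi/Tatoeba-MT-models/eng-fiu/opus2m-2020-08-01.zip"
+  one2many-backward: True
+  best-model: perplexity
+  split-length: 1000
+
+datasets:
+  # training, development and evaluation sets go here (see the Dataset Importers page)
+```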
## Multilinguality -The different possible distilling scenarios that we envision and that are covered are the following (o2m: one2many, m2o: many2one, m2m: many2many): +The following table illustrates different distillation scenarios (o2m: one-to-many, m2o: many-to-one, m2m: many-to-many): |ID | Configuration | Teacher | Student | Example config | |---|-----------------------|---------|---------|---------------------------------------------| -| 1 | bilingual - bilingual | en-et | en-et | [Config file](https://github.com/Helsinki-NLP/OpusDistillery/blob/multi-ftt/configs/config.1.o2o.o2o.yml) | -| 2 | o2m - bilingual | eng-fiu | en-et | [Config file](https://github.com/Helsinki-NLP/OpusDistillery/blob/multi-ftt/configs/config.2.o2m.o2o.yml) | -| 3 | o2m - o2m | eng-fiu | eng-fiu | [Config file](https://github.com/Helsinki-NLP/OpusDistillery/blob/multi-ftt/configs/config.3.o2m.o2m.yml) | -| 4 | m2o - bilingual | fiu-eng | et-en | [Config file](https://github.com/Helsinki-NLP/OpusDistillery/blob/multi-ftt/configs/config.4.m2o.o2o.yml) | -| 5 | m2o - m2o | fiu-eng | fiu-eng | [Config file](https://github.com/Helsinki-NLP/OpusDistillery/blob/multi-ftt/configs/config.5.m2o.m2o.yml) | -| 6 | m2m - bilingual | fiu-gmw | et-en | [Config file](https://github.com/Helsinki-NLP/OpusDistillery/blob/multi-ftt/configs/config.6.m2m.o2o.yml) | -| 7 | m2m - o2m | gmw-fiu | eng-fiu | [Config file](https://github.com/Helsinki-NLP/OpusDistillery/blob/multi-ftt/configs/config.7.m2m.o2m.yml) | -| 8 | m2m - m2o | fiu-gmw | fiu-eng | [Config file](https://github.com/Helsinki-NLP/OpusDistillery/blob/multi-ftt/configs/config.8.m2m.m2o.yml) | -| 9 | m2m - m2m | gmw-fiu | gmw-fiu | [Config file](https://github.com/Helsinki-NLP/OpusDistillery/blob/multi-ftt/configs/config.9.m2m.m2m.yml) | +| 1 | bilingual - bilingual | en-et | en-et | [Config file](https://github.com/Helsinki-NLP/OpusDistillery/blob/main/configs/config.1.o2o.o2o.yml) | +| 2 | o2m - bilingual | eng-fiu | en-et | [Config file](https://github.com/Helsinki-NLP/OpusDistillery/blob/main/configs/config.2.o2m.o2o.yml) | +| 3 | o2m - o2m | eng-fiu | eng-fiu | [Config file](https://github.com/Helsinki-NLP/OpusDistillery/blob/main/configs/config.3.o2m.o2m.yml) | +| 4 | m2o - bilingual | fiu-eng | et-en | [Config file](https://github.com/Helsinki-NLP/OpusDistillery/blob/main/configs/config.4.m2o.o2o.yml) | +| 5 | m2o - m2o | fiu-eng | fiu-eng | [Config file](https://github.com/Helsinki-NLP/OpusDistillery/blob/main/configs/config.5.m2o.m2o.yml) | +| 6 | m2m - bilingual | fiu-gmw | et-en | [Config file](https://github.com/Helsinki-NLP/OpusDistillery/blob/main/configs/config.6.m2m.o2o.yml) | +| 7 | m2m - o2m | gmw-fiu | eng-fiu | [Config file](https://github.com/Helsinki-NLP/OpusDistillery/blob/main/configs/config.7.m2m.o2m.yml) | +| 8 | m2m - m2o | fiu-gmw | fiu-eng | [Config file](https://github.com/Helsinki-NLP/OpusDistillery/blob/main/configs/config.8.m2m.m2o.yml) | +| 9 | m2m - m2m | gmw-fiu | gmw-fiu | [Config file](https://github.com/Helsinki-NLP/OpusDistillery/blob/main/configs/config.9.m2m.m2m.yml) | ## Data Processing |ID | Configuration | Teacher | Student | Example config | |---|-----------------------|---------|---------|---------------------------------------------| -| 10 | OpusFilter | fiu-eng | fiu-eng | [Config file](https://github.com/Helsinki-NLP/OpusDistillery/blob/multi-ftt/configs/opusfilter/config.fiu-eng.opusfilter.yml) | +| 10 | OpusFilter | fiu-eng | fiu-eng | [Config 
file](https://github.com/Helsinki-NLP/OpusDistillery/blob/main/configs/opusfilter/config.fiu-eng.opusfilter.yml) |
## Training
|ID | Configuration | Teacher | Student | Example config |
|---|-----------------------|---------|---------|---------------------------------------------|
-| 11 | OpusTrainer | fiu-eng | fiu-eng | [Config file](https://github.com/Helsinki-NLP/OpusDistillery/blob/multi-ftt/configs/opustrainer/config.fiu-eng.opustrainer.yml) |
+| 11 | OpusTrainer | fiu-eng | fiu-eng | [Config file](https://github.com/Helsinki-NLP/OpusDistillery/blob/main/configs/opustrainer/config.fiu-eng.opustrainer.yml) |
diff --git a/docs/index.rst b/docs/index.rst
index 443016930..4c2d1e480 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -8,15 +8,17 @@
It is built on top of the `Firefox Translations Training pipeline `_, for training efficient NMT models that can run locally in a web browser. The pipeline is capable of training a translation model for any language pair(s) end to end.
-Translation quality depends on the chosen datasets, data cleaning procedures and hyperparameters. Some settings, especially low resource languages might require extra tuning.
+Translation quality depends on the chosen datasets, data cleaning procedures and hyperparameters.
+Some settings, especially for low-resource languages, might require extra tuning.
-We use `Marian `_, the fast neural machine translation engine .
+We use `Marian `_, the fast neural machine translation engine.
New features:
* **OPUS-MT models**: We have added the option to simply provide the URL of an existing OPUS-MT model. Our tool is also able to select the best available OpusMT model per language pair.
-* **GPU Utilisation** With the hope of moving towards greener NLP and NMT, we have added GPU utilisation tracking so that we can report the amount of hours and energy consumed by the pipeline.
+* **Hugging Face models**: You can also automatically distill from an existing model on Hugging Face.
* **Multilinguality Support**: The pipeline supports training multilingual models. This covers two aspects: support for using any combination of multilingual and bilingual teachers, as well as support for multilingual student training.
+* **GPU Utilisation**: With the hope of moving towards greener NLP and NMT, we have added GPU utilisation tracking so that we can report the amount of hours and energy consumed by the pipeline.
.. toctree::
   :caption: Get started
diff --git a/docs/installation.md b/docs/installation.md
index 33e84fe45..e01b85f44 100644
--- a/docs/installation.md
+++ b/docs/installation.md
@@ -1,8 +1,72 @@
# Installation
-## Getting started on CSC's puhti and mahti
+This section describes how to set up the OpusDistillery pipeline locally, as well as on three of our supported clusters.
+
+## Locally
+
+### System Requirements
+
+- Ubuntu 18.04 (it can work on other Linux distributions, but might require fixes to the `setup` scripts; see more details in the [Marian installation instructions](https://marian-nmt.github.io/quickstart/)).
+- One or several Nvidia GPUs with CUDA drivers installed and at least 8 GB of memory.
+- CUDNN installed
+- At least 16 CPU cores (some steps of the pipeline utilize multiple cores well, so the more the better).
+- 64 GB RAM (128 GB+ might be required for bigger datasets)
+- 200+ GB of disk space (mostly for datasets and transformations). The exact amount depends on the chosen datasets and can be significantly higher.
+
+### Installation
+
+0. Clone the repo:
+```
+git clone https://github.com/Helsinki-NLP/OpusDistillery.git
+cd OpusDistillery
+```
+1. Choose a [Snakemake profile](https://github.com/Snakemake-Profiles) from `profiles/` or create a new one
+2. Adjust paths in the `Makefile` if needed and set the `PROFILE` variable to the name of your profile
+3. Adjust Snakemake and workflow settings in the `profiles//config.yaml`, see the [Snakemake CLI reference](https://snakemake.readthedocs.io/en/stable/executing/cli.html) for details
+4. Configure experiment and datasets in `configs/config.prod.yml` (or `configs/config.test.yml` for a test run)
+5. Change source code if needed for the experiment
+6. **(Cluster mode)** Adjust cluster settings in the cluster profile.
+   For `slurm-moz`: `profiles/slurm-moz/config.cluster.yml`
+   You can also modify `profiles/slurm-moz/submit.sh` or create a new Snakemake [profile](https://github.com/Snakemake-Profiles).
+7. **(Cluster mode)** It might require further tuning of requested resources in the `Snakemake` file:
+   - Use `threads` for a rule to adjust parallelism
+   - Use `resources: mem_mb=` to adjust total memory requirements per task
+     (default is set in `profile/slurm-moz/config.yaml`)
+8. Install Mamba, a fast Conda package manager:
+
+```
+make conda
+```
+
+9. Install Snakemake:
+
+```
+make snakemake
+```
+
+10. Update the git submodules:
+
+```
+make git-modules
+```
+
+You are all set!
+
+## On a Cluster
+
+### System Requirements
+
+- Slurm cluster with CPU and Nvidia GPU nodes
+- CUDA 11.2 (it was also tested on 11.5)
+- CUDNN library installed
+- Singularity module if running with containerization (recommended)
+- If running without containerization, there is no procedure to configure the environment automatically.
+  All the required modules (for example `parallel`) should be preinstalled and loaded in `~/.bashrc`
+
+## Installation on Puhti and Mahti
1. Clone the repository.
-2. Download the Ftt.sif container to the repository root (ask Ona)
+2. Download the Ftt.sif container to the repository root (ask [Ona](mailto:ona.degibert@helsinki.fi))
3. Create a virtual Python environment for Snakemake (e.g. in the parent dir of the repository):
    1. The environment needs to be created with a non-containerized python, as otherwise Apptainer integration will not work. On puhti and mahti, the python executables in /usr/bin/ should work: `/usr/bin/python3.9 -m venv snakemake_env`.
    2. Activate the virtual environment: `source ./snakemake_env/bin/activate`.
@@ -15,9 +79,9 @@
9. Load cuda modules: module load gcc/9.4.0 cuda cudnn
10. Run pipeline: `make run-hpc PROFILE="slurm-puhti"` or `make run PROFILE="slurm-mahti"`. More information in [Basic Usage](usage.md).
-## Getting started on CSC's lumi
+## Installation on LUMI
1. Clone the repository.
-2. Download the Ftt.sif container to the repository root (ask Ona)
+2. Download the Ftt.sif container to the repository root (ask [Ona](mailto:ona.degibert@helsinki.fi))
3. Create a virtual Python environment for Snakemake (e.g. in the parent dir of the repository):
    1. The environment needs to be created with a non-containerized python, as otherwise Apptainer integration will not work. On lumi, use the _cray-python_ module (it is not containerized): `module load cray-python; python -m venv snakemake_env`.
    2. Activate the virtual environment: `source ./snakemake_env/bin/activate`.
diff --git a/docs/pipeline.png b/docs/pipeline.png index d61c6f16f..6da983826 100644 Binary files a/docs/pipeline.png and b/docs/pipeline.png differ diff --git a/docs/pipeline/steps.md b/docs/pipeline/steps.md index 3ec3f5587..75ce63d6e 100644 --- a/docs/pipeline/steps.md +++ b/docs/pipeline/steps.md @@ -1,5 +1,16 @@ # Pipeline steps +Below is an overview of the pipeline steps: + +![Alt text](pipeline.png) + +The pipeline consists of five main steps: +* **Data Preprocessing**: Downloads data from publicly available repositories and handles basic data cleaning. +* **Synthetic Dataset Generation**: Downloads the relevant teacher and backward models, forward translates all source sentences with our teacher model(s) into the target languages, computes cross-entropy scores with a backward model, and then filters the synthetic dataset. +* **Student Training**: Trains a small transformer model on the filtered synthetic dataset with guided alignment. +* **Exporting**: Creates the final student. It includes a fine-tuning step, a quantization step and, finally, the export step which saves the model so it is ready for deployment. +* **Evaluation**: Evaluates the trained model. + The steps are based on [train-student](https://github.com/browsermt/students/tree/master/train-student) recipe. They can be represented as a Directly Acyclic Graph (DAG). @@ -9,14 +20,11 @@ Step | Description | Bottleneck | Comments Installation | Installing dependencies and compiling | CPU | Takes ~1 hour Data downloading | Downloads datasets, samples sentences | Network, Disk | Time depends on dataset size, sampling of huge mono datasets (100M+ sentences) is the most intensive operation. Data cleaning | Basic preprocessing, dataset specific, language specific, rule based and other attempts to clean noisy data in parallel and mono datasets | CPU | Good parallelization across CPU cores. To make cleaning of a new language more efficient add it to [clean_parallel.py](/pipeline/clean/tools/clean_parallel.py). -Bicleaner | Filters noisy sentence pairs in a parallel corpus using [bicleaner](https://github.com/bitextor/bicleaner) or [bicleaner-ai](https://github.com/bitextor/bicleaner-ai) depending on available language packs. | CPU, GPU | If there are no pretrained language packs for bicleaner-ai, it uses bicleaner. If there are no ones for bicleaner either, this step is skipped. Cleaning thresholds are configurable per dataset, see [Dataset cleaning](##Dataset cleaning). Merge and dedupe | Merges clean dataset and applies deduplicaiton | CPU, Disk | Training vocabulary | Trains [SentencePiece](https://github.com/google/sentencepiece) vocabulary/tokenizer model on parallel corpus. | CPU | -Training s2s | Trains a backward shallow s2s model, which is useful for back-translations and ce-filtering | GPU | Inspired by a [marian example](https://github.com/marian-nmt/marian-examples/tree/master/training-basics-sentencepiece). -Augmentation with back-translations | Translates mono corpus combined from monolingual datasets in target language using shallow s2s model. | GPU | It is more useful for low-resource languages and can be skipped for others. -Training teacher | Trains an ensemble of big transformer models on augmented dataset | GPU | You might want to adjust [early stopping](pipeline/train/configs/training/teacher.transformer.train.yml) or `after-epochs` parameters depending on datasets size. 
-Fine-tuning teacher | Continue training an ensemble of teachers on parallel data only | GPU | You might want to adjust [early stopping](pipeline/train/configs/training/teacher.transformer.train.yml) parameters depending on datasets size. -Translation by teacher | Translates a corpus and monolingual data combined from configurable `dataset.mono-src` using the ensemble of teacher models | GPU | The slowest part of the pipeline. Can take days. It is possible to speed it up by using multiple nodes in cluster mode. +Teacher download | Downloads teacher model | CPU | +Backward model download | Downloads backward model | CPU | +Translation by teacher | Translates a corpus using the teacher models | GPU | The slowest part of the pipeline. Can take days. It is possible to speed it up by using multiple nodes in cluster mode. Cross-entropy filtering | Scores translated corpus with backward s2s model and removes a part of the corpus with the lowest scores to reduce noise | GPU, CPU, Disk | At this point we work with huge datasets. Very disk intensive. Training alignments and shortlist | Trains alignments using [fast_align](https://github.com/clab/fast_align) and extracts lexical shortlist using [extract_lex](https://github.com/marian-nmt/extract-lex) tool | CPU, Disk | Some tools require uncompressed datasets on disk and they are huge at this point. Good CPU parallelization. Training student | Trains a small transformer student model on filtered data and using alignments. Shuffling in RAM might fail if dataset is huge and there's not enough RAM on the machine, so it's recommended to remove it and use `shuffle: batches` marian settings (see [issue](https://github.com/mozilla/firefox-translations-training/issues/21)). | GPU | @@ -24,3 +32,28 @@ Fine-tuning student | Finetunes the student model by emulating 8bit GEMM during Quantizaiton | Applies 8 bit quantization to the fined-tuned student model and runs evaluation on CPU | CPU | CPU threads must be set to 1 for this step. Evaluation | Calculates metrics for all models (BLEU, chrf) using [SacreBLEU](https://github.com/mjpost/sacrebleu) | GPU | Uses `datasets.test` configuration section. Export | Exports trained model and shortlist to (bergamot-translator)(https://github.com/mozilla/bergamot-translator) format | | + +## Configurable steps + +Summary of OpusDistillery main steps. For each step, we report the compute resource used (CPU or GPU), whether the step is optional, and whether it is configurable or hard-coded. 
+ +| **Main Step** | **Step** | **Resource** | **Optional** | **Configurable** | +| ----------------------------- | -------------------------- | ------------ | ------------ | ---------------- | +| **Data Processing** | | | | | +| | Data Download | CPU | ✗ | ✓ | +| | Data Cleaning | CPU | ✗ | ✓ | +| **Synthetic Dataset Generation**| | | | | +| | Teacher Model Download | CPU | ✗ | ✓ | +| | Forward Translation | GPU | ✗ | ✗ | +| | Backward Model Download | CPU | ✓ | ✓ | +| | Cross-Entropy Scoring | GPU | ✓ | ✗ | +| | Cross-Entropy Filtering | CPU | ✓ | ✓ | +| **Student Training** | | | | | +| | Alignment Training | CPU | ✓ | ✗ | +| | Vocabulary Training | CPU | ✗ | ✓ | +| | Student Training | GPU | ✗ | ✓ | +| **Exporting** | | | | | +| | Fine-tuning | GPU | ✓ | ✓ | +| | Quantization | CPU | ✓ | ✗ | +| | Export | - | ✓ | ✗ | +| **Evaluation** | Evaluation | GPU | ✓ | ✗ | \ No newline at end of file diff --git a/docs/quickstart.md b/docs/quickstart.md index f0c9fee3d..cb87244fa 100644 --- a/docs/quickstart.md +++ b/docs/quickstart.md @@ -1,61 +1,59 @@ # QuickStart Tutorial -This is a quickstart tutorial to run the OpusDistillery pipeline from scratch in your local machine for learning purposes. -On this example, we will use OPUS-MT models for sequence-level distillation from a multilingual teacher into a multilingual student. +This is a quickstart tutorial to run the OpusDistillery pipeline from scratch on your local machine for learning purposes. +In this example, we will use OPUS-MT models for sequence-level distillation from a multilingual teacher into a multilingual student. ## Pipeline Overview -Next, you can see an overview of the pipeline steps: +Below is an overview of the pipeline steps: ![Alt text](pipeline.png) -It mainly has four steps: -* **Data Preprocessing**: downloads data from publicly available repositories and takes care of basic data cleaning. -* **Synthetic Dataset Generation**: downloads the relevant teacher and backward models, forward translates all source sentences with our teacher model(s) into our target languages, computes cross-entropy scores with a backward model and then use them for filtering the synthetic dataset. -* **Student Training**: trains a small transformer model on the filtered synthetic dataset with guided alignment. -* **Evaluation**: evaluates the trained model. +The pipeline consists of five main steps: +* **Data Preprocessing**: Downloads data from publicly available repositories and handles basic data cleaning. +* **Synthetic Dataset Generation**: Downloads the relevant teacher and backward models, forward translates all source sentences with our teacher model(s) into the target languages, computes cross-entropy scores with a backward model, and then filters the synthetic dataset. +* **Student Training**: Trains a small transformer model on the filtered synthetic dataset with guided alignment. +* **Exporting**: Creates the final student. It includes a fine-tuning step, a quantization step and, finally, the export step which saves the model so it is ready for deployment. +* **Evaluation**: Evaluates the trained model. -For a more detailed description of the pipeline, check the [Pipeline Steps](pipeline/steps.md) section. +For a more detailed description of the pipeline, refer to the [Pipeline Steps](pipeline/steps.md) section. ## Pipeline Setup -For this tutorial, we will be running the pipeline locally. +In this tutorial, we will be running the pipeline locally. -1. Clone the repository and checkout to the multilingual branch `multi-ftt` +1. 
Clone the repository: ```bash git clone https://github.com/Helsinki-NLP/OpusDistillery.git - git checkout multi-ftt ``` -2. Install Mamba - fast Conda package manager +2. Install Mamba, a fast Conda package manager: ``` make conda ``` -3. Install Snakemake +3. Install Snakemake: ``` make snakemake ``` -4. Update git submodules +4. Update the git submodules: ``` make git-modules ``` - micromamba activate /home/degibert/Documents/0_Work/mambaforge - mamba activate snakemake - -5. Edit the local profile from [profiles/local/config.yaml](../profiles/local/config.yaml)' and enter the data directory path as the root value of the config section. This is the folder where all the outputs of the pipeline will be stored. +5. Edit the local profile in [profiles/local/config.yaml](../profiles/local/config.yaml) and specify the data directory path as the root value in the config section. +This folder will store all pipeline outputs: ``` root=/home/degibert/Documents/0_Work/OpusDistillery/data ``` -6. Make sure that everything is installed properly +6. Ensure everything is installed properly: ``` source ../mambaforge/etc/profile.d/conda.sh ; conda activate ; conda activate snakemake @@ -65,9 +63,9 @@ For this tutorial, we will be running the pipeline locally. ## Experiment Setup -Let's define a simple configuration file in YAML format. We will be using the [configs/config.quickstart.yml](../configs/config.quickstart.yml). +Let’s define a simple configuration file in YAML format. We will use [configs/config.quickstart.yml](../configs/config.quickstart.yml). -1. We define the directory structure (`data-dir/test/fiu-eng`) and specify the language pairs of the student model we want to distill. +1. We define the directory structure (`data-dir/test/fiu-eng`) and specify the language pairs of the student model we want to distill: ```yaml @@ -90,19 +88,19 @@ Let's define a simple configuration file in YAML format. We will be using the [c opusmt-backward: "https://object.pouta.csc.fi/Tatoeba-MT-models/eng-fiu/opus2m-2020-08-01.zip" ``` - The backward model is multilingual at the target side, it has multiple target languages, so we need to specify it: + Since the backward model is multilingual on the target side, so we need to specify it: ```yaml one2many-backward: True ``` -3. We define the metric to select our best model. +3. Define the metric to select the best model: ```yaml best-model: perplexity ``` -4. We define the maximum lines for splitting our files for forward translation. +4. Define the maximum number of lines for splitting files during forward translation: ```yaml split-length: 1000 @@ -110,21 +108,21 @@ Let's define a simple configuration file in YAML format. We will be using the [c ## Running the pipeline -To run the pipeline, run: +To run the pipeline, execute: ```bash make run CONFIG="configs/config.quickstart.yml" PROFILE="local" ``` -You can also create a directed acyclic graph to represent the steps the pipeline will take. +You can also create a directed acyclic graph (DAG) to represent the steps the pipeline will take: ```bash make dag CONFIG="configs/config.quickstart.yml" PROFILE="local" ``` -This will create a pdf in the root directory, named DAG.pdf, with the steps for this specific run. +This will generate the file `DAG.pdf` in the root directory, showing the steps for this specific run. -By default, all Snakemake rules are executed. To run the pipeline up to a specific rule use: +By default, all Snakemake rules are executed. 
To run the pipeline up to a specific rule, use: ```bash make run CONFIG="configs/config.quickstart.yml" PROFILE="local" TARGET="/home/degibert/Documents/0_Work/OpusDistillery/data/data/test/fiu-eng/original/et-en/devset.source.gz" diff --git a/docs/usage.md b/docs/usage.md index 30d76bbb9..c64c9220d 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -1,17 +1,14 @@ -# Basic usage +# Basic Usage -The pipeline is built with [Snakemake](https://snakemake.readthedocs.io/en/stable/). +The pipeline is built using [Snakemake](https://snakemake.readthedocs.io/en/stable/). -Snakemake workflow manager infers the DAG of tasks implicitly from the specified inputs and outputs of the steps. The workflow manager checks which files are missing and runs the corresponding jobs either locally or on a cluster depending on the configuration. +Snakemake is a workflow management system that implicitly constructs a Directed Acyclic Graph (DAG) of tasks based on the input and output files specified in each step. It determines which files are missing and executes the corresponding jobs, either locally or on a cluster, depending on the configuration. Snakemake can also parallelize steps that can be run concurrently. -Snakemake parallelizes steps that can be executed simultaneously. +The main Snakemake process (scheduler) should be launched interactively. It manages the job execution either on worker nodes in a cluster (cluster mode) or on a local machine (local mode). -The main Snakemake process (scheduler) should be launched interactively. It runs the job processes on the worker nodes in cluster mode or on a local machine in local mode. +## Configuration Examples -## Configuration examples - -The pipeline is run with the [Makefile](https://github.com/Helsinki-NLP/OpusDistillery/blob/multi-ftt/Makefile) which takes a configuration file as an input. -Configuration files are in [YAML](https://yaml.org/) format. Although we report details of the configuration files in [Setting up your experiment](configs/downloading_and_selecting_data.md), a configuration file that trains a student model (Estonian, Finnish and Hungarian into English) looks like this: +The pipeline is executed using the provided [Makefile](https://github.com/Helsinki-NLP/OpusDistillery/blob/main/Makefile), which takes a configuration file as input. Configuration files are written in [YAML](https://yaml.org/) format. You can find more details on configuration in the [Setting up your experiment](configs/downloading_and_selecting_data.md) section. Below is an example configuration file that trains a student model for Estonian, Finnish, and Hungarian into English: ```yaml @@ -46,82 +43,76 @@ datasets: ## Running -### On LUMI - -On LUMI, the pipeline is run from the login node from your local copy of the root repository. - -Start a tmux session: `tmux` -You can read more about [tmux](https://github.com/tmux/tmux/wiki) here. - -Load LUMI specific modules: - -```bash -module load CrayEnv -module load PrgEnv-cray/8.3.3 -module load craype-accel-amd-gfx90a -module load cray-python -module load rocm/5.3.3 -export SINGULARITYENV_LD_LIBRARY_PATH=$LD_LIBRARY_PATH -``` - -Activate snakemake environment: - -```bash -source ../snakemake_env/bin/activate -``` - -Now, you can move on and continue to the next section. 
- -### Usual run - -Dry run first to check that everything was installed correctly: +To check that everything is installed correctly, run a dry run first: ``` make dry-run ``` -To run the pipeline: -``` -make run -``` - -To test the whole pipeline end to end (it is supposed to run relatively quickly and does not train anything useful): +To execute the full pipeline, specify a specific profile and configuration file: -``` -make test -``` -You can also run a specific profile or config by overriding variables from Makefile ``` make run PROFILE=slurm-puhti CONFIG=configs/config.test.yml ``` ### Specific target -By default, all Snakemake rules are executed. To run the pipeline up to a specific rule use: +By default, all Snakemake rules are executed. To run the pipeline up to a specific rule, use: + ``` make run TARGET= ``` -For example, collect corpus first: + +For example, to collect the corpus first: + ``` make run TARGET=merge_corpus ``` -You can also use the full file path, for example: +You can also specify the full file path, such as: + ``` make run TARGET=/models/ru-en/bicleaner/teacher-base0/model.npz.best-ce-mean-words.npz ``` ### Rerunning -If you want to rerun a specific step or steps, you can delete the result files that are expected in the Snakemake rule output. -Snakemake might complain about a missing file and suggest to run it with `--clean-metadata` flag. In this case run: +If you need to rerun a specific step, delete the output files expected in the Snakemake rule. +If Snakemake reports a missing file and suggests running with the `--clean-metadata` flag, do the following: + ``` make clean-meta TARGET= ``` and then as usual: + ``` -make run +make run PROFILE= CONFIG= ``` ### Canceling -Be aware that if you cancel a pipeline that is currently running on a cluster, you also need to cancel the related SLURM jobs, as these won't be canceled automatically. You also need to delete the result files that you want to overwrite. \ No newline at end of file +If you need to cancel a running pipeline on a cluster, remember to also cancel the associated SLURM jobs, as these will not be canceled automatically. +Additionally, delete any resulting files that you want to overwrite. + +### On LUMI + +To run the pipeline on LUMI, start from the login node using your local copy of the root repository. + +First, start a tmux session. You can read more about [tmux](https://github.com/tmux/tmux/wiki) here. + +Load the LUMI-specific modules: + +```bash +module load CrayEnv +module load PrgEnv-cray/8.3.3 +module load craype-accel-amd-gfx90a +module load cray-python +module load rocm/5.3.3 +export SINGULARITYENV_LD_LIBRARY_PATH=$LD_LIBRARY_PATH +``` + +Activate the Snakemake environment: + +```bash +source ../snakemake_env/bin/activate +``` +You can now proceed as explained above. \ No newline at end of file
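+For example, a run on LUMI then looks the same as on the other clusters, just with the corresponding profile (the profile name below is only a placeholder for whichever LUMI profile you have under `profiles/`):
+
+```bash
+# "slurm-lumi" is a placeholder profile name; use the LUMI profile defined in your repository
+make run PROFILE=slurm-lumi CONFIG=configs/config.quickstart.yml
+```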