added docs
onadegibert committed Sep 17, 2024
1 parent 05d4546 commit 17fbeb6
Showing 10 changed files with 298 additions and 166 deletions.
8 changes: 3 additions & 5 deletions configs/config.quickstart.yml
@@ -19,6 +19,9 @@ experiment:

  best-model: perplexity

  opusfilter:
    config: default # Otherwise, specify path to opusfilter configuration 'configs/opusfilter/config.opusfilter.yaml'

marian-args:
  training-student:
    disp-freq: 10
@@ -30,11 +33,6 @@ marian-args:
    save-freq: 100
    valid-freq: 100
    after: 500u
  decoding-backward:
    mini-batch-words: 2000
  decoding-teacher:
    mini-batch-words: 1000
    precision: float16

datasets:
  train:
140 changes: 94 additions & 46 deletions docs/configs/configuration_files.md
@@ -1,19 +1,20 @@
# Configuration Files

The configuration files for OpusDistillery are written in [YAML](https://yaml.org/) format and are divided into two main sections:

- **`experiment`**: Contains the general setup and parameters for the experiment, excluding dataset information.
- **`datasets`**: Specifies the datasets used for training, development, and evaluation. Details about datasets can be found in [Dataset Importers](downloading_and_selecting_data.md).

### Experiment Setup

In the `experiment` section, the following key parameters must be defined:

- **`dirname`**: The directory where all experiment outputs will be stored.
- **`name`**: The name of the experiment. All generated data and models will be saved under `dirname`/`name`.
- **`langpairs`**: A list of language pairs for the student model, using ISO two-letter language codes.

Example configuration:
```yaml
experiment:
  dirname: test
  name: fiu-eng
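  langpairs:   # hypothetical continuation of the folded example: ISO two-letter codes
    - et-en
    - fi-en
    - hu-en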
@@ -27,42 +28,45 @@ experiment:
### OpusFilter
OpusDistillery supports [OpusFilter](https://github.com/Helsinki-NLP/OpusFilter), a tool for filtering and combining parallel corpora. Instead of the default cleaning, you can choose to filter data using OpusFilter with either a default configuration or a custom configuration that you provide.
In the configuration file, if you want to use a default configuration, see this [example](https://github.com/Helsinki-NLP/OpusDistillery/blob/multi-ftt/configs/pipeline/clean/run-opusfilter.py#13).
Otherwise, you can specify the path to a custom OpusFilter configuration file such as [this one](https://github.com/Helsinki-NLP/OpusDistillery/blob/multi-ftt/configs/configs/opusfilter/config.opusfilter.yml).
```yaml
opusfilter:
  config: default # Or specify the path to an OpusFilter configuration file
```
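For example, a custom configuration might be referenced like this (a sketch; the path is illustrative):

```yaml
opusfilter:
  config: configs/opusfilter/config.opusfilter.yml  # path to your own OpusFilter configuration
```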
### Bicleaner AI
Currently, Bicleaner AI is not operational. Do you want to collaborate? Feel free to work on this [issue](https://github.com/Helsinki-NLP/OpusDistillery/issues/3).
## Teacher models
You can select a teacher model from OPUS-MT or Hugging Face.
### OPUS-MT Teachers
To specify an OPUS-MT teacher, use:
* `opusmt-teacher`

It can be one of the following:

1. A URL to an OPUS-MT model:

```yaml
opusmt-teacher: "https://object.pouta.csc.fi/Tatoeba-MT-models/fiu-eng/opus4m-2020-08-12.zip"
```

2. A path to a local OPUS-MT model:

```yaml
opusmt-teacher: "/path/to/opus-mt/model"
```

3. A list of OPUS-MT models (any combination of URLs and paths):

```yaml
opusmt-teacher:
@@ -71,7 +75,7 @@ This can be either of the following:
```


4. For multilingual students, specify different teachers for each language pair:

```yaml
opusmt-teacher:
@@ -80,34 +84,76 @@ This can be either of the following:
  en-be: "https://object.pouta.csc.fi/Tatoeba-MT-models/eng-bel/opus+bt-2021-03-07.zip"
```

5. Use the `best` option to automatically select the best teacher for each language pair, based on FLORES200+ scores from the [OPUS-MT dashboard](https://opus.nlpl.eu/dashboard/).

```yaml
opusmt-teacher: "best"
```

### Hugging Face Teachers

You can also use a [Hugging Face](https://huggingface.co/) model as a teacher, defined under the `huggingface` key with the following parameters:

* `modelname`: The model identifier from the Hugging Face hub.
* `modelclass`: The class of the model being loaded.

```yaml
huggingface:
  modelname: "Helsinki-NLP/opus-mt-mul-en"
  modelclass: "transformers.AutoModelForSeq2SeqLM"
```

You can also configure the decoding options:

```yaml
huggingface:
  modelname: "HPLT/translate-et-en-v1.0-hplt_opus"
  modelclass: "transformers.AutoModelForSeq2SeqLM"
  config:
    top_k: 50
    top_p: 0.90
    temperature: 0.1
    max_new_tokens: 128
```

For models that use language tags, additional parameters are required:

* `lang_info`: Set to `True` if language tags are needed.
* `lang_tags`: A mapping of language codes to the tags used by the model.

```yaml
huggingface:
  modelname: "facebook/nllb-200-distilled-600M"
  modelclass: "transformers.AutoModelForSeq2SeqLM"
  lang_info: True
  lang_tags:
    en: eng_Latn
    et: est_Latn
```

Finally, for models requiring a prompt, you can define it like this:

```yaml
huggingface:
model: "facebook/nllb-200-distilled-600M"
task: translation #if not in config, assumes "translation by default"
modelname: "google-t5/t5-small"
modelclass: "transformers.AutoModelForSeq2SeqLM"
lang_tags:
en: English
de: German
prompt: "Translate {src_lang} to {tgt_lang}: {source}"
```

In this case, the `lang_tags` mapping will be used in the prompt.

Note: When using a Hugging Face model as a teacher, there is no scoring or cross-entropy filtering.

## Backward models

Currently, only OPUS-MT models are available as backward models for scoring translations.

To specify a backward model, use:

* `opusmt-backward`: The URL or path to an OPUS-MT model. Like the teacher models, this can also be a dictionary for multilingual students or `best`.

```yaml
opusmt-backward:
@@ -119,27 +165,27 @@ It is defined by:
If left empty, the cross-entropy filtering step will be skipped.
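For example, a single backward model shared across all language pairs can be given directly (a sketch mirroring the teacher examples above; the path is illustrative):

```yaml
opusmt-backward: "/path/to/opus-mt/backward-model"
```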

## Multilinguality
Specify whether the teacher, backward, and student models are many-to-one to properly handle language tags. By default, this is set to `False`.

* `one2many-teacher`: `True` or `False` (default). If `opusmt-teacher` is set to `best`, this should also be `best`.
* `one2many-backward`: `True` or `False` (default). If `opusmt-backward` is set to `best`, this should also be `best`.
* `one2many-student`: `True` or `False` (default).

```yaml
# Specify if the teacher and the student are one2many
one2many-teacher: True
one2many-student: True
```
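When the teacher and backward models are selected with `best`, the multilinguality flags mirror that value (a sketch based on the rules above):

```yaml
one2many-teacher: "best"
one2many-backward: "best"
```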
## Training

### Marian arguments
You can override default pipeline settings with [Marian-specific settings](https://marian-nmt.github.io/docs/cmd/marian/).

You can use the following options: `training-teacher`, `decoding-teacher`, `training-backward`, `decoding-backward`, `training-student`, `training-student-finetuned`.

```yaml
marian-args:
  # These configs override pipeline/train/configs
  training-student:
    dec-depth: 3
    enc-depth: 3
@@ -160,11 +206,12 @@ The options are: `training-teacher`, `decoding-teacher`,`training-backward`, `decoding-backward`,`training-student`, `training-student-finetuned`

### OpusTrainer

OpusDistillery supports [OpusTrainer](https://github.com/hplt-project/OpusTrainer) for curriculum training and data augmentation.

You can specify a path to the OpusTrainer configuration, such as in [this example](https://github.com/Helsinki-NLP/OpusDistillery/blob/multi-ftt/configs/opustrainer/config.fiu-eng.opustrainer.yml#L37).
This assumes you know the final paths of the data, as defined in [this file](https://github.com/Helsinki-NLP/OpusDistillery/blob/multi-ftt/configs/opustrainer/config.fiu-eng.opustrainer.stages.yml).

Currently, this is implemented only for student training.

```yaml
opustrainer:
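  config: configs/opustrainer/config.fiu-eng.opustrainer.yml  # hypothetical continuation of the folded example: path to the OpusTrainer configuration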
@@ -174,16 +221,17 @@ At the moment, this is only implemented for student training.

## Exporting

The final student model is exported in the Bergamot format, which uses shortlists trained on alignments. For this reason, there is also an option to train a student with the tiny architecture but without guided alignment; to do so, disable exporting by setting `export` in the configuration file:

```yaml
export: "no"
```

### Other

* `parallel-max-sentences`: Maximum parallel sentences to download from each dataset.
* `split-length`: The number of sentences into which you want to split your training data for forward translation.
* `best-model`: Metric used to select the best model.
* `spm-sample-size`: Sample size for training the student’s SPM vocabulary.
* `spm-vocab-size`: Vocabulary size for training the student’s SPM vocabulary.
* `student-prefix`: To train multiple students with exactly the same data, add a prefix to the student name; this allows several students to be trained under the same directory structure. More details on the directory structure can be found [here](../pipeline/dir_structure.md).
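Put together, these options might look like this (a sketch; all values are illustrative and assumed to sit in the `experiment` section, as `best-model: perplexity` does in the quickstart config):

```yaml
experiment:
  parallel-max-sentences: 10000000
  split-length: 2000000
  best-model: perplexity
  spm-sample-size: 1000000
  spm-vocab-size: 32000
  student-prefix: "baseline-"
```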
7 changes: 3 additions & 4 deletions docs/configs/downloading_and_selecting_data.md
@@ -38,9 +38,8 @@ Make sure to check licenses of the datasets before using them.

## Adding a new importer

Just add a shell script named `<prefix>.sh` to [corpus](https://github.com/Helsinki-NLP/OpusDistillery/tree/main/pipeline/data/importers/corpus) or [mono](https://github.com/Helsinki-NLP/OpusDistillery/tree/main/pipeline/data/importers/mono) that accepts the same parameters as the other scripts in the same folder.

## Issues
* Currently, it is not possible to download specific datasets per language pair; the tool downloads the same dataset for all language pairs. If a dataset doesn't exist for a given language pair, dummy files are created. Do you want to collaborate? Feel free to work on this [issue](https://github.com/Helsinki-NLP/OpusDistillery/issues/1).
* There is currently no support for downloading monolingual datasets. The use of monolingual data is not fully implemented; only bilingual data is supported at this time. Do you want to collaborate? Feel free to work on this [issue](https://github.com/Helsinki-NLP/OpusDistillery/issues/2).