Update README.md
eu9ene authored Oct 28, 2021
1 parent ef8928b commit a09b0ac
Showing 1 changed file with 16 additions and 14 deletions.
30 changes: 16 additions & 14 deletions README.md
@@ -67,9 +67,9 @@ cd firefox-translations-training
For Slurm: `profiles/slurm/config.yml` and `profiles/slurm/config.cluster.yml`
You can also modify `profiles/slurm/submit.sh` or create a new Snakemake [profile](https://github.com/Snakemake-Profiles).
5. (Cluster mode) The requested resources might need further tuning in the `Snakefile`:
- Use `threads` for a rule to adjust parallelism
- Use `resources: mem_mb=<memory>` to adjust total memory requirements per task
(default is set in `profile/slurm/config.yaml`)
- Use `threads` for a rule to adjust parallelism
- Use `resources: mem_mb=<memory>` to adjust total memory requirements per task
(default is set in `profile/slurm/config.yaml`)

## Installation

@@ -196,7 +196,7 @@ See `Snakefile` file for directory structure documentation.

The main directories inside `SHARED_ROOT` are:
- `data/<lang_pair>/<experiment>` - data produced by the pipeline jobs
- `logs/<lang_pair>/<experiment>` - logs of pipeline jobs for troubleshooting
- `logs/<lang_pair>/<experiment>` - logs of the jobs for troubleshooting
- `experiments/<lang_pair>/<experiment>` - saved experiment settings for future reference
- `models/<lang_pair>/<experiment>` - all models produced by the pipeline. The final compressed models are in `exported` folder.
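The layout above can be sketched as a small path helper. This is only an illustration: the function name `experiment_dirs` and the example values are hypothetical, and placing `exported` directly under the experiment's `models` directory is an assumption based on the description, not something confirmed by the pipeline code.

```python
from pathlib import PurePosixPath
from typing import Dict


def experiment_dirs(shared_root: str, lang_pair: str, experiment: str) -> Dict[str, PurePosixPath]:
    """Build the main per-experiment directories under SHARED_ROOT (hypothetical helper)."""
    root = PurePosixPath(shared_root)
    # Each top-level directory follows the <name>/<lang_pair>/<experiment> pattern.
    dirs = {
        name: root / name / lang_pair / experiment
        for name in ("data", "logs", "experiments", "models")
    }
    # Assumption: the `exported` folder sits directly inside the models directory.
    dirs["exported"] = dirs["models"] / "exported"
    return dirs


if __name__ == "__main__":
    for name, path in experiment_dirs("/shared", "ru-en", "baseline").items():
        print(f"{name}: {path}")
```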

@@ -224,12 +224,13 @@ Export | Exports trained model and shortlist to (bergamot-translator)(https://gi

## Datasets importers

Dataset importers can be used in `TRAIN_DATASETS, DEVTEST_DATASETS, MONO_DATASETS_SRC, MONO_DATASETS_TRG` config settings.
Dataset importers can be used in the `datasets` section of the experiment config.

Example:
```
TRAIN_DATASETS="opus_OPUS-ParaCrawl/v7.1 mtdata_newstest2019_ruen"
TEST_DATASETS="sacrebleu_wmt20 sacrebleu_wmt18"
train:
- opus_ada83/v1
- mtdata_newstest2014_ruen
```
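The dataset identifiers in the example combine an importer prefix and a dataset name. A minimal sketch of how such an identifier could be split, assuming the prefix is everything before the first underscore (the function names `split_dataset` and `group_by_importer` are hypothetical and not part of the pipeline):

```python
from typing import Dict, List, Tuple


def split_dataset(dataset: str) -> Tuple[str, str]:
    """Split '<importer>_<name>' on the first underscore (assumed convention)."""
    importer, _, name = dataset.partition("_")
    if not importer or not name:
        raise ValueError(f"expected '<importer>_<name>', got {dataset!r}")
    return importer, name


def group_by_importer(datasets: List[str]) -> Dict[str, List[str]]:
    """Group dataset names by their importer prefix, preserving order."""
    grouped: Dict[str, List[str]] = {}
    for dataset in datasets:
        importer, name = split_dataset(dataset)
        grouped.setdefault(importer, []).append(name)
    return grouped


if __name__ == "__main__":
    print(group_by_importer(["opus_ada83/v1", "mtdata_newstest2014_ruen"]))
```

Only the first underscore is treated as a separator, so names that contain underscores themselves, such as `newstest2014_ruen`, stay intact.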

Data source | Prefix | Name examples | Type | Comments
@@ -259,14 +260,15 @@ and accepts the same parameters as the other scripts from the same folder.

### Architecture

All steps are independent and contain scripts that accept input arguments, read input files from disk and output the results on disk.
It allows to write the steps in any language (currently it's historically mostly bash and Python) and
represent the pipeline as a DAG to be compatible with workflow managers.
All steps are independent and contain scripts that accept arguments, read input files from disk and write the results to disk.
This allows writing the steps in any language (currently, for historical reasons, mostly bash and Python) and
representing the pipeline as a directed acyclic graph (DAG).

The main script `run.sh` can be easily replaced with a DAG definition in workflow manager terms.
A workflow manager will provide easy resource management, parallelization, monitoring and scheduling, which will allow the horizontal scalability required to train a massive number of languages.
The Snakemake workflow manager infers the DAG implicitly from the specified inputs and outputs of the steps. It checks which files are missing and runs the corresponding jobs either locally or on a cluster, depending on the configuration.

At the same time it is possible to run it all locally end to end or to do interactive experimentation running specific scripts manually.
Snakemake parallelizes steps that can be executed simultaneously. This is especially useful for teacher ensemble training and translation.

The main Snakemake process (scheduler) should be launched interactively. It runs job processes on the worker nodes in cluster mode or on the local machine in local mode.
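The idea of inferring a DAG from inputs and outputs and running only the jobs whose files are missing can be illustrated with a simplified sketch. This is not Snakemake's implementation: the rule names, file names and the `plan_jobs` function are hypothetical, and details such as file timestamps, wildcards and parallel scheduling are ignored.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical rules: rule name -> (input files, output files).
RULES: Dict[str, Tuple[List[str], List[str]]] = {
    "download": ([], ["corpus.txt"]),
    "clean": (["corpus.txt"], ["clean.txt"]),
    "train": (["clean.txt"], ["model.bin"]),
}


def plan_jobs(
    rules: Dict[str, Tuple[List[str], List[str]]],
    targets: List[str],
    existing: Set[str],
) -> List[str]:
    """Return the rules to run, dependencies first, to produce the missing targets."""
    # Map every output file to the rule that produces it; this defines the DAG edges.
    producer = {out: name for name, (_, outs) in rules.items() for out in outs}
    order: List[str] = []
    visited: Set[str] = set()

    def visit(path: str) -> None:
        if path in existing:
            return  # The file is already on disk, so nothing needs to run for it.
        if path not in producer:
            raise FileNotFoundError(f"no rule produces missing file {path!r}")
        rule = producer[path]
        if rule in visited:
            return
        visited.add(rule)
        for inp in rules[rule][0]:
            visit(inp)  # Schedule the producers of missing inputs first.
        order.append(rule)

    for target in targets:
        visit(target)
    return order


if __name__ == "__main__":
    print(plan_jobs(RULES, ["model.bin"], existing={"corpus.txt"}))
```

In this sketch, a file that already exists cuts off the whole branch that would produce it, which is why re-running a partially completed pipeline only schedules the remaining steps.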

### Conventions

@@ -313,4 +315,4 @@ Brussels, Belgium: Association for Computational Linguistics, October 2018
in *Proceedings of the 22nd Annual Conference of the European Association for Machine Translation*.
Lisboa, Portugal: European Association for Machine Translation, November 2020

3. Mölder, F., Jablonski, K.P., Letcher, B., Hall, M.B., Tomkins-Tinch, C.H., Sochat, V., Forster, J., Lee, S., Twardziok, S.O., Kanitz, A., Wilm, A., Holtgrewe, M., Rahmann, S., Nahnsen, S., Köster, J., 2021. Sustainable data analysis with Snakemake. F1000Res 10, 33.
3. Mölder, F., Jablonski, K.P., Letcher, B., Hall, M.B., Tomkins-Tinch, C.H., Sochat, V., Forster, J., Lee, S., Twardziok, S.O., Kanitz, A., Wilm, A., Holtgrewe, M., Rahmann, S., Nahnsen, S., Köster, J., 2021. Sustainable data analysis with Snakemake. F1000Res 10, 33.
