adding more info to docs
onadegibert committed May 15, 2024
1 parent d359e76 commit 3ffee12
Showing 3 changed files with 63 additions and 7 deletions.
9 changes: 5 additions & 4 deletions docs/index.rst
@@ -4,13 +4,13 @@ OpusDistillery
Welcome to OpusDistillery's documentation!

OpusDistillery is an end-to-end pipeline to perform systematic multilingual distillation of MT models.
It is built on top of the `Firefox Translations Training pipeline <https://github.com/mozilla/firefox-translations-training>`,
originally developed within the `Bergamot project<https://browser.mt>`, for training efficient NMT models that can run locally in a web browser.
It is built on top of the `Firefox Translations Training pipeline <https://github.com/mozilla/firefox-translations-training>`_,
originally developed within the `Bergamot project <https://browser.mt>`_, for training efficient NMT models that can run locally in a web browser.

The pipeline is capable of training a translation model for any language pair(s) end to end.
Translation quality depends on the chosen datasets, data cleaning procedures and hyperparameters. Some settings, especially for low-resource languages, might require extra tuning.

We use `Marian<https://marian-nmt.github.io/>`, the fast neural machine translation engine .
We use `Marian <https://marian-nmt.github.io/>`_, the fast neural machine translation engine.

New features:

@@ -22,4 +22,5 @@ New features:
:caption: Get started
:maxdepth: 1

installation.md
installation.md
usage.md
8 changes: 5 additions & 3 deletions docs/installation.md
@@ -1,4 +1,6 @@
# Getting started on CSC's puhti and mahti
# Installation

## Getting started on CSC's puhti and mahti
1. Clone the repository.
2. Download the Ftt.sif container to the repository root.
3. Create a virtual Python environment for Snakemake (e.g. in the parent dir of the repository):
@@ -11,9 +13,9 @@
7. If the data directory is not located in the parent directory of the repository, edit _profiles/slurm-puhti/config.yaml_ or _profiles/slurm-mahti/config.yaml_ and change the bindings in the singularity-args section to point to your data directory, and also enter the _data_ directory path as the _root_ value of the _config_ section.
8. Edit profiles/slurm-puhti/config.cluster.yaml to change the CSC account to one you have access to.
9. Load cuda modules: module load gcc/9.4.0 cuda cudnn
10. Run pipeline: `make run-hpc PROFILE="slurm-puhti"` or `make run PROFILE="slurm-mahti"`
10. Run pipeline: `make run-hpc PROFILE="slurm-puhti"` or `make run PROFILE="slurm-mahti"`. See [Basic usage](usage.md) for more information.

# Getting started on CSC's lumi
## Getting started on CSC's lumi
1. Clone the repository.
2. Download the Ftt.sif container to the repository root.
3. Create a virtual Python environment for Snakemake (e.g. in the parent dir of the repository):
53 changes: 53 additions & 0 deletions docs/usage.md
@@ -0,0 +1,53 @@
# Basic usage

## Running

Load all the necessary modules as explained in [Installation](installation.md).

Do a dry run first to check that everything was installed correctly:

```
make dry-run
```

To run the pipeline:
```
make run
```

To test the whole pipeline end to end (it is supposed to run relatively quickly and does not train anything useful):

```
make test
```
You can also run a specific profile or config by overriding variables from the Makefile:
```
make run PROFILE=slurm-moz CONFIG=configs/config.test.yml
```
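
The dry run and the real run can be chained so that the pipeline only starts if the dry run succeeds. The following is a minimal sketch, not part of the repository: `checked_run` and the `MAKE_CMD` override are hypothetical names introduced here, and the sketch assumes that `make dry-run` accepts the same `PROFILE` and `CONFIG` overrides as `make run`.

```shell
# Sketch only: run a dry run first, then the real run with the same overrides.
# MAKE_CMD is a hypothetical override (not defined by the pipeline); set
# MAKE_CMD="echo make" to print the commands instead of executing them.
checked_run() {
  make_cmd="${MAKE_CMD:-make}"
  profile="$1"   # e.g. slurm-moz
  config="$2"    # e.g. configs/config.test.yml

  # Assumption: dry-run honours PROFILE and CONFIG like run does.
  $make_cmd dry-run PROFILE="$profile" CONFIG="$config" || return 1

  # Only reached when the dry run exited successfully.
  $make_cmd run PROFILE="$profile" CONFIG="$config"
}
```

Calling `checked_run slurm-moz configs/config.test.yml` would then stop before submitting anything if the dry run reports a problem.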

### Specific target

By default, all Snakemake rules are executed. To run the pipeline only up to a specific rule, use:
```
make run TARGET=<non-wildcard-rule-or-path>
```
For example, to collect the corpus first:
```
make run TARGET=merge_corpus
```

You can also use the full file path, for example:
```
make run TARGET=/models/ru-en/bicleaner/teacher-base0/model.npz.best-ce-mean-words.npz
```
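
Running up to intermediate targets first and then finishing with a full run can be scripted. This is a minimal sketch under stated assumptions: `run_staged` and `MAKE_CMD` are hypothetical names introduced for illustration, and the targets passed in must be rule names or paths that exist in your setup.

```shell
# Sketch only: run the pipeline up to each given target in order,
# then run the whole pipeline.
# MAKE_CMD is a hypothetical override (not defined by the pipeline); set
# MAKE_CMD="echo make" to print the commands instead of executing them.
run_staged() {
  make_cmd="${MAKE_CMD:-make}"
  for target in "$@"; do
    # Stop at the first stage that fails.
    $make_cmd run TARGET="$target" || return 1
  done
  # Finish with a run of all rules.
  $make_cmd run
}
```

For example, `run_staged merge_corpus` collects the corpus first and then continues with the remaining rules.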
### Rerunning

If you want to rerun a specific step or steps, delete the result files listed in the output of the corresponding Snakemake rule.
Snakemake might complain about a missing file and suggest running it with the `--clean-metadata` flag. In this case, run:
```
make clean-meta TARGET=<missing-file-name>
```
and then as usual:
```
make run
```
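
The rerun procedure above can be sketched as a small helper. This is an illustration only: `rerun_step` and `MAKE_CMD` are hypothetical names, and the sketch assumes that a failed `make run` right after deleting the file is caused by the missing-file complaint, which may not hold if the run fails for another reason.

```shell
# Sketch only: delete one result file and rerun the pipeline.
# MAKE_CMD is a hypothetical override (not defined by the pipeline); set
# MAKE_CMD="echo make" to print the commands instead of executing them.
rerun_step() {
  output="$1"    # result file of the rule you want to rerun
  make_cmd="${MAKE_CMD:-make}"

  rm -f -- "$output"   # remove the stale result so the rule runs again

  if ! $make_cmd run; then
    # Assumption: the failure is Snakemake's missing-file complaint.
    $make_cmd clean-meta TARGET="$output"
    $make_cmd run
  fi
}
```

Check the Snakemake error message before relying on the fallback branch, since the helper cannot tell a metadata complaint apart from other failures.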
