Add student base configuration option (#881)
* Add student base configuration option

* Reformat

* Add links to the experiment issue

* Update test configs

* Use templating in cache resources
eu9ene authored Oct 18, 2024
1 parent bc20aa4 commit c588cdf
Showing 16 changed files with 97 additions and 14 deletions.
docs/training-guide.md: 18 additions & 8 deletions
@@ -27,7 +27,7 @@ Considerations:
Copy the [example config](https://github.com/mozilla/firefox-translations-training/tree/main/configs/tc.prod.yml) from the `/configs` directory to modify.

Then change the language pair and the name of the experiment:
-```
+```yaml
experiment:
  name: test-quality
  src: ru
@@ -56,7 +56,7 @@ task find-corpus.py -- en ru --importer mtdata
If the versions are the same, I prefer the OPUS ones as a more stable resource.

Copy the datasets in the training config:
-```
+```yaml
datasets:
  train:
    - opus_ada83/v1
@@ -79,7 +79,7 @@ python utils/find-corpus.py en ru sacrebleu
- Some OPUS and mtdata datasets provide dev and devtest versions, so it's a good idea to add them to evaluation.
- Make sure that training, validation and evaluation datasets are different.

-```
+```yaml
# datasets to merge for validation while training
devtest:
  - flores_dev
@@ -106,7 +106,7 @@ The only limitation is probably available computational resources.
Find monolingual data and add it to `datasets.mono-src` and `datasets.mono-trg`.
Using [News Crawl](https://data.statmt.org/news-crawl/) datasets from statmt is preferable
because they are relatively clean, and the pipeline supports automatic downloading for them.
-```
+```yaml
# to be translated by the ensemble of teacher models
mono-src:
  - news-crawl_news.2020
@@ -128,7 +128,7 @@ Find more details about the supported dataset importers [here](data.md).
## 3. Configure data cleaning

To use the default data-cleaning pipeline, set:
-```
+```yaml
use-opuscleaner: false
```

@@ -146,11 +146,21 @@ for example [`teacher.train.yml`] and in the [`train.py`] script.

### Model training

+#### Student architecture
+
+Set "base" or "tiny", based on the [Bergamot configurations](https://github.com/browsermt/students/tree/master/train-student/models).
+"tiny" is smaller and faster; "base" produces higher-quality translations.
+```yaml
+student-model: tiny
+```
+
+More details about the experiments performed are in [this issue](https://github.com/mozilla/firefox-translations-training/issues/174).
+
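As a concrete example, switching an experiment to the larger student is a one-line change in the training config. A minimal sketch reusing the example fields from section 1 (only `student-model` is the new option; `trg: en` is assumed from the ru-en example):

```yaml
experiment:
  name: test-quality
  src: ru
  trg: en
  student-model: base
```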
#### Early stopping
Early stopping can be increased to make sure that training converges.
However, the benefit depends on the language pair, and a higher value makes training longer.
So, you can start with `early-stopping: 20`, monitor the training, and increase it if the model stops training too early.
-```
+```yaml
marian-args:
  # these configs override pipeline/train/configs
  training-backward:
@@ -183,7 +193,7 @@ More details:
`mini-batch-words` can be set depending on available GPU memory and the number of teachers.
It affects the batch size and decoding speed for the `translate` steps.
-```
+```yaml
marian-args:
  ...
  decoding-backward:
@@ -198,7 +208,7 @@ marian-args:

Make sure to use it only for teacher models and on GPUs that support it.
It speeds up decoding but can slightly decrease quality.
-```
+```yaml
marian-args:
  ...
  decoding-teacher:
pipeline/train/configs/model/student.base.yml: 27 additions & 0 deletions (new file)
@@ -0,0 +1,27 @@
+# https://github.com/browsermt/students/tree/master/train-student/models/student.base
+# the difference from the "tiny" configuration is the dimensionality of transformer-dim-ffn and dim-emb
+dec-cell-base-depth: 2
+dec-cell-high-depth: 1
+dec-cell: ssru
+dec-depth: 2
+dim-emb: 512
+dim-vocabs: [32000, 32000]
+enc-cell-depth: 1
+enc-cell: gru
+enc-depth: 6
+enc-type: bidirectional
+tied-embeddings-all: true
+transformer-decoder-autoreg: rnn
+transformer-dim-ffn: 2048
+transformer-ffn-activation: relu
+transformer-ffn-depth: 2
+transformer-guided-alignment-layer: last
+transformer-heads: 8
+transformer-no-projection: false
+transformer-postprocess-emb: d
+transformer-postprocess: dan
+transformer-preprocess: ""
+transformer-tied-layers: []
+transformer-train-position-embeddings: false
+type: transformer
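For comparison, a sketch of the two settings that differ in the "tiny" configuration, with values taken from the Bergamot student models linked above (an assumption worth verifying against the renamed `student.tiny.yml`):

```yaml
# "tiny" counterparts of the two differing dimensions (assumed, not part of this diff)
dim-emb: 256
transformer-dim-ffn: 1536
```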

File renamed without changes: `pipeline/train/configs/model/student.yml` → `pipeline/train/configs/model/student.tiny.yml` (matching the updated resource paths below).
pipeline/train/train.py: 24 additions & 1 deletion
@@ -31,6 +31,12 @@ class TrainingType(Enum):
    train = "train"


+class StudentModel(Enum):
+    none = "None"
+    tiny = "tiny"
+    base = "base"
+
+
class TeacherMode(Enum):
    none = "None"
    one_stage = "one-stage"
@@ -151,6 +157,7 @@ def __init__(self, args: Any, temp_dir: Path) -> None:
        self.validation_set_prefix: str = args.validation_set_prefix
        self.artifacts: Path = args.artifacts
        self.model_type: ModelType = args.model_type
+        self.student_model: StudentModel = args.student_model
        self.teacher_mode: TeacherMode = args.teacher_mode
        self.training_type: TrainingType = args.training_type
        self.best_model_metric: BestModelMetric = args.best_model_metric
@@ -178,6 +185,7 @@ def log_config(self):
logger.info(f" - validation_set_prefix: {self.validation_set_prefix}")
logger.info(f" - artifacts: {self.artifacts}")
logger.info(f" - model_type: {self.model_type.value}")
logger.info(f" - student_model: {self.student_model.value}")
logger.info(f" - teacher_mode: {self.teacher_mode.value}")
logger.info(f" - training_type: {self.training_type.value}")
logger.info(f" - best_model_metric: {self.best_model_metric}")
@@ -293,13 +301,20 @@ def get_marian_cmd(self):
            extra_args.append("--sharding")
            extra_args.append("local")

+        if self.model_type == ModelType.student:
+            if self.student_model == StudentModel.none:
+                raise ValueError("Student configuration is not provided")
+            model_name = f"student.{self.student_model.value}"
+        else:
+            model_name = self.model_type.value
+
        return [
            str(self.marian_bin),
            *apply_command_args(
                {
                    "model": self.artifacts / "model.npz",
                    "config": [
-                        train_dir / f"configs/model/{self.model_type.value}.yml",
+                        train_dir / f"configs/model/{model_name}.yml",
                        train_dir
                        / f"configs/training/{self.model_type.value}.{self.training_type.value}.yml",
                    ],
@@ -370,6 +385,14 @@ def main() -> None:
        required=True,
        help="The type of model to train",
    )
+    parser.add_argument(
+        "--student_model",
+        type=StudentModel,
+        choices=StudentModel,
+        required=False,
+        default=StudentModel.tiny,
+        help="Type of student model",
+    )
    parser.add_argument(
        "--training_type",
        type=TrainingType,
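Given the `get_marian_cmd` change above, the model config passed to Marian now depends on the student selection; e.g. for `--model_type student --student_model base` the `config` list resolves to roughly the following (illustrative sketch, paths relative to the pipeline `train` directory):

```yaml
# Marian --config files for a "base" student run (sketch)
config:
  - configs/model/student.base.yml
  - configs/training/student.train.yml
```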
taskcluster/configs/config.ci.yml: 1 addition & 0 deletions
@@ -24,6 +24,7 @@ experiment:
  opuscleaner-mode: "custom"
  teacher-mode: "two-stage"
  corpus-max-sentences: 1000
+  student-model: "tiny"

  bicleaner:
    default-threshold: 0.5
taskcluster/configs/config.prod.yml: 4 additions & 0 deletions
@@ -65,6 +65,7 @@ experiment:
  # tokenization training.
  spm-sample-size: 10_000_000
  # The size of the vocabulary
+  # TODO: this does not change the vocab size in the Marian configurations yet, only in SentencePiece
  spm-vocab-size: 32000

  # Determine how many teachers to train.
@@ -73,6 +74,9 @@ experiment:
# Switch to "one-stage" training if back-translations are produced by a high quality model or
# the model stops too early on the fine-tuning stage
teacher-mode: "two-stage"
# Two student training configurations from Bergamot are supported: "tiny" and "base"
# "base" model is twice slower and larger but adds ~2 COMET points in quality (see https://github.com/mozilla/firefox-translations-training/issues/174)
student-model: "tiny"

  # Training continuation options, see docs/using-pretrained-models.md
  pretrained-models:
taskcluster/kinds/finetune-student/kind.yml: 3 additions & 1 deletion
@@ -29,6 +29,7 @@ tasks:
src_locale: training_config.experiment.src
trg_locale: training_config.experiment.trg
best_model: training_config.experiment.best-model
+student_model: training_config.experiment.student-model
wandb_publication: training_config.wandb-publication
owner: owner
substitution-fields:
@@ -47,7 +48,7 @@
cache:
  type: finetune-student
  resources:
-    - pipeline/train/configs/model/student.yml
+    - pipeline/train/configs/model/student.{student_model}.yml
    - pipeline/train/configs/opustrainer/student.yml
    - pipeline/train/configs/training/student.train.yml
    - pipeline/train/train.py
@@ -110,6 +111,7 @@ tasks:
$MOZ_FETCHES_DIR/corpus.aln.zst
0
None
+{student_model}
None
None
--pretrained-model
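Because the cached resource is now templated, each run pins the exact model config it uses; e.g. with the default `student-model: "tiny"` the templated entry above resolves to the following (sketch; the same applies to `train-student` below):

```yaml
resources:
  - pipeline/train/configs/model/student.tiny.yml
```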
taskcluster/kinds/train-backwards/kind.yml: 1 addition & 0 deletions
@@ -115,6 +115,7 @@ tasks:
None
0
None
+None
{pretrained_backward_mode}
{pretrained_backward_type}
{marian_args}
taskcluster/kinds/train-student/kind.yml: 3 additions & 1 deletion
@@ -29,6 +29,7 @@ tasks:
best_model: training_config.experiment.best-model
src_locale: training_config.experiment.src
trg_locale: training_config.experiment.trg
+student_model: training_config.experiment.student-model
wandb_publication: training_config.wandb-publication
owner: owner
substitution-fields:
@@ -46,7 +47,7 @@
cache:
  type: train-student
  resources:
-    - pipeline/train/configs/model/student.yml
+    - pipeline/train/configs/model/student.{student_model}.yml
    - pipeline/train/configs/opustrainer/student.yml
    - pipeline/train/configs/training/student.train.yml
    - pipeline/train/train.py
@@ -111,6 +112,7 @@
$MOZ_FETCHES_DIR/corpus.aln.zst
0
None
+{student_model}
None
None
{marian_args}
taskcluster/kinds/train-teacher/kind.yml: 1 addition & 0 deletions
@@ -137,6 +137,7 @@ tasks:
$MOZ_FETCHES_DIR/corpus.aln.zst,$MOZ_FETCHES_DIR/mono.aln.zst
{{this_chunk}}
{teacher_mode}
+None
{pretrained_teacher_mode}
{pretrained_teacher_type}
{marian_args}
taskcluster/scripts/pipeline/train-taskcluster.sh: 5 additions & 3 deletions
@@ -23,9 +23,10 @@ best_model_metric=$8
alignments=$9
seed=${10}
teacher_mode=${11}
-pretrained_model_mode=${12}
-pretrained_model_type=${13}
-extra_marian_args=( "${@:14}" )
+student_model=${12}
+pretrained_model_mode=${13}
+pretrained_model_type=${14}
+extra_marian_args=( "${@:15}" )

if [ "$pretrained_model_mode" != "use" ]; then
# MOZ_FETCHES_DIR is not required for the "use" pretrained model mode
@@ -53,6 +54,7 @@ case "$pretrained_model_mode" in
fi
python3 $VCS_ROOT/pipeline/train/train.py \
--model_type "$model_type" \
+--student_model "$student_model" \
--training_type "$training_type" \
--src "$src" \
--trg "$trg" \
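Note that `student_model` becomes the twelfth positional argument here, shifting `pretrained_model_mode` and `pretrained_model_type` to positions 13 and 14 and the extra Marian args to position 15 onward; this matches the extra `None`/`{student_model}` arguments added to the kind.yml command lists above.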
taskcluster/test/params/large-lt-en.yml: 1 addition & 0 deletions
@@ -150,6 +150,7 @@ training_config:
src: lt
teacher-ensemble: 2
teacher-mode: 'two-stage'
+student-model: 'base'
trg: en
use-opuscleaner: 'false'
vocab: NOT-YET-SUPPORTED
taskcluster/test/params/small-ru-en.yml: 1 addition & 0 deletions
@@ -62,6 +62,7 @@ training_config:
src: ru
teacher-ensemble: 1
teacher-mode: 'two-stage'
+student-model: 'tiny'
trg: en
use-opuscleaner: 'true'
vocab: NOT-YET-SUPPORTED
taskcluster/translations_taskgraph/actions/train.py: 6 additions & 0 deletions
@@ -130,6 +130,12 @@ def validate_pretrained_models(params):
"enum": ["one-stage", "two-stage"],
"default": "two-stage",
},
"student-model": {
"type": "string",
"description": "Student model configuration",
"enum": ["tiny", "base"],
"default": "tiny",
},
"mono-max-sentences-src": {
"type": "object",
"default": defaults["experiment"]["mono-max-sentences-src"],
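A training-action configuration that passes this schema would set, for example (sketch; the field lives under `experiment`, as in the prod config above):

```yaml
experiment:
  student-model: base  # must be "tiny" or "base"; defaults to "tiny"
```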
taskcluster/translations_taskgraph/parameters.py: 1 addition & 0 deletions
@@ -37,6 +37,7 @@ def get_ci_training_config(_=None) -> dict:
Required("trg"): str,
Required("teacher-ensemble"): int,
Required("teacher-mode"): str,
Required("student-model"): str,
Optional("corpus-max-sentences"): int,
Required("mono-max-sentences-trg"): {
Required("total"): int,
tests/fixtures/config.pytest.yml: 1 addition & 0 deletions
@@ -21,6 +21,7 @@ experiment:
  spm-sample-size: 10_000_000
  teacher-ensemble: 2
  teacher-mode: "two-stage"
+  student-model: "tiny"
  backward-model: NOT-YET-SUPPORTED
  vocab: NOT-YET-SUPPORTED
datasets:
