Add student base configuration option (#881)
* Add student base configuration option

* Reformat

* Add links to the experiment issue

* Update test configs

* Use templating in cache resources
eu9ene authored Oct 18, 2024
1 parent bc20aa4 commit c588cdf
Showing 16 changed files with 97 additions and 14 deletions.
docs/training-guide.md: 18 additions & 8 deletions
@@ -27,7 +27,7 @@ Considerations:
Copy the [example config](https://github.com/mozilla/firefox-translations-training/tree/main/configs/tc.prod.yml) from the `/configs` directory to modify.

Then change the language pair and the name of the experiment:
-```
+```yaml
experiment:
  name: test-quality
  src: ru
@@ -56,7 +56,7 @@ task find-corpus.py -- en ru --importer mtdata
If the versions are the same, I prefer the OPUS ones as a more stable resource.

Copy the datasets in the training config:
-```
+```yaml
datasets:
  train:
    - opus_ada83/v1
@@ -79,7 +79,7 @@ python utils/find-corpus.py en ru sacrebleu
- Some OPUS and mtdata datasets provide dev and devtest versions, so it's a good idea to add them to evaluation.
- Make sure that training, validation and evaluation datasets are different.

-```
+```yaml
# datasets to merge for validation while training
devtest:
  - flores_dev
@@ -106,7 +106,7 @@ The only limitation is probably available computational resources.
Find monolingual data and add it to `datasets.mono-src` and `datasets.mono-trg`.
Using [News Crawl](https://data.statmt.org/news-crawl/) datasets from statmt is preferable
because they are relatively clean, and the pipeline supports automatic downloading for them.
-```
+```yaml
# to be translated by the ensemble of teacher models
mono-src:
  - news-crawl_news.2020
@@ -128,7 +128,7 @@ Find more details about the supported dataset importers [here](data.md).
## 3. Configure data cleaning

To use the default data-cleaning pipeline, set:
-```
+```yaml
use-opuscleaner: false
```

@@ -146,11 +146,21 @@ for example [`teacher.train.yml`] and in the [`train.py`] script.

### Model training

+#### Student architecture
+
+Set "base" or "tiny", based on the [Bergamot configurations](https://github.com/browsermt/students/tree/master/train-student/models).
+"tiny" is smaller and faster; "base" produces higher-quality translations.
+```yaml
+student-model: tiny
+```
+
+More details about the experiments performed are in [this issue](https://github.com/mozilla/firefox-translations-training/issues/174).
+
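As a concrete example, switching an experiment to the larger student is a one-line change in the training config. A minimal sketch reusing the example fields from section 1 (only `student-model` is the new option; `trg: en` is assumed from the ru-en example):

```yaml
experiment:
  name: test-quality
  src: ru
  trg: en
  student-model: base
```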
#### Early stopping
Early stopping can be increased to make sure that training converges.
However, the benefit depends on the language pair, and a higher value makes training longer.
So, you can start with `early-stopping: 20`, monitor the training, and increase it if the model stops training too early.
-```
+```yaml
marian-args:
  # these configs override pipeline/train/configs
  training-backward:
@@ -183,7 +193,7 @@ More details:
`mini-batch-words` can be set depending on available GPU memory and the number of teachers.
It affects the batch size and decoding speed for the `translate` steps.
-```
+```yaml
marian-args:
  ...
  decoding-backward:
@@ -198,7 +208,7 @@ marian-args:

Make sure to use it only for teacher models and on GPUs that support it.
It speeds up decoding but can slightly decrease quality.
-```
+```yaml
marian-args:
  ...
  decoding-teacher:
pipeline/train/configs/model/student.base.yml: 27 additions & 0 deletions (new file)
@@ -0,0 +1,27 @@
+# https://github.com/browsermt/students/tree/master/train-student/models/student.base
+# the difference from the "tiny" configuration is the dimensionality of transformer-dim-ffn and dim-emb
+dec-cell-base-depth: 2
+dec-cell-high-depth: 1
+dec-cell: ssru
+dec-depth: 2
+dim-emb: 512
+dim-vocabs: [32000, 32000]
+enc-cell-depth: 1
+enc-cell: gru
+enc-depth: 6
+enc-type: bidirectional
+tied-embeddings-all: true
+transformer-decoder-autoreg: rnn
+transformer-dim-ffn: 2048
+transformer-ffn-activation: relu
+transformer-ffn-depth: 2
+transformer-guided-alignment-layer: last
+transformer-heads: 8
+transformer-no-projection: false
+transformer-postprocess-emb: d
+transformer-postprocess: dan
+transformer-preprocess: ""
+transformer-tied-layers: []
+transformer-train-position-embeddings: false
+type: transformer
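For comparison, a sketch of the two settings that differ in the "tiny" configuration, with values taken from the Bergamot student models linked above (an assumption worth verifying against the renamed `student.tiny.yml`):

```yaml
# "tiny" counterparts of the two differing dimensions (assumed, not part of this diff)
dim-emb: 256
transformer-dim-ffn: 1536
```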

File renamed without changes: `pipeline/train/configs/model/student.yml` → `pipeline/train/configs/model/student.tiny.yml` (matching the updated resource paths below).
pipeline/train/train.py: 24 additions & 1 deletion
@@ -31,6 +31,12 @@ class TrainingType(Enum):
    train = "train"


+class StudentModel(Enum):
+    none = "None"
+    tiny = "tiny"
+    base = "base"
+
+
class TeacherMode(Enum):
    none = "None"
    one_stage = "one-stage"
@@ -151,6 +157,7 @@ def __init__(self, args: Any, temp_dir: Path) -> None:
        self.validation_set_prefix: str = args.validation_set_prefix
        self.artifacts: Path = args.artifacts
        self.model_type: ModelType = args.model_type
+        self.student_model: StudentModel = args.student_model
        self.teacher_mode: TeacherMode = args.teacher_mode
        self.training_type: TrainingType = args.training_type
        self.best_model_metric: BestModelMetric = args.best_model_metric
@@ -178,6 +185,7 @@ def log_config(self):
logger.info(f" - validation_set_prefix: {self.validation_set_prefix}")
logger.info(f" - artifacts: {self.artifacts}")
logger.info(f" - model_type: {self.model_type.value}")
logger.info(f" - student_model: {self.student_model.value}")
logger.info(f" - teacher_mode: {self.teacher_mode.value}")
logger.info(f" - training_type: {self.training_type.value}")
logger.info(f" - best_model_metric: {self.best_model_metric}")
@@ -293,13 +301,20 @@ def get_marian_cmd(self):
            extra_args.append("--sharding")
            extra_args.append("local")

+        if self.model_type == ModelType.student:
+            if self.student_model == StudentModel.none:
+                raise ValueError("Student configuration is not provided")
+            model_name = f"student.{self.student_model.value}"
+        else:
+            model_name = self.model_type.value
+
        return [
            str(self.marian_bin),
            *apply_command_args(
                {
                    "model": self.artifacts / "model.npz",
                    "config": [
-                        train_dir / f"configs/model/{self.model_type.value}.yml",
+                        train_dir / f"configs/model/{model_name}.yml",
                        train_dir
                        / f"configs/training/{self.model_type.value}.{self.training_type.value}.yml",
                    ],
@@ -370,6 +385,14 @@ def main() -> None:
        required=True,
        help="The type of model to train",
    )
+    parser.add_argument(
+        "--student_model",
+        type=StudentModel,
+        choices=StudentModel,
+        required=False,
+        default=StudentModel.tiny,
+        help="Type of student model",
+    )
    parser.add_argument(
        "--training_type",
        type=TrainingType,
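Given the `get_marian_cmd` change above, the model config passed to Marian now depends on the student selection; e.g. for `--model_type student --student_model base` the `config` list resolves to roughly the following (illustrative sketch, paths relative to the pipeline `train` directory):

```yaml
# Marian --config files for a "base" student run (sketch)
config:
  - configs/model/student.base.yml
  - configs/training/student.train.yml
```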
taskcluster/configs/config.ci.yml: 1 addition & 0 deletions
@@ -24,6 +24,7 @@ experiment:
  opuscleaner-mode: "custom"
  teacher-mode: "two-stage"
  corpus-max-sentences: 1000
+  student-model: "tiny"

  bicleaner:
    default-threshold: 0.5
taskcluster/configs/config.prod.yml: 4 additions & 0 deletions
@@ -65,6 +65,7 @@ experiment:
  # tokenization training.
  spm-sample-size: 10_000_000
  # The size of the vocabulary
+  # TODO: this does not change the vocab size in the Marian configurations yet, only in SentencePiece
  spm-vocab-size: 32000

  # Determine how many teachers to train.
@@ -73,6 +74,9 @@ experiment:
# Switch to "one-stage" training if back-translations are produced by a high quality model or
# the model stops too early on the fine-tuning stage
teacher-mode: "two-stage"
# Two student training configurations from Bergamot are supported: "tiny" and "base"
# "base" model is twice slower and larger but adds ~2 COMET points in quality (see https://github.com/mozilla/firefox-translations-training/issues/174)
student-model: "tiny"

  # Training continuation options, see docs/using-pretrained-models.md
  pretrained-models:
taskcluster/kinds/finetune-student/kind.yml: 3 additions & 1 deletion
@@ -29,6 +29,7 @@ tasks:
src_locale: training_config.experiment.src
trg_locale: training_config.experiment.trg
best_model: training_config.experiment.best-model
+student_model: training_config.experiment.student-model
wandb_publication: training_config.wandb-publication
owner: owner
substitution-fields:
@@ -47,7 +48,7 @@
cache:
  type: finetune-student
  resources:
-    - pipeline/train/configs/model/student.yml
+    - pipeline/train/configs/model/student.{student_model}.yml
    - pipeline/train/configs/opustrainer/student.yml
    - pipeline/train/configs/training/student.train.yml
    - pipeline/train/train.py
@@ -110,6 +111,7 @@ tasks:
$MOZ_FETCHES_DIR/corpus.aln.zst
0
None
+{student_model}
None
None
--pretrained-model
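Because the cached resource is now templated, each run pins the exact model config it uses; e.g. with the default `student-model: "tiny"` the templated entry above resolves to the following (sketch; the same applies to `train-student` below):

```yaml
resources:
  - pipeline/train/configs/model/student.tiny.yml
```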
taskcluster/kinds/train-backwards/kind.yml: 1 addition & 0 deletions
@@ -115,6 +115,7 @@ tasks:
None
0
None
+None
{pretrained_backward_mode}
{pretrained_backward_type}
{marian_args}
taskcluster/kinds/train-student/kind.yml: 3 additions & 1 deletion
@@ -29,6 +29,7 @@ tasks:
best_model: training_config.experiment.best-model
src_locale: training_config.experiment.src
trg_locale: training_config.experiment.trg
+student_model: training_config.experiment.student-model
wandb_publication: training_config.wandb-publication
owner: owner
substitution-fields:
@@ -46,7 +47,7 @@
cache:
  type: train-student
  resources:
-    - pipeline/train/configs/model/student.yml
+    - pipeline/train/configs/model/student.{student_model}.yml
    - pipeline/train/configs/opustrainer/student.yml
    - pipeline/train/configs/training/student.train.yml
    - pipeline/train/train.py
@@ -111,6 +112,7 @@
$MOZ_FETCHES_DIR/corpus.aln.zst
0
None
+{student_model}
None
None
{marian_args}
taskcluster/kinds/train-teacher/kind.yml: 1 addition & 0 deletions
@@ -137,6 +137,7 @@ tasks:
$MOZ_FETCHES_DIR/corpus.aln.zst,$MOZ_FETCHES_DIR/mono.aln.zst
{{this_chunk}}
{teacher_mode}
+None
{pretrained_teacher_mode}
{pretrained_teacher_type}
{marian_args}
taskcluster/scripts/pipeline/train-taskcluster.sh: 5 additions & 3 deletions
@@ -23,9 +23,10 @@ best_model_metric=$8
alignments=$9
seed=${10}
teacher_mode=${11}
-pretrained_model_mode=${12}
-pretrained_model_type=${13}
-extra_marian_args=( "${@:14}" )
+student_model=${12}
+pretrained_model_mode=${13}
+pretrained_model_type=${14}
+extra_marian_args=( "${@:15}" )

if [ "$pretrained_model_mode" != "use" ]; then
# MOZ_FETCHES_DIR is not required for the "use" pretrained model mode
@@ -53,6 +54,7 @@ case "$pretrained_model_mode" in
fi
python3 $VCS_ROOT/pipeline/train/train.py \
--model_type "$model_type" \
+--student_model "$student_model" \
--training_type "$training_type" \
--src "$src" \
--trg "$trg" \
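Note that `student_model` becomes the twelfth positional argument here, shifting `pretrained_model_mode` and `pretrained_model_type` to positions 13 and 14 and the extra Marian args to position 15 onward; this matches the extra `None`/`{student_model}` arguments added to the kind.yml command lists above.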
taskcluster/test/params/large-lt-en.yml: 1 addition & 0 deletions
@@ -150,6 +150,7 @@ training_config:
src: lt
teacher-ensemble: 2
teacher-mode: 'two-stage'
+student-model: 'base'
trg: en
use-opuscleaner: 'false'
vocab: NOT-YET-SUPPORTED
taskcluster/test/params/small-ru-en.yml: 1 addition & 0 deletions
@@ -62,6 +62,7 @@ training_config:
src: ru
teacher-ensemble: 1
teacher-mode: 'two-stage'
+student-model: 'tiny'
trg: en
use-opuscleaner: 'true'
vocab: NOT-YET-SUPPORTED
taskcluster/translations_taskgraph/actions/train.py: 6 additions & 0 deletions
@@ -130,6 +130,12 @@ def validate_pretrained_models(params):
"enum": ["one-stage", "two-stage"],
"default": "two-stage",
},
"student-model": {
"type": "string",
"description": "Student model configuration",
"enum": ["tiny", "base"],
"default": "tiny",
},
"mono-max-sentences-src": {
"type": "object",
"default": defaults["experiment"]["mono-max-sentences-src"],
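A training-action configuration that passes this schema would set, for example (sketch; the field lives under `experiment`, as in the prod config above):

```yaml
experiment:
  student-model: base  # must be "tiny" or "base"; defaults to "tiny"
```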
taskcluster/translations_taskgraph/parameters.py: 1 addition & 0 deletions
@@ -37,6 +37,7 @@ def get_ci_training_config(_=None) -> dict:
Required("trg"): str,
Required("teacher-ensemble"): int,
Required("teacher-mode"): str,
Required("student-model"): str,
Optional("corpus-max-sentences"): int,
Required("mono-max-sentences-trg"): {
Required("total"): int,
tests/fixtures/config.pytest.yml: 1 addition & 0 deletions
@@ -21,6 +21,7 @@ experiment:
  spm-sample-size: 10_000_000
  teacher-ensemble: 2
  teacher-mode: "two-stage"
+  student-model: "tiny"
  backward-model: NOT-YET-SUPPORTED
  vocab: NOT-YET-SUPPORTED
datasets:
