Researchers and practitioners in Neural Machine Translation (NMT) increasingly rely on methods that require extensive amounts of auxiliary data. This is especially true for low-resource NMT, where large-scale corpora are scarce, and as a result, low-resource NMT without supplementary data has received less attention. This work challenges the idea that modern NMT systems are poorly equipped for low-resource translation by examining a variety of systems and techniques under simulated Finnish-English low-resource conditions. It shows that under certain low-resource conditions, the performance of the Transformer can be considerably improved with simple model compression and regularization techniques. In medium-resource settings, an optimized Transformer is shown to be competitive with language-model fine-tuning in both in-domain and out-of-domain conditions. To further improve robustness to samples distant from the training distribution, this work explores subword regularization with BPE-Dropout as well as defensive distillation. An optimized Transformer proves superior to subword regularization, whereas defensive distillation improves domain robustness on the domains most distant from the original training distribution. Finally, a small manual evaluation assesses the adequacy and fluency of each system and technique; it shows that under some low-resource conditions, the translations produced by most systems are in fact grammatical but highly inadequate.
./scripts/install_libraries.sh
./scripts/download_data.sh data
./scripts/transformer/preprocessing/truecase.sh
./scripts/transformer/preprocessing/preprocess.sh [experiment name] [corpus size] [number of bpe merge operations]
./scripts/transformer/preprocessing/preprocess_ood.sh [experiment name] [corpus size] [domain]
./scripts/transformer/preprocessing/binarize_transformer.sh [experiment name] [corpus size]
./scripts/transformer/preprocessing/binarize_transformer_ood.sh [experiment name] [corpus size]
./scripts/transformer/preprocessing/copy_corpus.sh [corpus size]
./scripts/transformer/preprocessing/preprocess_bpe_dropout.sh [experiment name] [corpus size]
./scripts/transformer/preprocessing/binarize_bpe_dropout.sh [experiment name] [corpus size]
./scripts/transformer/preprocessing/binarize_bpe_dropout_ood.sh [experiment name] [corpus size]
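As a purely illustrative example, a standard preprocessing run followed by the BPE-Dropout variant might look like the following; the experiment names ("baseline", "bpe-dropout"), the corpus size ("100k"), and the number of BPE merge operations ("10000") are placeholder values, so check the scripts and your data setup for the values they actually accept:
./scripts/transformer/preprocessing/truecase.sh
./scripts/transformer/preprocessing/preprocess.sh baseline 100k 10000
./scripts/transformer/preprocessing/binarize_transformer.sh baseline 100k
./scripts/transformer/preprocessing/copy_corpus.sh 100k
./scripts/transformer/preprocessing/preprocess_bpe_dropout.sh bpe-dropout 100k
./scripts/transformer/preprocessing/binarize_bpe_dropout.sh bpe-dropout 100k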
To train an individual model, see scripts under scripts/transformer/training
To evaluate an individual model, see scripts under scripts/transformer/evaluation
Find example slurm scripts for training under scripts/transformer/training/slurm
Find example slurm scripts for evaluation under scripts/transformer/evaluation/slurm
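For illustration only, the example slurm scripts can be submitted with sbatch; the bracketed names below are placeholders, so substitute an actual script from the directories above:
sbatch scripts/transformer/training/slurm/[training script]
sbatch scripts/transformer/evaluation/slurm/[evaluation script]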
For distillation to work, you must first have trained a Transformer on one of the Europarl subsets, following the steps above.
To generate a distilled training set, see the scripts under scripts/transformer/translate
To prepare the distilled training set for the student network:
./scripts/transformer/preprocessing/binarize_distillation.sh [experiment name] [corpus size]
./scripts/transformer/preprocessing/binarize_distillation_ood.sh [experiment name] [corpus size]
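For example, with placeholder values for the experiment name and corpus size (illustrative only; use the name and subset size of your own run):
./scripts/transformer/preprocessing/binarize_distillation.sh baseline 100k
./scripts/transformer/preprocessing/binarize_distillation_ood.sh baseline 100k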
To train the student network, see scripts under scripts/transformer/training
To evaluate the student network, see scripts under scripts/transformer/evaluation
For distillation training to work with Fairseq, modify the upgrade_state_dict_named method of the TransformerDecoder class under /tools/fairseq/fairseq/models/transformer.py as follows:
def upgrade_state_dict_named(self, state_dict, name):
    """Upgrade a (possibly old) state dict for new versions of fairseq."""
    # Keep the current weights for the decoder embedding table
    for k in state_dict.keys():
        if 'decoder.embed_tokens' in k:
            state_dict[k] = self.embed_tokens.weight
    if isinstance(self.embed_positions, SinusoidalPositionalEmbedding):
        weights_key = "{}.embed_positions.weights".format(name)
        if weights_key in state_dict:
            del state_dict[weights_key]
        state_dict[
            "{}.embed_positions._float_tensor".format(name)
        ] = torch.FloatTensor(1)
    if f"{name}.output_projection.weight" not in state_dict:
        if self.share_input_output_embed:
            embed_out_key = f"{name}.embed_tokens.weight"
        else:
            embed_out_key = f"{name}.embed_out"
        if embed_out_key in state_dict:
            state_dict[f"{name}.output_projection.weight"] = state_dict[
                embed_out_key
            ]
            if not self.share_input_output_embed:
                del state_dict[embed_out_key]
    for i in range(self.num_layers):
        # update layer norms
        layer_norm_map = {
            "0": "self_attn_layer_norm",
            "1": "encoder_attn_layer_norm",
            "2": "final_layer_norm",
        }
        for old, new in layer_norm_map.items():
            for m in ("weight", "bias"):
                k = "{}.layers.{}.layer_norms.{}.{}".format(name, i, old, m)
                if k in state_dict:
                    state_dict[
                        "{}.layers.{}.{}.{}".format(name, i, new, m)
                    ] = state_dict[k]
                    del state_dict[k]
    version_key = "{}.version".format(name)
    if utils.item(state_dict.get(version_key, torch.Tensor([1]))[0]) <= 2:
        # earlier checkpoints did not normalize after the stack of layers
        self.layer_norm = None
        self.normalize = False
        state_dict[version_key] = torch.Tensor([1])
    return state_dict
This allows you to initialize the parameters of the student network using the parameters of the teacher model.
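The training scripts in this repository are the reference, but as a rough sketch, a Fairseq command that restores the teacher checkpoint when starting the student might look like the following; the data directory, checkpoint paths, and hyperparameters are placeholders:
# placeholder paths and hyperparameters; with the modification above, the checkpoint's
# decoder embedding table is replaced by the student's own embeddings when it is loaded
fairseq-train data-bin/[distilled corpus] \
    --arch transformer --share-decoder-input-output-embed \
    --restore-file checkpoints/teacher/checkpoint_best.pt \
    --reset-optimizer --reset-dataloader --reset-meters --reset-lr-scheduler \
    --optimizer adam --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096 --save-dir checkpoints/student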
./scripts/mbart/get_pretrained_model.sh
./scripts/mbart/preprocessing/spm_tokenize.sh [corpus size]
./scripts/mbart/preprocessing/spm_tokenize_ood.sh
./scripts/mbart/build_vocab.sh [corpus size]
./scripts/mbart/trim_mbart.sh
./scripts/mbart/preprocessing/binarize.sh [corpus size]
./scripts/mbart/preprocessing/binarize_ood.sh [corpus size]
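For illustration, with a placeholder corpus size of 100k (use the Europarl subset size of your own experiments):
./scripts/mbart/preprocessing/spm_tokenize.sh 100k
./scripts/mbart/build_vocab.sh 100k
./scripts/mbart/preprocessing/binarize.sh 100k
./scripts/mbart/preprocessing/binarize_ood.sh 100k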
For fine-tuning mBART25, see /scripts/mbart/finetune.sh
For evaluating mBART25, see /scripts/mbart/eval.sh and /scripts/mbart/eval_ood.sh
Find example slurm scripts for training and evaluation in /scripts/mbart/slurm
I encountered a bug when fine-tuning mBART25, which was fixed by modifying the __init__ function of the TranslationFromPretrainedBARTTask class under /tools/fairseq/fairseq/tasks/translation_from_pretrained_bart.py as follows:
def __init__(self, args, src_dict, tgt_dict):
    super().__init__(args, src_dict, tgt_dict)
    self.args = args  # required for mBART fine-tuning; can be removed otherwise
    self.langs = args.langs.split(",")
    for d in [src_dict, tgt_dict]:
        for l in self.langs:
            d.add_symbol("[{}]".format(l))
        d.add_symbol("<mask>")
./scripts/rnn/jsonify.sh [experiment name] [corpus size]
See /scripts/rnn/train.sh and /scripts/rnn/translate.sh for training and translation
Find example slurm scripts for training and evaluation in /scripts/rnn/slurm
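A rough end-to-end sketch with placeholder arguments (the experiment name "rnn-baseline" and corpus size "100k" are illustrative, and train.sh and translate.sh may expect arguments of their own, so check the scripts themselves):
./scripts/rnn/jsonify.sh rnn-baseline 100k
./scripts/rnn/train.sh
./scripts/rnn/translate.sh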