
Robustness of Machine Translation for Low-Resource Languages

Report

Abstract

It is becoming increasingly common for researchers and practitioners to rely on methods within the field of Neural Machine Translation (NMT) that require the use of an extensive amount of auxiliary data. This is especially true for low-resource NMT, where the availability of large-scale corpora is limited. As a result, the field of low-resource NMT without the use of supplementary data has received less attention. This work challenges the idea that modern NMT systems are poorly equipped for low-resource NMT by examining a variety of different systems and techniques in simulated Finnish-English low-resource conditions. This project shows that under certain low-resource conditions, the performance of the Transformer can be considerably improved via simple model compression and regularization techniques. In medium-resource settings, it is shown that an optimized Transformer is competitive with language model fine-tuning, in both in-domain and out-of-domain conditions. In an attempt to further improve robustness towards samples distant from the training distribution, this work explores subword regularization using BPE-Dropout, and defensive distillation. It is found that an optimized Transformer is superior to subword regularization, whereas defensive distillation improves domain robustness on the domains that are most distant from the original training distribution. A small manual evaluation is carried out to assess the robustness of each system and technique with respect to adequacy and fluency. The results show that under some low-resource conditions, the translations generated by most systems are in fact grammatical, yet highly inadequate.

Install required libraries

./scripts/install_libraries.sh 

Download data

./scripts/download_data.sh data

Transformer preprocessing

The truecaser is learned on the full in-domain Europarl corpus

./scripts/transformer/preprocessing/truecase.sh
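For reference, this step is assumed to use a Moses-style truecaser; a minimal sketch of learning and applying one with the Moses recaser scripts (paths are illustrative, not necessarily the ones used by truecase.sh):

# Learn a truecasing model on the English side of the in-domain training data (illustrative paths)
perl mosesdecoder/scripts/recaser/train-truecaser.perl --model truecase-model.en --corpus data/europarl.train.en

# Apply the learned model to a corpus
perl mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.en < data/europarl.train.en > data/europarl.train.tc.en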

In-domain Byte Pair Encoding

./scripts/transformer/preprocessing/preprocess.sh [experiment name] [corpus size] [number of bpe merge operations]
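For example, with illustrative argument values (the corpus sizes and BPE settings actually used are defined by the scripts and the report, not here):

./scripts/transformer/preprocessing/preprocess.sh baseline 100000 10000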

Out-of-domain Byte Pair Encoding

./scripts/transformer/preprocessing/preprocess_ood.sh [experiment name] [corpus size] [domain]

Binarize

In-domain

./scripts/transformer/preprocessing/binarize_transformer.sh [experiment name] [corpus size]

Out-of-domain

./scripts/transformer/preprocessing/binarize_transformer_ood.sh [experiment name] [corpus size]
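These binarization scripts are expected to wrap fairseq-preprocess; a minimal sketch of that step with illustrative paths and a joined source-target dictionary (check the scripts for the exact options used):

fairseq-preprocess \
    --source-lang fi --target-lang en \
    --trainpref data/bpe/train --validpref data/bpe/valid --testpref data/bpe/test \
    --destdir data-bin/baseline \
    --joined-dictionary --workers 8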

BPE-Dropout

Copy the training corpus l=64 times

./scripts/transformer/preprocessing/copy_corpus.sh [corpus size]

Apply BPE-Dropout with p = 0.1

./scripts/transformer/preprocessing/preprocess_bpe_dropout.sh [experiment name] [corpus size]
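Conceptually, each of the l copies is re-segmented stochastically, so the model sees several different segmentations of the same sentence. A minimal sketch of the two steps above, assuming subword-nmt as the BPE implementation (paths and file names are illustrative):

# Concatenate l = 64 copies of the training corpus (likewise for the English side)
for i in $(seq 64); do cat data/train.fi >> data/train.copied.fi; done

# Re-apply BPE with dropout: merges are randomly skipped with probability p = 0.1,
# so every copy receives a different segmentation
subword-nmt apply-bpe --codes bpe.codes --dropout 0.1 < data/train.copied.fi > data/train.bpe_dropout.fi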

Binarize BPE-Dropout

In-domain

./scripts/transformer/preprocessing/binarize_bpe_dropout.sh [experiment name] [corpus size]

Out-of-domain

./scripts/transformer/preprocessing/binarize_bpe_dropout_ood.sh [experiment name] [corpus size]

Transformer Training and Evaluation

To train an individual model, see scripts under scripts/transformer/training

To evaluate an individual model, see scripts under scripts/transformer/evaluation

Find example slurm scripts for training under scripts/transformer/training/slurm

Find example slurm scripts for evaluation under scripts/transformer/evaluation/slurm

Distillation

For distillation to work, you must first have trained a Transformer on one of the Europarl subsets, following the steps above.

To generate a distilled training set, see scripts/transformer/translate
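For sequence-level knowledge distillation, the distilled training set is produced by translating the source side of the training data with the trained teacher. A rough sketch with fairseq-generate (illustrative paths; the actual decoding settings live in scripts/transformer/translate):

fairseq-generate data-bin/baseline \
    --path checkpoints/teacher/checkpoint_best.pt \
    --gen-subset train --beam 5 --batch-size 64 > distilled.out

# One common way to extract the hypotheses and restore the original sentence order
grep ^H distilled.out | sort -t- -k2 -n | cut -f3 > train.distilled.en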

To prepare the distilled training set for the student network:

./scripts/transformer/preprocessing/binarize_distillation.sh [experiment name] [corpus size]

./scripts/transformer/preprocessing/binarize_distillation_ood.sh [experiment name] [corpus size]

To train the student network, see scripts under scripts/transformer/training

To evaluate the student network, see scripts under scripts/transformer/evaluation

Distillation training

For distillation training to work with Fairseq, modify the upgrade_state_dict_named method of the TransformerDecoder class under /tools/fairseq/fairseq/models/transformer.py as follows:

def upgrade_state_dict_named(self, state_dict, name):
       """Upgrade a (possibly old) state dict for new versions of fairseq."""
       # Keep the model's current weights for the decoder embedding table,
       # so the embeddings in the loaded checkpoint do not overwrite them
       for k in state_dict.keys():
           if 'decoder.embed_tokens' in k:
               state_dict[k] = self.embed_tokens.weight
       if isinstance(self.embed_positions, SinusoidalPositionalEmbedding):
           weights_key = "{}.embed_positions.weights".format(name)
           if weights_key in state_dict:
               del state_dict[weights_key]
           state_dict[
               "{}.embed_positions._float_tensor".format(name)
           ] = torch.FloatTensor(1)

       if f"{name}.output_projection.weight" not in state_dict:
           if self.share_input_output_embed:
               embed_out_key = f"{name}.embed_tokens.weight"
           else:
               embed_out_key = f"{name}.embed_out"
           if embed_out_key in state_dict:
               state_dict[f"{name}.output_projection.weight"] = state_dict[
                   embed_out_key
               ]
               if not self.share_input_output_embed:
                   del state_dict[embed_out_key]

       for i in range(self.num_layers):
           # update layer norms
           layer_norm_map = {
               "0": "self_attn_layer_norm",
               "1": "encoder_attn_layer_norm",
               "2": "final_layer_norm",
           }
           for old, new in layer_norm_map.items():
               for m in ("weight", "bias"):
                   k = "{}.layers.{}.layer_norms.{}.{}".format(name, i, old, m)
                   if k in state_dict:
                       state_dict[
                           "{}.layers.{}.{}.{}".format(name, i, new, m)
                       ] = state_dict[k]
                       del state_dict[k]

       version_key = "{}.version".format(name)
       if utils.item(state_dict.get(version_key, torch.Tensor([1]))[0]) <= 2:
           # earlier checkpoints did not normalize after the stack of layers
           self.layer_norm = None
           self.normalize = False
           state_dict[version_key] = torch.Tensor([1])

       return state_dict

This allows you to initialize the parameters of the student network using the parameters of the teacher model.
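With this modification in place, one way to initialize the student from the teacher in fairseq is to restore the teacher checkpoint when training on the distilled data; a rough sketch with illustrative hyperparameters (the real configurations are in the training scripts):

fairseq-train data-bin/distilled \
    --arch transformer \
    --optimizer adam --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096 \
    --restore-file checkpoints/teacher/checkpoint_best.pt \
    --reset-optimizer --reset-lr-scheduler --reset-dataloader --reset-meters \
    --save-dir checkpoints/student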

mBART25

Download the pre-trained model

./scripts/mbart/get_pretrained_model.sh

Tokenization

./scripts/mbart/preprocessing/spm_tokenize.sh [corpus size]

./scripts/mbart/preprocessing/spm_tokenize_ood.sh
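Tokenization uses the SentencePiece model shipped with the pre-trained mBART25 checkpoint; a minimal sketch with the standalone spm_encode tool (illustrative paths):

SPM_MODEL=mbart.cc25/sentence.bpe.model
spm_encode --model=$SPM_MODEL --output_format=piece < data/train.fi > data/train.spm.fi
spm_encode --model=$SPM_MODEL --output_format=piece < data/train.en > data/train.spm.en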

Build a new dictionary based on the in-domain text used for fine-tuning

./scripts/mbart/build_vocab.sh [corpus size]

Prune the pre-trained model

./scripts/mbart/trim_mbart.sh

Binarize

./scripts/mbart/preprocessing/binarize.sh [corpus size]

./scripts/mbart/preprocessing/binarize_ood.sh [corpus size]

Training and Evaluation

For fine-tuning mBART25, see /scripts/mbart/finetune.sh

For evaluating mBART25, see /scripts/mbart/eval.sh and /scripts/mbart/eval_ood.sh

Find example slurm scripts for training and evaluation in /scripts/mbart/slurm
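For reference, fine-tuning roughly follows the fairseq mBART example; a sketch with illustrative values (the exact hyperparameters are in finetune.sh, and in practice the restored checkpoint would be the pruned one produced by trim_mbart.sh):

langs=ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,tr_TR,vi_VN,zh_CN

fairseq-train data-bin/mbart \
    --arch mbart_large --task translation_from_pretrained_bart \
    --source-lang fi_FI --target-lang en_XX --langs $langs \
    --encoder-normalize-before --decoder-normalize-before --layernorm-embedding \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.2 \
    --optimizer adam --lr 3e-05 --lr-scheduler polynomial_decay --warmup-updates 2500 --total-num-update 40000 \
    --max-tokens 1024 --update-freq 2 \
    --restore-file mbart.cc25/model.pt \
    --reset-optimizer --reset-meters --reset-dataloader --reset-lr-scheduler \
    --save-dir checkpoints/mbart_ft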

I encountered a bug when fine-tuning mBART25, which was fixed by modifying the __init__ function of the TranslationFromPretrainedBARTTask class under /tools/fairseq/fairseq/tasks/translation_from_pretrained_bart.py as follows:

def __init__(self, args, src_dict, tgt_dict):
        super().__init__(args, src_dict, tgt_dict)
        self.args = args  # storing args is required for mBART fine-tuning; this line can be removed otherwise
        self.langs = args.langs.split(",")
        for d in [src_dict, tgt_dict]:
            for l in self.langs:
                d.add_symbol("[{}]".format(l))
            d.add_symbol("<mask>")

RNN

Build network dictionaries

./scripts/rnn/jsonify.sh [experiment name] [corpus size]

Training and Evaluation

See /scripts/rnn/train.sh and /scripts/rnn/translate.sh

Find example slurm scripts for training and evaluation in /scripts/rnn/slurm