Translation
This example trains a model for the WMT'14 English-German news translation task. It uses on-the-fly tokenization with SentencePiece, and sacreBLEU for evaluation.
Step 0: Download the data and prepare the subword model
The preliminary steps are defined in examples/scripts/prepare_wmt_data.sh. The following commands download the necessary datasets and prepare a SentencePiece model:
```bash
chmod u+x prepare_wmt_data.sh
./prepare_wmt_data.sh
```
Note: the SentencePiece binaries must be installed before running this script.
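If they are missing, one way to get the binaries (spm_train, spm_encode, spm_decode) is to build SentencePiece from source, following the upstream README; prebuilt packages (e.g. from conda-forge) are an alternative:

```bash
git clone https://github.com/google/sentencepiece.git
cd sentencepiece
mkdir build && cd build
cmake .. && make -j "$(nproc)"
sudo make install && sudo ldconfig   # installs spm_train, spm_encode, spm_decode
```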
Step 1: Build the vocabulary
We need to set up the desired configuration with (1) the data and (2) the tokenization options:
```yaml
# wmt14_en_de.yaml
save_data: data/wmt/run/example
## Where the vocab(s) will be written
src_vocab: data/wmt/run/example.vocab.src
tgt_vocab: data/wmt/run/example.vocab.tgt

# Corpus opts:
data:
    commoncrawl:
        path_src: data/wmt/commoncrawl.de-en.en
        path_tgt: data/wmt/commoncrawl.de-en.de
        transforms: [sentencepiece, filtertoolong]
        weight: 23
    europarl:
        path_src: data/wmt/europarl-v7.de-en.en
        path_tgt: data/wmt/europarl-v7.de-en.de
        transforms: [sentencepiece, filtertoolong]
        weight: 19
    news_commentary:
        path_src: data/wmt/news-commentary-v11.de-en.en
        path_tgt: data/wmt/news-commentary-v11.de-en.de
        transforms: [sentencepiece, filtertoolong]
        weight: 3
    valid:
        path_src: data/wmt/valid.en
        path_tgt: data/wmt/valid.de
        transforms: [sentencepiece]

### Transform related opts:
#### Subword
src_subword_model: data/wmt/wmtende.model
tgt_subword_model: data/wmt/wmtende.model
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0
#### Filter
src_seq_length: 150
tgt_seq_length: 150

# silently ignore empty lines in the data
skip_empty_level: silent
```
Then we can execute the vocabulary building script. Let's set `-n_sample` to `-1` to compute the vocabulary over all the corpora:
```bash
onmt_build_vocab -config wmt14_en_de.yaml -n_sample -1
```
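Once this finishes, a quick sanity check on the generated vocabularies can be helpful (in recent OpenNMT-py versions each line holds a token and its count, though the exact layout may vary across versions):

```bash
head -n 5 data/wmt/run/example.vocab.src    # most frequent source subwords
wc -l data/wmt/run/example.vocab.{src,tgt}  # vocabulary sizes
```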
Step 2: Train the model
We need to add the following parameters to the YAML configuration:
```yaml
...

# General opts
save_model: data/wmt/run/model
keep_checkpoint: 50
save_checkpoint_steps: 5000
average_decay: 0.0005
seed: 1234
report_every: 100
train_steps: 100000
valid_steps: 5000

# Batching
queue_size: 10000
bucket_size: 32768
world_size: 2
gpu_ranks: [0, 1]
batch_type: "tokens"
batch_size: 4096
valid_batch_size: 16
batch_size_multiple: 1
max_generator_batches: 0
accum_count: [3]
accum_steps: [0]

# Optimization
model_dtype: "fp32"
optim: "adam"
learning_rate: 2
warmup_steps: 8000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model
encoder_type: transformer
decoder_type: transformer
enc_layers: 6
dec_layers: 6
heads: 8
rnn_size: 512
word_vec_size: 512
transformer_ff: 2048
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]
share_decoder_embeddings: true
share_embeddings: true
```
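With batch_type "tokens", each optimizer update spans roughly batch_size × accum_count × world_size = 4096 × 3 × 2 ≈ 25k tokens, in line with the usual Transformer-base recipe. Assuming the options above were appended to the same wmt14_en_de.yaml, training is launched with:

```bash
onmt_train -config wmt14_en_de.yaml
```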
Step 3: Translate and evaluate
We need to tokenize the test set with the same SentencePiece model as used in training:
```bash
spm_encode --model=data/wmt/wmtende.model \
    < data/wmt/test.en \
    > data/wmt/test.en.sp
spm_encode --model=data/wmt/wmtende.model \
    < data/wmt/test.de \
    > data/wmt/test.de.sp
```
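As a quick check that the right model file was used, decoding the tokenized file should round-trip back to the input (up to SentencePiece's internal text normalization, so occasional lines may legitimately differ):

```bash
spm_decode --model=data/wmt/wmtende.model \
    < data/wmt/test.en.sp | diff - data/wmt/test.en | head
```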
We can translate the test set with the following command:
```bash
for checkpoint in data/wmt/run/model_step*.pt; do
    echo "# Translating with checkpoint $checkpoint"
    base=$(basename $checkpoint)
    onmt_translate \
        -gpu 0 \
        -batch_size 16384 -batch_type tokens \
        -beam_size 5 \
        -model $checkpoint \
        -src data/wmt/test.en.sp \
        -tgt data/wmt/test.de.sp \
        -output data/wmt/test.de.hyp_${base%.*}.sp
done
```
Prior to evaluation, we need to detokenize the hypotheses:
```bash
for checkpoint in data/wmt/run/model_step*.pt; do
    base=$(basename $checkpoint)
    spm_decode \
        -model=data/wmt/wmtende.model \
        -input_format=piece \
        < data/wmt/test.de.hyp_${base%.*}.sp \
        > data/wmt/test.de.hyp_${base%.*}
done
```
Finally, we can compute detokenized BLEU with sacreBLEU:
```bash
for checkpoint in data/wmt/run/model_step*.pt; do
    echo "$checkpoint"
    base=$(basename $checkpoint)
    sacrebleu data/wmt/test.de < data/wmt/test.de.hyp_${base%.*}
done
```
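To pick the best checkpoint automatically, sacrebleu's -b/--score-only flag prints the score alone, which makes the loop's output easy to sort (a small convenience sketch over the hypothesis files produced above):

```bash
for checkpoint in data/wmt/run/model_step*.pt; do
    base=$(basename $checkpoint)
    score=$(sacrebleu -b data/wmt/test.de < data/wmt/test.de.hyp_${base%.*})
    echo -e "${score}\t${checkpoint}"
done | sort -rn | head -n 1   # highest BLEU first
```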