This is a tutorial on training and evaluating a Transformer MMA (monotonic multihead attention) simultaneous model on the MuST-C English-German dataset, following the paper *SimulMT to SimulST: Adapting Simultaneous Text Translation to End-to-End Simultaneous Speech Translation*.
MuST-C is a multilingual speech-to-text translation corpus with translations of English TED talks into 8 languages.
See data preparation for instructions on setting up the dataset.
The training script for ASR pre-training is in `exp/1a-pretrain_asr.sh`.
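A minimal invocation might look like the sketch below. The `-t` (target language) flag is an assumption mirroring the interface of `2-mma.sh` shown later; check the script header for its actual options.

```bash
# Pre-train the ASR model for the en-de direction (sketch).
# NOTE: the -t flag is assumed to match 2-mma.sh; verify against exp/1a-pretrain_asr.sh.
bash 1a-pretrain_asr.sh -t de
```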
The ASR model uses an Emformer encoder and a Transformer decoder, and is pre-trained with a joint CTC and cross-entropy loss.
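A common formulation of such a joint objective is sketched below; $\lambda$ is an assumed interpolation weight, not a value taken from the training script:

$$\mathcal{L} = (1 - \lambda)\,\mathcal{L}_{\mathrm{CE}} + \lambda\,\mathcal{L}_{\mathrm{CTC}}$$

where $\mathcal{L}_{\mathrm{CTC}}$ is computed on the encoder output and $\mathcal{L}_{\mathrm{CE}}$ is the cross-entropy of the decoder predictions.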
| MuST-C (WER) | en-de (V2) | en-es |
|---|---|---|
| dev | 9.65 | 14.44 |
| tst-COMMON | 12.85 | 14.02 |
| model | download | download |
| vocab | download | download |
The training script for MMA is in `exp/2-mma.sh`.
To train an MMA-H (hard-aligned) model with latency weight 0.1, run:

`bash 2-mma.sh -t de -m hard_aligned -l 0.1`
To train an MMA-IL (infinite-lookback) model with latency weight 0.1, run:

`bash 2-mma.sh -t de -m infinite_lookback -l 0.1`
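Models at different points on the quality-latency trade-off are obtained by training with different latency weights. A simple sweep can be scripted as below; the weight values are illustrative, not recommendations from the paper:

```bash
# Train MMA-IL models at several latency weights (illustrative values).
for l in 0.1 0.2 0.4; do
    bash 2-mma.sh -t de -m infinite_lookback -l "$l"
done
```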
If you want to finetune from an offline model (latency weight = 0), run:

`bash 2b-mma_finetune.sh -t de -m infinite_lookback -l 0.1`
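During training, the latency weight scales a latency regularizer added to the translation loss. Schematically (a sketch; $C_{\mathrm{latency}}$ stands for the expected-latency term computed from the monotonic attention, and the exact formulation in the scripts may differ):

$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda\,C_{\mathrm{latency}}$$

With $\lambda = 0$ the latency penalty vanishes, which corresponds to the offline model that the finetuning recipe above starts from.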
The evaluation instructions are in `simuleval_instruction.md`. MMA uses `default_agent.py`.
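A SimulEval run typically points the agent at the source audio list and the reference translations. The command below is a minimal sketch; the file names and any agent-specific flags (such as the model path) are assumptions, and `simuleval_instruction.md` has the exact command for this repo:

```bash
# Minimal SimulEval invocation (sketch).
# src_audio.txt lists the input audio files, tgt_de.txt holds reference
# translations; agent-specific options (e.g. model path) are omitted here.
simuleval \
    --agent default_agent.py \
    --source src_audio.txt \
    --target tgt_de.txt \
    --output eval_results
```

SimulEval reports quality (BLEU) and latency (AL, AP, DAL, plus their computation-aware `_CA` variants; AL and DAL are in milliseconds for speech input) in a JSON summary such as: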
```json
{
    "Quality": {
        "BLEU": 22.882280993425326
    },
    "Latency": {
        "AL": 1582.635476344213,
        "AL_CA": 1824.0610745999502,
        "AP": 0.7660114625870339,
        "AP_CA": 0.8291859397671248,
        "DAL": 2127.1755059232137,
        "DAL_CA": 2391.403942353481
    }
}
```