Speech to speech translation (S2ST)

We provide the implementation for speech-to-unit translation (S2UT) proposed in Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation (Popuri et al. 2022) and the various pretrained models used.

Pretrained Models

Unit extraction

We use the multilingual HuBERT model open-sourced in Textless S2ST with Real Data.

Wav2vec 2.0

| Language | Block type | Model size | Dataset | Model |
|---|---|---|---|---|
| Es | Transformer | BASE | Voxpopuli | ckpt |
| Es | Transformer | LARGE | Voxpopuli | ckpt |
| Es | Conformer | LARGE | Voxpopuli | ckpt |
| En | Transformer | BASE | Librilight | ckpt |
| En | Conformer | LARGE | Librilight | ckpt |

Unit mBART

| Unit size | Dataset | Unit config | Model |
|---|---|---|---|
| 1000 | Voxpopuli En, Es unlabelled speech | mbart_large | ckpt |

Data preparation

  1. To prepare data for S2UT finetuning, follow the steps from Direct S2ST with Discrete Units and format the data in the S2UT format. Note that we use 1000 units from the eleventh layer (--layer 11) of the multilingual HuBERT model linked above instead of the settings used there.
  2. Replace the header row of each ${SPLIT}.tsv manifest with the column names below:
var="id\taudio\tn_frames\ttgt_text\ttgt_n_frames"
sed -i "1s/.*/$var/" ${SPLIT}.tsv
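
For environments without GNU sed, the header rewrite above can be sketched in Python. This is a minimal illustration, not part of fairseq: the function name `rewrite_header` and the constant `S2UT_HEADER` are hypothetical, and the sketch assumes the manifest is a UTF-8 text file small enough to read into memory.

```python
from pathlib import Path

# Same column names as $var in the sed command, with real tab characters.
S2UT_HEADER = "id\taudio\tn_frames\ttgt_text\ttgt_n_frames"


def rewrite_header(tsv_path, header=S2UT_HEADER):
    """Replace the first line of a TSV manifest with `header`, in place.

    Hypothetical helper mirroring: sed -i "1s/.*/$var/" ${SPLIT}.tsv
    """
    path = Path(tsv_path)
    lines = path.read_text(encoding="utf-8").splitlines()
    if not lines:
        raise ValueError(f"{path} is empty; there is no header to rewrite")
    lines[0] = header  # only the header row changes; data rows are untouched
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
```

Calling `rewrite_header("train.tsv")` for each split would leave all data rows unchanged and only swap the first line.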

Training

Speech-to-unit translation (S2UT)

Here's an example for finetuning S2UT models with 1000 discrete units as target. You can download the config file and vocabulary from here:

fairseq-train $DATA_ROOT \
  --config-yaml config.yaml  \
  --task speech_to_text --arch xm_transformer \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.2 \
  --share-decoder-input-output-embed --adaptor-n-layers 1 --normalize \
  --dropout 0.1 --attention-dropout 0.1 --relu-dropout 0.1 \
  --train-subset train --valid-subset dev \
  --load-pretrained-decoder-from ${unit_mBART} --w2v-path ${wav2vec2} \
  --mask-prob 0.3 --mask-channel-length 32 --mask-channel-prob 0.25 \
  --save-dir ${MODEL_DIR} --checkpoint-activations --encoder-proj \
  --lr 0.0005 --lr-scheduler inverse_sqrt \
  --warmup-init-lr 1e-7 --warmup-updates 10000 \
  --optimizer adam --adam-betas "(0.9,0.98)" --clip-norm 10.0 \
  --max-update 20000 --max-tokens 4000 --max-tokens-valid 4000 --max-source-positions 4000 \
  --max-target-positions 4000 --update-freq 120 \
  --seed 1 --fp16 --num-workers 1
  • Adjust --update-freq according to the number of GPUs. The command above sets --update-freq 120 to simulate training with 120 GPUs on a single GPU; with 8 GPUs, for example, use --update-freq 15.
  • In the above setting we finetune the model end to end, corresponding to the full setup in the paper.
  • To apply LNA-E partial finetuning, add --finetune-w2v-params layer_norm,self_attn
  • To apply LNA-D partial finetuning, add --finetune-decoder-params encoder_attn,layer_norm,self_attn. To optionally freeze the encoder for the first K updates, use --freeze-finetune-updates ${K}
  • For LNA-E,D partial finetuning, add both of the above options.
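
Since fairseq accumulates gradients over --update-freq batches on each GPU, the number of simulated GPUs is roughly the number of physical GPUs times --update-freq. The following sketch computes the value to pass for a given machine; the function name `update_freq_for` is hypothetical, and the sketch assumes the 120-GPU target divides evenly by the number of GPUs available.

```python
def update_freq_for(num_gpus, simulated_gpus=120):
    """Return the --update-freq value that makes `num_gpus` physical GPUs
    behave like `simulated_gpus` GPUs via gradient accumulation.

    Hypothetical helper; assumes simulated_gpus is a multiple of num_gpus.
    """
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    if simulated_gpus % num_gpus != 0:
        raise ValueError(
            f"{simulated_gpus} simulated GPUs is not a multiple of {num_gpus}"
        )
    return simulated_gpus // num_gpus


if __name__ == "__main__":
    for n in (1, 8, 40):
        print(f"{n} GPU(s): --update-freq {update_freq_for(n)}")
```

When the target does not divide evenly, rounding to the nearest integer changes the effective batch size slightly, so the learning rate may need retuning.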

Unit-based HiFi-GAN vocoder

We apply the unit-based HiFi-GAN vocoders open-sourced in Textless S2ST with Real Data to convert the predicted unit sequences to waveforms.

Inference

Speech-to-unit translation (S2UT)

  1. Follow the same inference process as in fairseq-S2T to generate unit sequences (${RESULTS_PATH}/generate-${GEN_SUBSET}.txt).
fairseq-generate $DATA_ROOT \
  --config-yaml config.yaml \
  --task speech_to_text \
  --path $MODEL_DIR/checkpoint_best.pt --gen-subset $GEN_SUBSET \
  --max-tokens 10000 --max-source-positions 10000 --max-target-positions 10000 \
  --beam 10 --max-len-a 1 --max-len-b 200 \
  --results-path ${RESULTS_PATH}
  2. Convert unit sequences to waveform.
grep "^D\-" ${RESULTS_PATH}/generate-${GEN_SUBSET}.txt | \
  sed 's/^D-//ig' | sort -nk1 | cut -f3 \
  > ${RESULTS_PATH}/generate-${GEN_SUBSET}.unit

python examples/speech_to_speech/generate_waveform_from_code.py \
  --in-code-file ${RESULTS_PATH}/generate-${GEN_SUBSET}.unit \
  --vocoder $VOCODER_CKPT --vocoder-cfg $VOCODER_CFG \
  --results-path ${RESULTS_PATH} --dur-prediction
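
The grep/sed/sort/cut pipeline in step 2 keeps only the detokenized hypothesis lines (prefixed with D-), orders them by sample id, and retains the third tab-separated field, which holds the unit sequence. A Python sketch of the same extraction is shown here for cases where the shell pipeline is inconvenient; the function name `extract_units` is hypothetical, and the sketch assumes each D- line has the form `D-<id><TAB><score><TAB><units>`.

```python
def extract_units(generate_lines):
    """Return predicted unit sequences from fairseq-generate output lines,
    ordered by numeric sample id.

    Hypothetical helper mirroring:
        grep "^D\\-" | sed 's/^D-//ig' | sort -nk1 | cut -f3
    """
    hyps = []
    for line in generate_lines:
        line = line.rstrip("\n")
        if not line.startswith("D-"):
            continue  # skip source (S-), target (T-), H-, P- and log lines
        fields = line[2:].split("\t")
        sample_id, units = int(fields[0]), fields[2]
        hyps.append((sample_id, units))
    hyps.sort(key=lambda pair: pair[0])  # numeric sort on the sample id
    return [units for _, units in hyps]


if __name__ == "__main__":
    demo = [
        "S-1\tsource placeholder",
        "D-1\t-0.52\t17 17 296 5",
        "H-0\t-0.31\t63 991 2",
        "D-0\t-0.31\t63 991 2",
    ]
    for units in extract_units(demo):
        print(units)
```

Writing the returned list one sequence per line would produce a file in the same layout as generate-${GEN_SUBSET}.unit.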

Evaluation

To evaluate speech translation output, we first apply ASR to the speech output and then compute the BLEU score between the ASR-decoded text and the references using sacreBLEU.