This section describes data preparation for training and evaluation. The steps below use MuST-C En-De as the example.
- Download and unpack the package.
cd ${DATA_ROOT}
tar -zxvf MUSTC_v2.0_en-de.tar.gz
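After unpacking, a quick layout check can catch an incomplete extraction before preprocessing starts. The sketch below assumes the usual MuST-C layout, `en-de/data/<split>/{wav,txt}`, with the four splits used by fairseq's MuST-C recipe. The function name `check_mustc_layout` is illustrative and not part of this repository.

```shell
# Sketch: verify the unpacked MuST-C directory layout.
# Assumes en-de/data/<split>/{wav,txt}; adjust the split list if your copy differs.
check_mustc_layout() (
    root=$1            # e.g. ${DATA_ROOT}/en-de
    status=0
    for split in train dev tst-COMMON tst-HE; do
        for sub in wav txt; do
            if [ ! -d "${root}/data/${split}/${sub}" ]; then
                echo "missing: ${root}/data/${split}/${sub}" >&2
                status=1
            fi
        done
    done
    # The function body runs in a subshell, so exit only sets its return status.
    exit "$status"
)
```

For example, `check_mustc_layout "${DATA_ROOT}/en-de" && echo "layout OK"` prints the confirmation only when every expected directory exists.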
- Inside `DATA/get_mustc.sh`, configure the correct paths:
# the path where the data is unpacked.
DATA_ROOT=/livingrooms/george/mustc
FAIRSEQ=~/utility/fairseq
# IF NEEDED, activate your python environments
source ~/envs/apex/bin/activate
- Preprocess data with
cd DATA
bash mustc/get_mustc.sh
The fbank and manifest files should appear under `${DATA_ROOT}/en-de/`.
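To confirm that preprocessing produced output, the sketch below counts files matching a pattern in the output directory. It assumes manifests are `.tsv` files and fbank features are packed into `.zip` archives, as in fairseq's speech-to-text recipes. The exact file names written by `get_mustc.sh` are not stated here, so treat the patterns as placeholders; `count_files` is a hypothetical helper.

```shell
# Sketch: count regular files matching a name pattern directly under a directory.
count_files() (
    dir=$1
    pattern=$2
    # tr strips the padding that some wc implementations add.
    find "$dir" -maxdepth 1 -type f -name "$pattern" | wc -l | tr -d ' '
)

# Hypothetical usage after preprocessing:
#   count_files "${DATA_ROOT}/en-de" '*.tsv'   # manifests (assumed extension)
#   count_files "${DATA_ROOT}/en-de" '*.zip'   # packed fbank features (assumed extension)
```

A count of zero for either pattern suggests the preprocessing step did not finish or wrote its output elsewhere.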
- Inside `DATA/get_data_mt.sh`, configure the correct paths:
# the path where the data is unpacked.
DATA_ROOT=/livingrooms/george/mustc
FAIRSEQ=~/utility/fairseq
# IF NEEDED, activate your python environments
source ~/envs/apex/bin/activate
- (Optional) Preprocess data for MT (Seq-KD) with
cd DATA
bash mustc/get_data_mt.sh
The files should appear under `${DATA_ROOT}/en-de/mt/`.
- Configure the environment and paths in `data_path.sh` before training:
export SRC=en
export TGT=de
export DATA_ROOT=/livingrooms/george/mustc
export DATA=${DATA_ROOT}/${SRC}-${TGT}
FAIRSEQ=~/utility/fairseq
USERDIR=`realpath ../codebase`
export PYTHONPATH="$FAIRSEQ:$PYTHONPATH"
# IF NEEDED, activate your python environments
source ~/envs/apex/bin/activate
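Since these paths must be correct before training, it can help to fail early when one of them is wrong. The following is a minimal sketch of a guard that could be called after sourcing `data_path.sh`. `require_dir` is a hypothetical helper, not something this repository provides.

```shell
# Sketch: fail early if a configured path is empty or not an existing directory.
require_dir() (
    name=$1   # label used in the error message, e.g. DATA
    path=$2   # value to check
    if [ -z "$path" ]; then
        echo "error: ${name} is empty" >&2
        exit 1
    fi
    if [ ! -d "$path" ]; then
        echo "error: ${name}=${path} is not a directory" >&2
        exit 1
    fi
)

# Hypothetical usage:
#   . ./data_path.sh
#   require_dir DATA "$DATA" && require_dir FAIRSEQ "$FAIRSEQ"
```

The function body runs in a subshell, so a failed check returns a non-zero status without terminating the calling shell.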
- (Optional) To migrate data to a new system, change the paths in `scripts/migrate_data_path.sh`:
ROOT=/media/george/Data/mustc/en-de # new data path
from=/livingrooms/george/mustc/en-de # old data path
to=${ROOT}
Then run
bash scripts/migrate_data_path.sh
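The contents of `scripts/migrate_data_path.sh` are not shown here. As a rough illustration of what such a migration involves, the sketch below rewrites an old path prefix to a new one inside manifest files. It assumes the manifests are `.tsv` files that store absolute paths; `rewrite_prefix` is a hypothetical helper and not the repository's script.

```shell
# Sketch: replace an old path prefix with a new one in every .tsv file in a directory.
# Assumption: manifests are plain-text .tsv files containing absolute paths.
# Caveat: the old prefix is used as a sed pattern, so '.' matches any character,
# and prefixes containing '|', '&', or backslashes are not handled.
rewrite_prefix() (
    dir=$1
    old=$2
    new=$3
    for f in "$dir"/*.tsv; do
        [ -f "$f" ] || continue   # skip when the glob matches nothing
        # Write to a temporary file first; in-place sed flags differ across platforms.
        sed "s|${old}|${new}|g" "$f" > "${f}.tmp" && mv "${f}.tmp" "$f"
    done
)
```

With the variables above, a call would look like `rewrite_prefix "${to}" "${from}" "${to}"`, run after the data has been copied to the new location.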