For the ASR track, we use the Conformer model implemented in ESPnet to conduct speech recognition. The 160 h development set is divided into two parts: 140 h of audio recordings are merged with the MAGICDATA Mandarin Chinese Read Speech Corpus (openslr-68) for training, while the remaining 20 h are reserved for testing.
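The prepare scripts in the next step implement this split; purely as an illustration of the idea, a duration-based hold-out could look like the sketch below, where `durations` (a mapping from recording id to length in seconds) is a hypothetical input rather than something the recipe exposes.

```python
# Illustration only: hold out ~20 h of recordings for testing by
# accumulating per-recording durations; `durations` is a hypothetical
# dict mapping recording id -> length in seconds.
def split_dev(durations, test_hours=20.0):
    test, train, budget = [], [], test_hours * 3600.0
    for rec_id in sorted(durations):  # deterministic order
        if durations[rec_id] <= budget:
            test.append(rec_id)
            budget -= durations[rec_id]
        else:
            train.append(rec_id)
    return train, test
```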
Run `prepare_magicdata_160h.py` and `prepare_magicdata_750h.py` under the `scripts` folder, then start training:

`./run.sh`
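ESPnet recipes consume Kaldi-style data directories; assuming the prepare scripts emit the usual `wav.scp`, `text`, and `utt2spk` files (an assumption about this recipe, not something stated above), a quick consistency check before launching `./run.sh` might look like:

```python
# Sanity-check a Kaldi-style data directory before training.
# Assumes the standard wav.scp / text / utt2spk layout.
from pathlib import Path

def check_data_dir(data_dir):
    ids = {}
    for name in ("wav.scp", "text", "utt2spk"):
        lines = Path(data_dir, name).read_text(encoding="utf-8").splitlines()
        ids[name] = {line.split(maxsplit=1)[0] for line in lines if line.strip()}
    # text and utt2spk are keyed by utterance id and must agree exactly.
    assert ids["text"] == ids["utt2spk"], "utterance ids disagree across files"
    return ids
```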
For scoring, the sclite tool provided with ESPnet can be used to obtain the WER:
`sclite -r ${ref_path} trn -h ${output_path} trn -i rm -o all stdout > ${result_path}`
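The `-o all` report ends with a `Sum/Avg` summary row holding the Corr/Sub/Del/Ins/Err percentages; a small helper to pull them out of `${result_path}` might look like this sketch (the column layout is assumed from stock sclite output):

```python
# Extract the aggregate Corr/Sub/Del/Ins/Err percentages from an sclite
# "-o all" report; relies on the standard "Sum/Avg" summary row.
import re

def parse_sclite_summary(result_path):
    with open(result_path, encoding="utf-8") as f:
        for line in f:
            if "Sum/Avg" in line:
                # The sentence/word counts are integers, so matching only
                # decimals picks up Corr Sub Del Ins Err (and S.Err).
                nums = [float(x) for x in re.findall(r"\d+\.\d+", line)]
                keys = ("Corr", "Sub", "Del", "Ins", "Err")
                return dict(zip(keys, nums))
    raise ValueError("no Sum/Avg row found in " + str(result_path))
```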
Method | Corr (%) | Sub (%) | Del (%) | Ins (%) | Err (%) |
---|---|---|---|---|---|
Conformer | 80.1 | 13.7 | 6.3 | 2.8 | 22.8 |
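Err is the sum of the substitution, deletion, and insertion rates (13.7 + 6.3 + 2.8 = 22.8), and Corr = 100 - Sub - Del up to rounding.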
For the speaker diarization track, we use x-vectors with variational Bayes HMM clustering (VBx), trained on the VoxCeleb data (openslr-49) and the CN-Celeb corpus (openslr-82). X-vector embeddings are extracted with a ResNet; agglomerative hierarchical clustering followed by variational Bayes HMM resegmentation then produces the final result.
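As a rough illustration of the AHC initialization step (the actual VBx recipe scores segment pairs with PLDA rather than plain cosine distance, so treat this as a simplified sketch over already-extracted x-vectors):

```python
# Simplified AHC over x-vectors using cosine distance; VBx itself uses
# PLDA similarities, so this illustrates the clustering step, not the
# recipe's exact scoring. The threshold is a made-up value.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def ahc_labels(xvectors: np.ndarray, threshold: float = 0.3) -> np.ndarray:
    """xvectors: (num_segments, dim) array; returns an initial speaker label per segment."""
    # Length-normalize so cosine distances are well behaved.
    xvectors = xvectors / np.linalg.norm(xvectors, axis=1, keepdims=True)
    tree = linkage(xvectors, method="average", metric="cosine")
    # Cut the dendrogram at a fixed distance to obtain initial clusters,
    # which VB-HMM resegmentation would then refine.
    return fcluster(tree, t=threshold, criterion="distance")
```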
Run `prepare_magicdata_160h.py` under the `scripts` folder, then extract the embeddings and cluster them:

`./run_extract_embedding.sh`
`./run_clustering.sh`
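The clustering step is expected to write its output in RTTM, which the scoring step below consumes; for reference, a standard RTTM speaker line can be produced as follows (the helper and its inputs are hypothetical, the field layout is the usual RTTM one):

```python
# Write diarization output as RTTM. The function and its inputs are
# hypothetical; the recipe's shell scripts produce the real RTTM files.
def write_rttm(path, file_id, segments):
    """segments: iterable of (onset_sec, duration_sec, speaker_label)."""
    with open(path, "w", encoding="utf-8") as f:
        for onset, dur, spk in segments:
            # RTTM fields: SPEAKER <file-id> <chan> <onset> <dur> <NA> <NA> <spk> <NA> <NA>
            f.write(f"SPEAKER {file_id} 1 {onset:.3f} {dur:.3f} <NA> <NA> {spk} <NA> <NA>\n")

# e.g. write_rttm("sys.rttm", "rec_001", [(0.00, 3.25, "spk1")])  # made-up values
```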
For scoring, the DIHARD scoring tool (dscore) can be used to calculate DER, JER, and other metrics. We have already added this repository as a git submodule of our project:
`git submodule update --init --recursive`
`cd sd/dscore`
`python score.py --collar 0.25 -r ${groundtruth_rttm} -s ${predicted_rttm}`
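DER is the fraction of scored time attributed to missed speech, false-alarm speech, or speaker confusion, i.e. DER = (MISS + FA + CONFUSION) / TOTAL, evaluated here with a 0.25 s collar around reference segment boundaries; JER is the Jaccard error rate, which averages a Jaccard-index-based error over speakers.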
Method | DER (%) | JER (%) |
---|---|---|
VBx | 7.89 | 47.47 |
Tools used: Kaldi, ESPnet, VBx, DIHARD scoring tools.
Dataset: MAGICDATA Mandarin Chinese Read Speech Corpus (openslr-68).
Download: Baidu Cloud Drive (Password: utwh).
[1] Watanabe, S., Hori, T., Karita, S., Hayashi, T., Nishitoba, J., Unno, Y., Soplin, N.E.Y., Heymann, J., Wiesner, M., Chen, N. and Renduchintala, A., 2018. ESPnet: End-to-end speech processing toolkit. arXiv preprint arXiv:1804.00015.
[2] Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P. and Silovsky, J., 2011. The Kaldi speech recognition toolkit. In IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society.
[3] Gulati, A., Qin, J., Chiu, C.C., Parmar, N., Zhang, Y., Yu, J., Han, W., Wang, S., Zhang, Z., Wu, Y. and Pang, R., 2020. Conformer: Convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100.
[4] Watanabe, S., Hori, T., Kim, S., Hershey, J.R. and Hayashi, T., 2017. Hybrid CTC/attention architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal Processing, 11(8), pp.1240-1253.
[5] Landini, F., Wang, S., Diez, M., Burget, L., Matějka, P., Žmolíková, K., Mošner, L., Silnova, A., Plchot, O., Novotný, O. and Zeinali, H., 2020, May. BUT system for the second DIHARD speech diarization challenge. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6529-6533). IEEE.
[6] Diez, M., Burget, L., Landini, F. and Černocký, J., 2019. Analysis of speaker diarization based on Bayesian HMM with eigenvoice priors. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28, pp.355-368.
[7] Ryant, N., Church, K., Cieri, C., Du, J., Ganapathy, S. and Liberman, M., 2020. Third DIHARD challenge evaluation plan. arXiv preprint arXiv:2006.05815.
[8] Watanabe, S., Mandel, M., Barker, J., Vincent, E., Arora, A., Chang, X., Khudanpur, S., Manohar, V., Povey, D., Raj, D. and Snyder, D., 2020. CHiME-6 challenge: Tackling multispeaker speech recognition for unsegmented recordings. arXiv preprint arXiv:2004.09249.
[9] Fu, Y., Cheng, L., Lv, S., Jv, Y., Kong, Y., Chen, Z., Hu, Y., Xie, L., Wu, J., Bu, H. and Xu, X., 2021. AISHELL-4: An Open Source Dataset for Speech Enhancement, Separation, Recognition and Speaker Diarization in Conference Scenario. arXiv preprint arXiv:2104.03603.