Magic Data ASR-SD Challenge


ASR Track

For the ASR track, we use the Conformer model implemented in ESPnet to perform speech recognition. The 160h development set is divided into two parts: 140h of audio recordings are merged with the MAGICDATA Mandarin Chinese Read Speech Corpus (openslr-68) for training, while the remaining 20h of audio recordings are reserved for testing.


Data Preparation:

Run prepare_magicdata_160h.py and prepare_magicdata_750h.py under the scripys folder.
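If the recipe follows the usual ESPnet/Kaldi layout (an assumption; the data/train, data/dev and data/test directory names below are placeholders), the preparation step leaves Kaldi-style data directories behind, which can be sanity-checked like this:

# Check that the standard Kaldi-format files exist in each prepared data directory
for d in data/train data/dev data/test; do
  for f in wav.scp text utt2spk spk2utt; do
    [ -f "$d/$f" ] || echo "missing $d/$f"
  done
done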


Network Training:

./run.sh
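If run.sh follows the standard ESPnet recipe convention (an assumption about this repo's script), it accepts --stage, --stop_stage and --ngpu via parse_options.sh, which makes it easy to rerun only part of the pipeline:

# Stage numbers are recipe-specific; the values below are only illustrative
./run.sh --stage 0 --stop_stage 2            # data preparation and feature extraction only
./run.sh --stage 4 --stop_stage 5 --ngpu 1   # network training and decoding on a single GPU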

Decoding & Scoring:

For scoring, the sclite tool bundled with ESPnet can be used to obtain the WER.

sclite -r ${ref_path} trn -h ${output_path} trn -i rm -o all stdout > ${result_path}
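The overall Corr/Sub/Del/Ins/Err figures are summarized on the Sum/Avg line of the sclite report and can be pulled out directly:

# Print the header row and the overall scores from the sclite report
grep -e 'SPKR' -e 'Sum/Avg' ${result_path}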

Result (all values in %):

Method      Corr   Sub    Del   Ins   Err
Conformer   80.1   13.7   6.3   2.8   22.8

SD Track

For the speaker diarization track, we use VBHMM x-vectors (a.k.a. VBx), trained on VoxCeleb (openslr-49) and the CN-Celeb Corpus (openslr-82). X-vector embeddings are extracted with a ResNet, and agglomerative hierarchical clustering followed by variational Bayes HMM resegmentation is applied to obtain the final result.


Data Preparation:

Run prepare_magicdata_160h.py under the scripys folder.
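The diarization references consumed by the scoring step are plain RTTM files, one speaker turn per line. A quick peek at a prepared reference (the ${groundtruth_rttm} variable is the same placeholder used in the scoring command below) shows the layout:

# Each line follows the standard RTTM schema:
# SPEAKER <recording-id> <channel> <onset> <duration> <NA> <NA> <speaker-id> <NA> <NA>
head -n 3 ${groundtruth_rttm}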


Testing & Scoring:

./run_extract_embedding.sh
./run_clustering.sh
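Assuming the clustering step writes one RTTM file per recording (an assumption about this repo's output layout; the exp/rttm directory below is hypothetical), the per-recording outputs can be concatenated into a single system RTTM for scoring, where ${predicted_rttm} is the placeholder used by the scoring command below:

# Merge per-recording RTTMs into one file for dscore
cat exp/rttm/*.rttm > ${predicted_rttm}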

For scoring, the DIHARD Scoring Tools (dscore) can be used to calculate DER, JER, and other metrics. We have already added this repo as a git submodule under our project.

git submodule update --init --recursive
cd sd/dscore
python score.py --collar 0.25 -r ${groundtruth_rttm} -s ${predicted_rttm}
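To score many recordings in one call, dscore's score.py also accepts newline-delimited lists of RTTM paths via -R/-S; the ref_rttms/, sys_rttms/, ref.scp and sys.scp names below are placeholders:

# Build path lists and score all recordings at once
find ref_rttms/ -name '*.rttm' > ref.scp
find sys_rttms/ -name '*.rttm' > sys.scp
python score.py --collar 0.25 -R ref.scp -S sys.scp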

Result (all values in %):

Method   DER    JER
VBx      7.89   47.47

Reference Resources

Open source projects:

Kaldi
ESPnet
VBx
DIHARD Scoring Tools

Datasets:

MAGICDATA Mandarin Chinese Read Speech Corpus (openslr-68)

VoxCeleb Data (openslr-49)

CN-Celeb Corpus (openslr-82)


Model:

Baidu Cloud Drive (Password: utwh)


Reference Papers

[1] Watanabe, S., Hori, T., Karita, S., Hayashi, T., Nishitoba, J., Unno, Y., Soplin, N.E.Y., Heymann, J., Wiesner, M., Chen, N. and Renduchintala, A., 2018. ESPnet: End-to-end speech processing toolkit. arXiv preprint arXiv:1804.00015.

[2] Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P. and Silovsky, J., 2011. The Kaldi speech recognition toolkit. In IEEE 2011 workshop on automatic speech recognition and understanding (No. CONF). IEEE Signal Processing Society.

[3] Gulati, A., Qin, J., Chiu, C.C., Parmar, N., Zhang, Y., Yu, J., Han, W., Wang, S., Zhang, Z., Wu, Y. and Pang, R., 2020. Conformer: Convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100.

[4] Watanabe, S., Hori, T., Kim, S., Hershey, J.R. and Hayashi, T., 2017. Hybrid CTC/attention architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal Processing, 11(8), pp.1240-1253.

[5] Landini, F., Wang, S., Diez, M., Burget, L., Matějka, P., Žmolíková, K., Mošner, L., Silnova, A., Plchot, O., Novotný, O. and Zeinali, H., 2020, May. BUT system for the Second DIHARD Speech Diarization Challenge. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6529-6533). IEEE.

[6] Diez, M., Burget, L., Landini, F. and Černocký, J., 2019. Analysis of speaker diarization based on Bayesian HMM with eigenvoice priors. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28, pp.355-368.

[7] Ryant, N., Church, K., Cieri, C., Du, J., Ganapathy, S. and Liberman, M., 2020. Third DIHARD challenge evaluation plan. arXiv preprint arXiv:2006.05815.

[8] Watanabe, S., Mandel, M., Barker, J., Vincent, E., Arora, A., Chang, X., Khudanpur, S., Manohar, V., Povey, D., Raj, D. and Snyder, D., 2020. CHiME-6 challenge: Tackling multispeaker speech recognition for unsegmented recordings. arXiv preprint arXiv:2004.09249.

[9] Fu, Y., Cheng, L., Lv, S., Jv, Y., Kong, Y., Chen, Z., Hu, Y., Xie, L., Wu, J., Bu, H. and Xu, X., 2021. AISHELL-4: An Open Source Dataset for Speech Enhancement, Separation, Recognition and Speaker Diarization in Conference Scenario. arXiv preprint arXiv:2104.03603.
