CER Performance of Reconstructed Audio #34
In our experiments, the CER and WER results seem satisfactory. Could you kindly provide further experimental details, such as whether you used the WavTokenizer-small or WavTokenizer-medium version? Additionally, on which test set were the evaluations conducted? Please note that the WavTokenizer-small version has very limited generalization capability.
I trained WavTokenizer on about 60,000 hours of data, with a 1:1 ratio of English to Chinese. I have trained for 3 epochs so far, and when checking the Chinese reconstructions I found some incorrect pronunciations.
Training for only three epochs seems insufficient. Since the data is randomly sampled during training, a full pass through the dataset has likely not yet been completed. Extending the training to 12-24 epochs could yield better results.
After reconstructing our own Korean speech data with the WavTokenizer-medium-speech-75token checkpoint and measuring the CER, we observed a significant drop in performance. Could you share the CER or WER comparison results from your evaluations? In our experiment, we obtained the following results:
The WavTokenizer-medium-speech model was trained on a very limited amount of Korean data, so this phenomenon is expected. You may consider testing the WER or CER on the English test set (LibriTTS test-clean). Additionally, retraining a version of WavTokenizer on Korean data is likely to yield significantly improved performance.
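For reference, a typical way to measure this is to transcribe the reconstructed audio with an ASR model (e.g. Whisper for English, or a Korean ASR system for the Korean set) and compute the CER against the reference transcripts. The snippet below is a minimal, self-contained sketch of the CER computation itself (Levenshtein edit distance over characters, spaces stripped); the ASR transcription step and any specific model choice are assumptions, not part of the WavTokenizer repo.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            # deletion, insertion, or substitution (free if chars match)
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edits / reference length, ignoring spaces."""
    ref = reference.replace(" ", "")
    hyp = hypothesis.replace(" ", "")
    return edit_distance(ref, hyp) / max(len(ref), 1)


# Hypothetical usage: `hyp` would come from an ASR model run on the
# reconstructed audio; `ref` is the ground-truth transcript.
ref = "hello world"
hyp = "hallo world"
print(cer(ref, hyp))  # 1 substitution over 10 reference characters -> 0.1
```

In practice, libraries such as jiwer provide the same metric; the point is only that both the original and reconstructed audio should be transcribed by the same ASR model so the CER difference isolates the codec's effect.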
@YoungloLee @jishengpeng Could you please share the loss curves for your models trained with 60,000 and 80,000 hours of data?
When using the 40 tokens/s configuration, the quality of the reconstructed audio is very good, but there are often mispronunciations. Have you measured the CER of the reconstructed audio?