Voice Quality Issues On Linux (Potentially Due To eSpeak TTS Driver) #239
Yeah, the SystemEngine based on pyttsx3 is super basic. Any specific reason you’re sticking with it? If you’re after better voice quality, I’d recommend checking out CoquiEngine or StyleTTSEngine for local synthesis. If you’re fine with using an internet-based service, EdgeEngine or GTTSEngine are solid options and way easier to set up. I don’t mess with eSpeak on Linux much, so I can’t give you detailed advice there. Swapping it for eSpeak-NG might help a bit, but honestly, it’ll still sound pretty rough compared to the other engines I mentioned. If you need to use eSpeak, you could try using RVC for post-processing the chunks, which will boost the quality. That said, setting it up isn’t exactly straightforward.
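A minimal sketch of what swapping engines looks like (assumptions: RealtimeTTS installed with the Coqui extras; the constructor defaults and the shutdown() call follow the repo’s examples as best recalled, so check the README for the exact options):

```python
# Sketch only: swap SystemEngine (pyttsx3/eSpeak on Linux) for a neural engine.
# EdgeEngine or GTTSEngine could be dropped in the same way for internet-based synthesis.
from RealtimeTTS import TextToAudioStream, CoquiEngine

engine = CoquiEngine()          # local neural TTS; downloads the XTTS model on first run
stream = TextToAudioStream(engine)
stream.feed("This should sound far less robotic than the eSpeak-based SystemEngine.")
stream.play()                   # play_async() is available for non-blocking playback

engine.shutdown()               # CoquiEngine runs a worker process; shut it down when finished
```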
Hi @KoljaB,
Thank you so much for your recommendations! I’ve tried out the engines you suggested, and here’s what I found:

1) CoquiEngine:
   1.1) It works well overall, but being based on the XTTS v2 model (with a GPT-2 decoder transformer), consistency is an issue (as also discussed in this thread), which is something I prioritize.
   1.2) I also read that Coqui has discontinued operations, which raises concerns about future compatibility. Please correct me if I’m mistaken.

2) StyleTTSEngine:
   2.1) I initially encountered extreme stuttering after the first word. Here is the error output. I made a small tweak in PyAudio’s `__init__.py` file (located at `venv/lib/python3.12/site-packages/pyaudio/__init__.py`), which resolved the stuttering issue, and the engine is now producing great output. However, I was wondering: do you foresee any potential edge-case issues with this change?
   2.2) Regarding the StyleTTS2 voices: where can I download the required files (model configs, checkpoints, and reference audio) for the voice configurations mentioned in style_test.py? (For reference) style_test.py:

   I used yl4579/StyleTTS2-LibriTTS for the model files (config & checkpoint), while the reference audio was sourced as a separate WAV file. My setup:

   Is this approach correct, or are there better ways to implement it?

3) eSpeak-NG:
   3.1) While I experimented with eSpeak-NG, the improvement was minimal.

Any further guidance would be greatly appreciated! Thank you again for your incredible work and support!

Best regards,
SamRaina
Yeah, consistency with Coqui XTTS is problematic, you’re right about that. There’s no fix for it I’m aware of. And yes, Coqui itself has shut down, but RealtimeTTS actually uses the Idiap fork, which is still actively maintained and should be good for now.

I don’t see any obvious issues with your PyAudio change. From the log, it looks like the buffer availability at 225 is super low. You could also lower the subchunk size in RealtimeTTS for better results: https://github.com/KoljaB/RealtimeTTS/blob/master/RealtimeTTS/stream_player.py#L452

The required models for the StyleTTS test file aren’t publicly available. I trained those voices myself using voice data I don’t own the rights to. I modified pitch and timbre and merged voices together to create those models, but since the TTS space is really tightening up regarding voice rights and I don’t want to risk getting into legal issues, I can’t provide the trained models publicly, especially not the Nicole model.

Your integration approach for StyleTTS looks solid. I recommend training your own model, for example by following this tutorial: https://www.youtube.com/watch?v=dCmAbcJ5v5k
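To illustrate the idea behind lowering the sub-chunk size, here is a generic PyAudio sketch, not RealtimeTTS’s actual stream_player code (the 512-byte value and the 24 kHz / 16-bit mono format are assumptions): writing the synthesized PCM to the output stream in smaller pieces lets playback start sooner and gives finer-grained control over buffering.

```python
# Generic sketch, not the RealtimeTTS implementation: feed PCM audio to a
# PyAudio output stream in small sub-chunks instead of one big write.
import pyaudio

SUB_CHUNK_BYTES = 512  # assumption: smaller sub-chunks -> finer-grained playback/buffer handling

def play_pcm(pcm_bytes: bytes, rate: int = 24000) -> None:
    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=rate, output=True)
    try:
        for i in range(0, len(pcm_bytes), SUB_CHUNK_BYTES):
            stream.write(pcm_bytes[i:i + SUB_CHUNK_BYTES])
    finally:
        stream.stop_stream()
        stream.close()
        pa.terminate()
```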
First of all, thank you @KoljaB, for this amazing repository and all the work you've put into it!
I’ve been using RealtimeTTS on Linux and noticed significant voice quality issues. After some investigation, I believe the problem might be related to the TTS driver used by pyttsx3. As you might already know, pyttsx3 supports three drivers: sapi5 for Windows, nsss for macOS, and eSpeak for other platforms (including Linux). On Linux, the eSpeak driver produces very robotic and sometimes gibberish audio output, which often breaks. This appears to be a known limitation of eSpeak. The output I’m getting is similar to the voice generated in this tutorial at the 10:05 mark: eSpeak YouTube Tutorial
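For context, a minimal pyttsx3 sketch showing the driver fallback and the few knobs it exposes on Linux (the voice-variant string in the commented-out line is an assumption about locally installed eSpeak voices, not something guaranteed to exist):

```python
# Minimal sketch: pyttsx3 picks sapi5 on Windows, nsss on macOS and espeak elsewhere.
# On Linux the only real knobs are voice, rate and volume; voice ids are system dependent.
import pyttsx3

engine = pyttsx3.init()                    # falls back to the espeak driver on Linux
for voice in engine.getProperty("voices"):
    print(voice.id, voice.name)            # list what the espeak backend actually offers

engine.setProperty("rate", 150)            # slowing espeak down makes it slightly less choppy
# engine.setProperty("voice", "english+f3")  # assumption: an espeak voice variant, if installed
engine.say("Testing the espeak driver on Linux.")
engine.runAndWait()
```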
On the other hand, the same code produces significantly better results on Windows with the sapi5 driver. That said, I might be missing something or misunderstanding the root cause here. If there’s any way to either improve the performance of eSpeak on Linux or use an alternative TTS driver to achieve comparable output to Windows, your guidance would be hugely appreciated!
Thanks again for this amazing project and for your continued support! Looking forward to any insights or suggestions from your side!
Best Regards,
SamRaina