
Voice Quality Issues On Linux (Potentially Due To eSpeak TTS Driver) #239

Open
samrainax opened this issue Dec 26, 2024 · 3 comments

@samrainax commented Dec 26, 2024

First of all, thank you, @KoljaB, for this amazing repository and all the work you've put into it!

I've been using RealtimeTTS on Linux and noticed significant voice quality issues. After some investigation, I believe the problem might be related to the TTS driver used by pyttsx3. As you might already know, pyttsx3 supports three drivers: sapi5 for Windows, nsss for macOS, and eSpeak for other platforms (including Linux). On Linux, the eSpeak driver produces very robotic and sometimes gibberish audio output, which often breaks up. This appears to be a known limitation of eSpeak. The output I'm getting is similar to the voice generated at the 10:05 mark of this tutorial: eSpeak YouTube Tutorial
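For clarity, here's roughly what I'm running to reproduce it (a minimal pyttsx3 sketch; the explicit driverName is just for illustration, since pyttsx3 falls back to eSpeak on Linux automatically):

```python
import pyttsx3

# Explicit driver selection -- on Linux pyttsx3 picks eSpeak anyway;
# passing driverName here is only to make the driver in use obvious.
engine = pyttsx3.init(driverName="espeak")
engine.setProperty("rate", 150)  # speaking rate in words per minute
engine.say("This is roughly what the eSpeak output sounds like on Linux.")
engine.runAndWait()
```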

On the other hand, the same code produces significantly better results on Windows with the sapi5 driver. That said, I might be missing something or misunderstanding the root cause here. If there’s any way to either improve the performance of eSpeak on Linux or use an alternative TTS driver to achieve comparable output to Windows, your guidance would be hugely appreciated!

Thanks again for this amazing project and for your continued support! Looking forward to any insights or suggestions from your side!

Best Regards,
SamRaina

@KoljaB (Owner) commented Dec 26, 2024

Yeah, the SystemEngine based on pyttsx3 is super basic. Any specific reason you're sticking with it? If you’re after better voice quality, I’d recommend checking out CoquiEngine or StyleTTSEngine for local synthesis. If you’re fine with using an internet-based service, EdgeEngine or GTTSEngine are solid options and way easier to set up.
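Swapping engines is basically a one-liner since they all share the same streaming interface; roughly like this (an untested sketch from memory, with CoquiEngine shown):

```python
from RealtimeTTS import TextToAudioStream, CoquiEngine

# The same pattern works for StyleTTSEngine, EdgeEngine, GTTSEngine etc. --
# only the engine construction changes.
engine = CoquiEngine()  # downloads the XTTS model on first run
stream = TextToAudioStream(engine)
stream.feed("This should sound a lot less robotic than eSpeak.")
stream.play()
```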

I don't mess with eSpeak on Linux much, so I can't give you detailed advice there. Swapping it for eSpeak-NG might help a bit, but honestly, it'll still sound pretty rough compared to the other engines I mentioned. If you need to use eSpeak, you could try using RVC to post-process the chunks, which will boost the quality. That said, setting it up isn't exactly straightforward.

@samrainax (Author)

Hi @KoljaB,

  • No particular reason to stick with pyttsx3. I was just looking for a consistent, human-like, local TTS solution that works well on Linux, which is why I'm also avoiding internet-based services like EdgeEngine and GTTSEngine.

Thank you so much for your recommendations! I’ve tried out the engines you suggested, and here’s what I found:

1) CoquiEngine :

1.1) It works well overall, but since it's based on the XTTS v2 model (with a GPT-2 decoder transformer), consistency is an issue (as also discussed in this thread), and consistency is something I prioritize.

1.2) I also read that Coqui (the company behind the engine) has discontinued operations, which raises concerns about future compatibility. Please correct me if I'm mistaken.

2) StyleTTSEngine:

2.1) I initially encountered extreme stuttering after the first word. Here is the error output. As a workaround, I made a small tweak in PyAudio's `__init__.py` (located at `venv/lib/python3.12/site-packages/pyaudio/__init__.py`), changing

```python
frames_per_buffer = pa.paFramesPerBufferUnspecified
```

to

```python
frames_per_buffer = 256
```

This resolved the stuttering issue, and the engine is now producing great output. However, I was wondering: do you foresee any potential edge-case issues with this change?
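(For context, my understanding is that the same value can also be passed where the output stream is opened, rather than patching site-packages; a rough standalone PyAudio sketch, with format and rate as placeholder values:)

```python
import pyaudio

pa = pyaudio.PyAudio()
# Pass frames_per_buffer explicitly at the call site instead of relying
# on paFramesPerBufferUnspecified. Format/rate below are placeholders,
# not the values RealtimeTTS actually uses.
stream = pa.open(format=pyaudio.paInt16,
                 channels=1,
                 rate=24000,
                 output=True,
                 frames_per_buffer=256)
```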

2.2) Regarding the StyleTTS2 voices: Where can I download the required files (model configs, checkpoints, and reference audio) for the voice configurations mentioned in the style_test.py?

(For reference) `style_test.py`:

```python
# Adjust these paths to your local setup
styletts_root = "D:/Dev/StyleTTS_Realtime/StyleTTS2"

voice_1 = StyleTTSVoice(
    model_config_path="D:/Data/Models/style/Nicole/config.yml",
    model_checkpoint_path="D:/Data/Models/style/Nicole/epoch_2nd_00036.pth",
    ref_audio_path="D:/Data/Models/style/Nicole/file___1_file___1_segment_98.wav"
)
```

I used yl4579/StyleTTS2-LibriTTS for the model files (config & checkpoint), while the reference audio was sourced as a separate WAV file. My setup:

```python
# Imports are my assumption here; the exact import path may vary
# by RealtimeTTS version.
from RealtimeTTS import StyleTTSEngine, StyleTTSVoice

styletts_root = r"/home/xxxx/Desktop/xxxxx-koljaB/StyleTTS2"
voice_1 = StyleTTSVoice(
    model_config_path=r"/home/xxxx/Downloads/Models_LJSpeech_config.yml",
    model_checkpoint_path=r"/home/xxxx/Downloads/epoch_2nd_00100.pth",
    ref_audio_path=r"/home/xxxx/Downloads/v1.wav",
)
engine = StyleTTSEngine(style_root=styletts_root, voice=voice_1)
```

Is this approach correct, or are there better ways to implement it?
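For completeness, I'm then driving the engine through the usual streaming pipeline (standard TextToAudioStream usage, as in the repo's examples):

```python
from RealtimeTTS import TextToAudioStream

# Feed the StyleTTS engine configured above into the streaming player.
stream = TextToAudioStream(engine)
stream.feed("Testing the LibriTTS checkpoint with my own reference audio.")
stream.play()
```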

3) eSpeak-NG:

3.1) I experimented with eSpeak-NG as well, but the improvement was minimal.


Any further guidance would be greatly appreciated!

Thank you again for your incredible work and support!

Best regards,
SamRaina

@KoljaB (Owner) commented Dec 31, 2024

Yeah, consistency with Coqui XTTS is problematic; you're right about that. There's no fix for it that I'm aware of. And yes, Coqui itself has shut down, but RealtimeTTS actually uses the Idiap fork, which is still actively maintained and should be good for now.

I don't see any obvious issues with your PyAudio change. From the log, it looks like the buffer availability at 225 is super low. You could also lower the subchunk size in RealtimeTTS for better results: https://github.com/KoljaB/RealtimeTTS/blob/master/RealtimeTTS/stream_player.py#L452

The required models for the StyleTTS test file aren't publicly available. I trained those voices myself using voice data I don't own the rights to: I modified pitch and timbre and merged voices together to create those models. But as the TTS space is really tightening up around voice rights, and I don't want to risk getting into legal issues, I can't provide the trained models publicly, especially not the Nicole model.

Your integration approach for StyleTTS looks solid. I recommend training your own model, for example following this tutorial: https://www.youtube.com/watch?v=dCmAbcJ5v5k
