I would like to request support for OuteTTS in this repository.
OuteTTS performs text-to-speech (TTS) using a large language model (Qwen2.5 0.5B) to generate audio tokens. This approach enables streaming TTS, making it efficient and responsive.
Additionally, OuteTTS includes a GGUF version with an impressively low real-time factor, making it suitable for resource-constrained environments.
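To make the approach concrete, here is a minimal, hypothetical sketch of an LLM-to-audio-token pipeline. The helper objects `token_lm` and `audio_codec`, and the prompt template, are illustrative stand-ins rather than the actual OuteTTS API; the point is only that audio tokens can be decoded in blocks as they are generated, which is what makes streaming possible.

```python
# Conceptual sketch of LLM-based streaming TTS (illustration only, not the OuteTTS API).
from typing import Iterator, List


def stream_tts(text: str, token_lm, audio_codec, block_size: int = 64) -> Iterator[List[float]]:
    """Yield audio chunks as soon as enough audio tokens have been generated.

    token_lm.generate(prompt) is assumed to yield discrete audio-token ids one at a
    time; audio_codec.decode(tokens) is assumed to turn a list of token ids into
    waveform samples. Both are hypothetical helpers.
    """
    prompt = f"<|text|>{text}<|audio|>"  # illustrative prompt template, not the real one
    buffer: List[int] = []
    for token_id in token_lm.generate(prompt):
        buffer.append(token_id)
        if len(buffer) >= block_size:
            # Decoding per block is what makes the pipeline streamable:
            # playback can begin while the LLM is still generating.
            yield audio_codec.decode(buffer)
            buffer = []
    if buffer:  # flush any remaining tokens at the end of generation
        yield audio_codec.decode(buffer)
```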
Model Sizes and Performance
OuteTTS offers multiple quantization levels, catering to different needs:
q2: Smallest model; lowest real-time factor (fastest), but lowest accuracy.
q3: Balanced option with average performance.
q4: Ideal for edge devices; offers the best trade-off between accuracy and efficiency. This size is widely recommended.
q5 to q8: Larger models with increasing accuracy at the cost of higher memory usage; q8 achieves accuracy close to FP16 precision.
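For reference, a rough sketch of loading one of these quantized GGUF checkpoints through the llama-cpp-python bindings is shown below. The model filename and the prompt string are placeholders, and the separate step that decodes the generated audio tokens back into a waveform is not shown.

```python
# Minimal sketch, assuming the llama-cpp-python bindings are installed
# (pip install llama-cpp-python). The model path is a placeholder; substitute
# whichever quantization (q2 ... q8) fits the target device.
from llama_cpp import Llama

llm = Llama(
    model_path="OuteTTS-0.2-500M-Q4_K_M.gguf",  # placeholder path; q4 shown as the edge-device choice
    n_ctx=4096,    # context window covering the text prompt plus generated audio tokens
    n_threads=4,   # tune to the CPU cores available on the device
)

# The quantized LLM only produces the audio-token stream; converting those
# tokens into a waveform requires OuteTTS's separate decoder/codec step.
out = llm("<|text|>Hello world<|audio|>", max_tokens=512)  # illustrative prompt, not the real template
print(out["choices"][0]["text"][:80])
```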
Benefits of Adding OuteTTS Support:
Enables real-time, low-latency TTS.
Scales across a range of devices, from low-power edge devices to high-performance systems.
Provides flexibility with different model sizes to balance accuracy and efficiency.
Thank you for considering this feature request!
Thanks for suggesting OuteTTS and for the detailed insights.
After testing, I found that OuteTTS often produces artifacts at the start of synthesis, and the audio fades out too quickly at the end, so the last word is sometimes hard to hear. It also lacks the expressiveness of alternatives like XTTS and StyleTTS2, which sound more emotional. Requiring users to install llama.cpp for the GGUF version makes setup complicated, and without it synthesis is not particularly fast. Models like Kokoro-82M offer better performance and quality.
For these reasons, I won't be adding OuteTTS at this time.