In the paper, the decoder is composed of three modules: a 3B LLM that takes text as input, a flow-matching model that takes dual-tokens as input, and a vocoder.
Does this mean that the 130B Step-Audio Omni model can generate both text and audio-token responses, so that when it answers with text we use the LLM + vocoder, and when it outputs speech tokens we use flow matching + vocoder?
Or
can the 130B Step-Audio respond ONLY with text tokens, so that we first use the decoder LLM to translate the text tokens into dual-tokens, and then apply flow matching and the vocoder?
Which one is it?
Thanks!
The 130B Step-Audio Omni model can generate both text and audio-token responses, but in the published version the 130B model emits text tokens and the 3B LLM handles the rest.
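For readers trying to picture the data flow in the published version, here is a rough sketch of the stages in order: 130B model (text out), 3B LLM (text to dual-tokens), flow matching, vocoder. Every function name, constant, and return value below is a hypothetical stand-in and not the Step-Audio API; the stubs return dummy data so that the script runs, and the mel-like output of the flow-matching stage is an assumption.

```python
from typing import List, Tuple

# NOTE: all names and sizes here are hypothetical placeholders, not the
# Step-Audio codebase. The stubs only illustrate the order of the stages.
N_MELS = 4   # assumed feature size per frame (illustrative only)
HOP = 8      # assumed number of waveform samples per frame (illustrative only)


def omni_130b_respond(user_audio: List[float]) -> str:
    """Stand-in for the 130B model: understands the input speech, emits text."""
    return "hello"


def tts_3b_text_to_tokens(text: str) -> List[int]:
    """Stand-in for the 3B LLM: maps response text to discrete dual-tokens."""
    return [ord(ch) % 1024 for ch in text]


def flow_matching(tokens: List[int]) -> List[List[float]]:
    """Stand-in for flow matching: tokens to acoustic frames (assumed mel-like)."""
    return [[tok / 1024.0] * N_MELS for tok in tokens]


def vocoder(frames: List[List[float]]) -> List[float]:
    """Stand-in for the vocoder: acoustic frames to waveform samples."""
    return [sum(frame) / len(frame) for frame in frames for _ in range(HOP)]


def respond_with_speech(user_audio: List[float]) -> Tuple[str, List[float]]:
    text = omni_130b_respond(user_audio)    # "brain": decides what to say
    tokens = tts_3b_text_to_tokens(text)    # "mouth", step 1: text to dual-tokens
    frames = flow_matching(tokens)          # "mouth", step 2: tokens to features
    wav = vocoder(frames)                   # "mouth", step 3: features to audio
    return text, wav


if __name__ == "__main__":
    text, wav = respond_with_speech([0.0] * 16)
    print(text, len(wav))
```

In this sketch the 130B model never touches audio tokens on the output side; everything after the text response belongs to the decoder.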
OK, so the 3B LLM is a text-to-speech-token translator, just like a normal AR token-based SLM such as CosyVoice, and all of the input speech understanding and response ability lies in the 130B Omni model? In other words, the 130B LLM is the brain and the 3B LLM is the mouth?