How to understand the decoder part? #104

JohnHerry · 2025-02-25T08:27:22Z

In the paper, the decoder is composed of three modules: a 3B LLM that takes text as input, a flow-matching model that takes dual tokens as input, and a vocoder.
Does this mean that the 130B Step-Audio Omni model can generate both text and audio-token responses, so that when it answers with text we use the LLM + vocoder, and when it outputs speech tokens we use flow matching + vocoder?
Or
can the 130B Step-Audio model respond ONLY with text tokens, so that we must first use the decoder LLM to translate the text tokens into dual tokens, and then apply flow matching and the vocoder?
Which one is it?
Thanks!

yanchaomars · 2025-02-28T02:30:30Z

The 130B Step-Audio Omni model can generate both text and audio-token responses, but in the published version the 130B model emits text tokens and the 3B LLM handles the rest.

JohnHerry · 2025-02-28T02:51:22Z

> The 130B Step-Audio Omni model can generate both text and audio-token responses, but in the published version the 130B model emits text tokens and the 3B LLM handles the rest.

OK, so the 3B LLM is a text-to-speech-token translator, just like a normal AR token-based SLM such as CosyVoice, and all of the input speech understanding and response ability lies in the 130B Omni model? In other words, the 130B LLM is the brain and the 3B LLM is the mouth?
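
For readers following along, the data flow described in this thread for the published version can be sketched roughly as below. This is only an illustration of the stage ordering, not the Step-Audio API: `SpeechPipeline`, its field names, the type aliases, and the toy stand-in functions in `demo` are all hypothetical, and the assumption that the flow-matching stage hands mel-spectrogram-like frames to the vocoder is not stated in the thread.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical type aliases, for illustration only.
TextTokens = List[int]
SpeechTokens = List[int]          # the "dual tokens" mentioned in the thread
AcousticFrames = List[List[float]]  # assumed to be mel-spectrogram-like frames
Waveform = List[float]


@dataclass
class SpeechPipeline:
    """Stage ordering only; every field name here is a placeholder."""
    understand: Callable[[Waveform], TextTokens]                 # 130B model ("brain")
    text_to_speech_tokens: Callable[[TextTokens], SpeechTokens]  # 3B LLM ("mouth")
    flow_matching: Callable[[SpeechTokens], AcousticFrames]
    vocoder: Callable[[AcousticFrames], Waveform]

    def respond(self, audio_in: Waveform) -> Waveform:
        # 1. The 130B model handles understanding and emits text tokens.
        text = self.understand(audio_in)
        # 2. The 3B LLM translates text tokens into speech tokens.
        speech_tokens = self.text_to_speech_tokens(text)
        # 3. Flow matching turns speech tokens into acoustic frames.
        frames = self.flow_matching(speech_tokens)
        # 4. The vocoder renders the frames as a waveform.
        return self.vocoder(frames)


def demo() -> Tuple[List[str], Waveform]:
    """Run the pipeline with toy stand-ins and record the call order."""
    trace: List[str] = []

    def understand(audio: Waveform) -> TextTokens:
        trace.append("130B")
        return [1, 2, 3]

    def text_to_speech_tokens(text: TextTokens) -> SpeechTokens:
        trace.append("3B")
        return [t * 10 for t in text]

    def flow_matching(tokens: SpeechTokens) -> AcousticFrames:
        trace.append("flow_matching")
        return [[float(t)] for t in tokens]

    def vocoder(frames: AcousticFrames) -> Waveform:
        trace.append("vocoder")
        return [frame[0] / 100 for frame in frames]

    pipe = SpeechPipeline(understand, text_to_speech_tokens, flow_matching, vocoder)
    return trace, pipe.respond([0.0, 0.1])
```

Calling `demo()` returns the order in which the stages ran together with the toy waveform, which makes it easy to check that the text tokens pass through the 3B LLM before flow matching and the vocoder.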
