How to understand the decoder part? #104

Open
JohnHerry opened this issue Feb 25, 2025 · 2 comments

Comments

@JohnHerry

In the paper, the decoder is composed of three modules: a 3B LLM that takes text as input, a flow-matching model that takes dual-tokens as input, and a vocoder.
Does this mean that the 130B Step-Audio Omni model can generate both text and audio-token responses, so that when it answers with text we use the LLM + vocoder, and when it outputs speech tokens we use flow matching + vocoder?
Or
can the 130B Step-Audio model respond ONLY with text tokens, so that we must first use the decoder LLM to translate the text tokens into dual-tokens, and then apply flow matching and the vocoder?
Which one is it?
Thanks!

@yanchaomars
Collaborator

The 130B Step-Audio Omni model can generate both text and audio-token responses, but in the published version the 130B model emits text tokens and the 3B LLM handles the rest.

@JohnHerry
Author

The 130B Step-Audio Omni model can generate both text and audio-token responses, but in the published version the 130B model emits text tokens and the 3B LLM handles the rest.

OK, so the 3B LLM is a text-to-speech-token translator, just like a normal AR token-based SLM, e.g. CosyVoice, and all of the input speech understanding and response ability lies in the 130B Omni model? In other words, the 130B LLM is the brain and the 3B LLM is the mouth?
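
To make the flow discussed in this thread concrete, here is a minimal sketch of the published pipeline as I understand it: 130B emits text tokens, the 3B LLM turns them into dual-tokens, then flow matching and the vocoder produce audio. Every name here (`SpeechPipeline`, `omni_llm`, `tts_llm`, `flow_matching`, `vocoder`, and the type aliases) is a hypothetical placeholder, not the Step-Audio API, and the intermediate "features" type is an assumption since the thread does not say what flow matching outputs.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical type aliases for illustration only; not taken from the Step-Audio code.
InputTokens = List[int]       # whatever the 130B model consumes
TextTokens = List[int]        # text tokens emitted by the 130B model
DualTokens = List[int]        # speech tokens produced by the 3B LLM
Features = List[List[float]]  # assumed intermediate acoustic features
Waveform = List[float]        # final audio samples


@dataclass
class SpeechPipeline:
    """Chains the four stages in the order described in this thread."""
    omni_llm: Callable[[InputTokens], TextTokens]   # 130B "brain": understanding + text response
    tts_llm: Callable[[TextTokens], DualTokens]     # 3B "mouth": text tokens -> dual-tokens
    flow_matching: Callable[[DualTokens], Features] # dual-tokens -> acoustic features
    vocoder: Callable[[Features], Waveform]         # acoustic features -> waveform

    def respond(self, user_input: InputTokens) -> Waveform:
        text_tokens = self.omni_llm(user_input)
        dual_tokens = self.tts_llm(text_tokens)
        features = self.flow_matching(dual_tokens)
        return self.vocoder(features)


# Toy stand-ins so the sketch runs end to end; real stages would be neural models.
toy_pipeline = SpeechPipeline(
    omni_llm=lambda tokens: [t + 1 for t in tokens],
    tts_llm=lambda text: [t * 10 for t in text],
    flow_matching=lambda dual: [[float(t)] for t in dual],
    vocoder=lambda feats: [frame[0] for frame in feats],
)
```

Calling `toy_pipeline.respond([0, 1])` just pushes the toy values through all four stages in order. The point is only the ordering of the stages, not how any of them is implemented.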
