Do you have any plans for Speech-to-Text or Speech-to-Speech end2end models? #78
Comments
For your first idea, I think the ASR example has already done it.
I mean speech inputs with LLM outputs.
Your "text" means response, right? |
Exactly.
Are you talking about ASR for the speech-to-text task? If so, you can try our ASR example. We may support speech-to-speech in the future, but as this task is much more difficult than ASR or TTS, it is more a matter of combining the two seamlessly (see the cascaded sketch below). Thank you for your advice; we will take it into consideration. If you have any further questions or need additional assistance, feel free to ask!
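For illustration only, here is a minimal sketch of the cascaded route (ASR → LLM → TTS). It assumes `openai-whisper`, `transformers`, and `pyttsx3` are installed; the model names and file paths are placeholders, not part of this repo:

```python
# Hypothetical cascaded speech-to-speech pipeline: ASR -> LLM -> TTS.
# Assumes `pip install openai-whisper transformers pyttsx3`.
import whisper
from transformers import pipeline
import pyttsx3

def speech_to_speech(in_wav: str, out_wav: str) -> str:
    # 1) ASR: transcribe the input speech with Whisper.
    asr_model = whisper.load_model("base")
    text_in = asr_model.transcribe(in_wav)["text"]

    # 2) LLM: generate a text response (any causal LM works here).
    llm = pipeline("text-generation", model="gpt2")
    text_out = llm(text_in, max_new_tokens=64)[0]["generated_text"]

    # 3) TTS: synthesize the response text back to audio.
    engine = pyttsx3.init()
    engine.save_to_file(text_out, out_wav)
    engine.runAndWait()
    return text_out

if __name__ == "__main__":
    print(speech_to_speech("question.wav", "answer.wav"))
```

An end2end model would replace these three stages with a single speech-in/speech-out network, which is why the task is harder than ASR or TTS alone.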
I used the SLAM framework for fine-tuning, but the inference results on LibriSpeech are not as good as directly using the open-source Whisper model. Why is that?
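One way to make that comparison concrete is to decode the same LibriSpeech split with open-source Whisper and score WER. A minimal sketch, assuming `openai-whisper` and `jiwer` are installed; the audio paths and reference transcripts below are placeholders for your own test data:

```python
# Baseline check: decode with open-source Whisper and compute WER.
# Assumes `pip install openai-whisper jiwer`.
import whisper
from jiwer import wer

# (audio path, reference transcript) pairs from your LibriSpeech test split.
pairs = [
    ("path/to/test-clean/sample1.flac", "reference transcript one"),
    ("path/to/test-clean/sample2.flac", "reference transcript two"),
]

model = whisper.load_model("small")
refs, hyps = [], []
for audio_path, reference in pairs:
    result = model.transcribe(audio_path, language="en")
    refs.append(reference.lower())
    hyps.append(result["text"].lower())

print(f"Whisper-small WER: {wer(refs, hyps):.3f}")
```

Running the same scoring on the fine-tuned SLAM outputs makes it easier to see whether the gap comes from decoding settings, text normalization, or the fine-tuning itself.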
I found one that supports both S2T and S2S simultaneously: https://github.com/MooreThreads/MooER
🚀 The feature, motivation and pitch
As we all know, GPT-4o is an end2end multi-modal model that supports both speech-to-text and speech-to-speech. I have some ideas about it:
Alternatives
No response
Additional context
No response