You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I am currently testing the audio transcription (ASR) and audio analysis capabilities of Baichuan-Omni-1.5, but I am experiencing subpar performance compared to the results reported in the Baichuan-Omni paper. Below are the specific issues I have encountered:
Audio Transcription (ASR):
While Baichuan-Omni-1.5 performs better than MiniCPM-Omni in terms of audio transcription (ASR), its audio analysis capabilities seem significantly weaker. The model frequently generates errors such as "Audio file not found" or "Please provide an audio file," even when valid audio input is provided.
Multi-turn Dialogue Issues:
In multi-turn dialogue scenarios, where the model is first tasked with transcribing audio (ASR) and then combining the audio and transcription results to answer follow-up questions, the model often fails to retain context. Specifically, it frequently outputs error messages like "Audio not found," despite the audio being successfully processed in earlier steps.
Lack of Reference Code for Multi-modal Dialogue:
Unlike the MiniCPM-Omni repository, which provides reference implementations for multi-turn dialogue involving audio, images, and videos, the Baichuan-Omni repository does not include similar examples. This makes it difficult to determine whether the issues stem from my implementation or limitations in the model itself.
Request:
Could you please provide reference code or examples for multi-turn dialogue scenarios involving audio, images, and videos? This would help users verify their implementations and ensure that the model's full capabilities are being utilized effectively. Additionally, any insights into resolving the "Audio not found" issue during multi-turn dialogues would be greatly appreciated.
Thank you for your attention to these matters!
Environment Details:
Model Version: Baichuan-Omni-1.5
Framework/Toolkit: [Specify if applicable]
Hardware: [Specify if applicable]
Looking forward to your response!
The text was updated successfully, but these errors were encountered:
I am currently testing the audio transcription (ASR) and audio analysis capabilities of Baichuan-Omni-1.5, but I am experiencing subpar performance compared to the results reported in the Baichuan-Omni paper. Below are the specific issues I have encountered:
Audio Transcription (ASR):
While Baichuan-Omni-1.5 performs better than MiniCPM-Omni in terms of audio transcription (ASR), its audio analysis capabilities seem significantly weaker. The model frequently generates errors such as "Audio file not found" or "Please provide an audio file," even when valid audio input is provided.
Multi-turn Dialogue Issues:
In multi-turn dialogue scenarios, where the model is first tasked with transcribing audio (ASR) and then combining the audio and transcription results to answer follow-up questions, the model often fails to retain context. Specifically, it frequently outputs error messages like "Audio not found," despite the audio being successfully processed in earlier steps.
Lack of Reference Code for Multi-modal Dialogue:
Unlike the MiniCPM-Omni repository, which provides reference implementations for multi-turn dialogue involving audio, images, and videos, the Baichuan-Omni repository does not include similar examples. This makes it difficult to determine whether the issues stem from my implementation or limitations in the model itself.
Request:
Could you please provide reference code or examples for multi-turn dialogue scenarios involving audio, images, and videos? This would help users verify their implementations and ensure that the model's full capabilities are being utilized effectively. Additionally, any insights into resolving the "Audio not found" issue during multi-turn dialogues would be greatly appreciated.
Thank you for your attention to these matters!
Environment Details:
Looking forward to your response!
The text was updated successfully, but these errors were encountered: