Issues with Audio Analysis and Multi-turn Dialogue in Baichuan-Omni-1.5 #4

Open
xiexiaoshinick opened this issue Feb 7, 2025 · 1 comment

Comments

@xiexiaoshinick

I am currently testing the audio transcription (ASR) and audio analysis capabilities of Baichuan-Omni-1.5, but I am experiencing subpar performance compared to the results reported in the Baichuan-Omni paper. Below are the specific issues I have encountered:

  1. Audio Transcription (ASR) vs. Audio Analysis:
    Baichuan-Omni-1.5 transcribes audio (ASR) better than MiniCPM-Omni, but its audio analysis capabilities seem significantly weaker. The model frequently responds with messages such as "Audio file not found" or "Please provide an audio file," even when valid audio input is provided.

  2. Multi-turn Dialogue Issues:
    In multi-turn dialogue scenarios, where the model is first tasked with transcribing audio (ASR) and then combining the audio and transcription results to answer follow-up questions, the model often fails to retain context. Specifically, it frequently outputs error messages like "Audio not found," despite the audio being successfully processed in earlier steps.

  3. Lack of Reference Code for Multi-modal Dialogue:
    Unlike the MiniCPM-Omni repository, which provides reference implementations for multi-turn dialogue involving audio, images, and videos, the Baichuan-Omni repository does not include similar examples. This makes it difficult to determine whether the issues stem from my implementation or limitations in the model itself.
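
To make the multi-turn flow in item 2 concrete, here is a minimal sketch of how the conversation history could be tracked on the client side. It assumes a generic chat-style message list; `Turn`, `Conversation`, and the `audio` key are hypothetical names for illustration only and are not the Baichuan-Omni API. The point is only to show the audio reference from the first turn staying in the history that is sent with the follow-up question.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List, Optional


@dataclass
class Turn:
    role: str                         # "user" or "assistant"
    text: str
    audio_path: Optional[str] = None  # hypothetical field: audio attached to this turn


@dataclass
class Conversation:
    turns: List[Turn] = field(default_factory=list)

    def add_user(self, text: str, audio_path: Optional[str] = None) -> None:
        self.turns.append(Turn("user", text, audio_path))

    def add_assistant(self, text: str) -> None:
        self.turns.append(Turn("assistant", text))

    def missing_audio(self) -> List[str]:
        """Return audio paths referenced in the history that no longer exist on disk."""
        return [
            t.audio_path
            for t in self.turns
            if t.audio_path and not Path(t.audio_path).is_file()
        ]

    def to_messages(self) -> List[Dict[str, str]]:
        """Flatten the full history, keeping the audio reference on the turn that carried it."""
        messages = []
        for t in self.turns:
            msg = {"role": t.role, "content": t.text}
            if t.audio_path:
                msg["audio"] = t.audio_path  # assumed key; adapt to the real input format
            messages.append(msg)
        return messages
```

Checking `missing_audio()` before each follow-up turn, and resending the output of `to_messages()` rather than only the latest question, would help separate a client-side bug (the audio was dropped from the history) from a limitation in the model itself.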

Request:
Could you please provide reference code or examples for multi-turn dialogue scenarios involving audio, images, and videos? This would help users verify their implementations and ensure that the model's full capabilities are being utilized effectively. Additionally, any insights into resolving the "Audio not found" issue during multi-turn dialogues would be greatly appreciated.

Thank you for your attention to these matters!

Environment Details:

  • Model Version: Baichuan-Omni-1.5
  • Framework/Toolkit: [Specify if applicable]
  • Hardware: [Specify if applicable]

Looking forward to your response!

@HaoZeSun2016
Collaborator

Could you provide the full prompt and audio files?
