Support audio in multimodal messages #370

Open
rachwalk opened this issue Jan 16, 2025 · 5 comments
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed

Comments

@rachwalk
Contributor

Is your feature request related to a problem? Please describe.

LLM APIs have started supporting audio input, so it would be beneficial for RAIMultimodalMessages to support audio as well.

Describe the solution you'd like
The MultimodalMessage class should support audio input. It currently blocks audio with this check:

if self.audios not in [None, []]:

Describe alternatives you've considered

This is the only suitable solution within the current architecture.

Additional context

@rachwalk rachwalk added enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed labels Jan 16, 2025
@mdimado

mdimado commented Jan 17, 2025

From the issue, I understand that the changes are to be made in messages/multimodal.py,
and that the changes are:

  1. delete the if self.audios not in [None, []]: check that was blocking audio support
  2. add support for base64 encoded audio files in the __init__ method
  3. create audio content entries similar to how images are handled, using an appropriate MIME type for audio (e.g. "audio/wav")
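
A minimal sketch of steps 2 and 3, for illustration only. The function names and the shape of the content entry (an "audio_url" dictionary carrying a data URI) are assumptions modeled on the way image entries are commonly built; they are not taken from the RAI codebase, and individual vendors may expect a different layout.

```python
import base64


def encode_audio(raw: bytes) -> str:
    """Return the base64 text form of raw audio bytes."""
    return base64.b64encode(raw).decode("utf-8")


def build_audio_entries(audios, mime_type="audio/wav"):
    """Build one content entry per base64-encoded audio clip.

    The entry layout is hypothetical: it mirrors a typical image entry
    (a data URI with a MIME type) and is NOT a confirmed vendor format.
    """
    entries = []
    for audio_b64 in audios or []:  # treats None and [] as "no audio"
        entries.append(
            {
                "type": "audio_url",  # hypothetical entry type
                "audio_url": {"url": f"data:{mime_type};base64,{audio_b64}"},
            }
        )
    return entries
```

With this shape, a message with no audio yields an empty list, so the entries can simply be appended to the existing content list instead of raising on non-empty audio.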

Should I create a pull request with these changes?

Please assign this issue to me. I'll work on it and create a PR.
If I'm missing something, please let me know.

@maciejmajek
Member

maciejmajek commented Jan 17, 2025

Hi @mdimado, yes, please feel free to create a PR for this task! A fully completed implementation should include:

  1. A preprocess_audio function, similar to preprocess_image, to handle conversion of various audio formats (e.g., .mp3, .wav, np.array with sampling rate) into a standard format accepted by multimodal vendors.
  2. Validation to ensure the model can process and understand the provided audio content (e.g., compatibility with gpt-4o-audio-preview).
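
The two requirements above could be sketched roughly as follows. This is a standard-library-only illustration, not the project's implementation: the signature of preprocess_audio, the choice of 16-bit mono WAV as the "standard format", the AUDIO_CAPABLE_MODELS set, and validate_audio_support are all assumptions. Decoding .mp3 needs an external decoder and is deliberately left out; a 1-D np.array of floats can be passed as the samples because it is iterated element by element.

```python
import base64
import io
import struct
import wave
from pathlib import Path


def samples_to_wav_bytes(samples, sample_rate: int) -> bytes:
    """Encode mono float samples in [-1.0, 1.0] as 16-bit PCM WAV bytes."""
    pcm = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))
        pcm += struct.pack("<h", int(clipped * 32767))  # little-endian int16
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(pcm))
    return buffer.getvalue()


def preprocess_audio(audio) -> str:
    """Convert supported inputs to a base64-encoded WAV string.

    Accepted inputs (an assumed contract, for illustration):
      - str or Path: path to an existing .wav file, read as-is
      - bytes: raw WAV file contents, passed through
      - (samples, sample_rate): iterable of floats plus a sampling rate
    """
    if isinstance(audio, (str, Path)):
        wav_bytes = Path(audio).read_bytes()
    elif isinstance(audio, (bytes, bytearray)):
        wav_bytes = bytes(audio)
    elif isinstance(audio, tuple) and len(audio) == 2:
        samples, sample_rate = audio
        wav_bytes = samples_to_wav_bytes(samples, int(sample_rate))
    else:
        raise TypeError(f"Unsupported audio input: {type(audio).__name__}")
    return base64.b64encode(wav_bytes).decode("utf-8")


# Hypothetical allow-list; a real check would depend on the vendor and model.
AUDIO_CAPABLE_MODELS = {"gpt-4o-audio-preview"}


def validate_audio_support(model_name: str) -> None:
    """Raise ValueError if the model is not known to accept audio input."""
    if model_name not in AUDIO_CAPABLE_MODELS:
        raise ValueError(f"Model {model_name!r} is not known to accept audio")
```

A call such as preprocess_audio((samples, 16000)) returns a string that can be stored alongside the base64 images, and validate_audio_support can run before the message is sent so that unsupported models fail early with a clear error.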

Let me know if you need any further clarification or assistance (here and/or on Discord).

@mdimado

mdimado commented Jan 17, 2025

Thanks for the clarification and additional details. After reviewing the task, I realize that implementing the preprocess_audio function and handling the validation would need more learning on my part. To ensure timely and high-quality work, I think someone with more expertise could handle this better. Apologies for the inconvenience; I kindly request to be unassigned for now.

@maciejmajek
Member

Hey @mdimado, no worries at all! We're all here to learn and grow together—that's what makes this such a great environment. 😊 Feel free to tackle any part of the work you're comfortable with, and don't hesitate to ask for guidance along the way. We’re always happy to help and support you through the process. Looking forward to it! 🚀

@rachwalk
Contributor Author

@mdimado I have created sub-issues based on your task description: #373. Feel free to comment under it so I can assign you.
