Support for non-text modalities (images, speech, video) #316
For model selection, we could use an LLM to determine the modality from the user prompt and then retrieve an appropriate dataset and model. Dataset generation would entail another model-retriever module that selects a generative model for the modality of interest, but only if that improves performance; otherwise, only dataset retrieval would be used. For evaluating non-text output, we could retrieve an appropriate evaluation metric from the Hugging Face `evaluate` library.
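
A minimal sketch of the metric-retrieval part of this idea, assuming the Hugging Face `evaluate` library. The modality detector below is a keyword placeholder standing in for the LLM call described above, and the modality-to-metric mapping is purely illustrative:

```python
# Illustrative sketch only: map a detected task modality to an evaluation
# metric loaded from the Hugging Face `evaluate` library.
import evaluate

# Assumed mapping; the metric names are standard `evaluate` metrics.
MODALITY_TO_METRIC = {
    "speech-to-text": "wer",            # word error rate for ASR
    "image-classification": "accuracy",
    "image-segmentation": "mean_iou",
    "text-to-text": "bleu",
}


def detect_modality(prompt: str) -> str:
    """Placeholder for the LLM-based modality classifier described above."""
    prompt = prompt.lower()
    if "transcribe" in prompt or "speech" in prompt:
        return "speech-to-text"
    if "segment" in prompt:
        return "image-segmentation"
    if "image" in prompt or "photo" in prompt:
        return "image-classification"
    return "text-to-text"


def load_metric_for_prompt(prompt: str):
    """Pick an evaluation metric based on the modality inferred from the prompt."""
    modality = detect_modality(prompt)
    return modality, evaluate.load(MODALITY_TO_METRIC[modality])


modality, metric = load_metric_for_prompt("Transcribe this speech recording into text")
print(modality)          # speech-to-text
metric.add_batch(predictions=["hello world"], references=["hello word"])
print(metric.compute())  # WER between predictions and references
```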
Cool. Some HCI faculty at Tsinghua have also talked with me about a multi-modality Prompt2Model.
For other modalities (e.g. visual QA, video anomaly detection, image generation, speech-to-text, text-to-speech, etc.), it would be nice to start by simply proposing existing datasets and/or models, since prompt2model is advertised as a better way to retrieve datasets/models than search engines and manual human searching.
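
As a rough sketch of what "just propose existing datasets and/or models" could look like, the snippet below queries the Hugging Face Hub for the most-downloaded models of a given task and for datasets matching a search string. The task string, search query, and download-based ranking are assumptions, not prompt2model's actual retriever logic:

```python
# Rough sketch, not prompt2model's retriever: suggest existing Hub models and
# datasets for a non-text task by querying the Hugging Face Hub API.
from huggingface_hub import HfApi

api = HfApi()


def suggest_candidates(task: str, query: str, limit: int = 5):
    """Return the most-downloaded models for `task` and datasets matching `query`."""
    models = api.list_models(task=task, sort="downloads", direction=-1, limit=limit)
    datasets = api.list_datasets(search=query, sort="downloads", direction=-1, limit=limit)
    return [m.id for m in models], [d.id for d in datasets]


# Example: a speech-to-text request.
model_ids, dataset_ids = suggest_candidates(
    task="automatic-speech-recognition", query="speech recognition"
)
print("Candidate models:", model_ids)
print("Candidate datasets:", dataset_ids)
```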
Currently prompt2model is limited to text-input, text-output tasks. The underlying framework can certainly handle different modalities, and it would be great to see prompt2model handle other types of tasks as well (such as image classification/generation, speech tasks, etc.).
But we'll probably need to think through several things first.
We can start discussing the necessary steps on this issue and implement the necessary pieces bit-by-bit. We'd be happy for contributions!