A Multimodal Retrieval-Augmented Generation (RAG) application built with Gradio for the interface, LanceDB as the vector database, and the mm-rag library, which provides BridgeTower for embeddings, LanceMultimodal for the project-specific database use case, and an LVLM for vision and natural-language interaction.
Inspired by Intel Labs' resources on "Multimodal RAG: Chat with Videos" by Vasudev Lal.
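
The retrieve-then-generate flow described above can be sketched in plain Python. This is only an illustration under stated assumptions, not this project's code: `toy_embed` is a hypothetical stand-in for the BridgeTower embedder, the in-memory list stands in for a LanceDB table, the frame paths are made up, and `build_prompt` only assembles the text that would be handed to an LVLM together with the retrieved frame.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    caption: str          # transcript or caption text paired with a frame
    frame_path: str       # hypothetical path to the extracted video frame
    vector: List[float]   # embedding stored alongside the record


def toy_embed(text: str) -> List[float]:
    """Hypothetical stand-in embedder: counts over a tiny fixed vocabulary.

    A real system would use a joint image-text model here instead.
    """
    vocab = ["cat", "dog", "car", "tree"]
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, records: List[Record], k: int = 1) -> List[Record]:
    """Return the k records whose vectors are most similar to the query."""
    q = toy_embed(query)
    ranked = sorted(records, key=lambda r: cosine(q, r.vector), reverse=True)
    return ranked[:k]


def build_prompt(query: str, hits: List[Record]) -> str:
    """Assemble the text context that would accompany the retrieved frames."""
    context = "\n".join(f"[{h.frame_path}] {h.caption}" for h in hits)
    return f"Context:\n{context}\n\nQuestion: {query}"


# Stand-in for the vector database: a list of pre-embedded records.
captions = [
    ("a cat sits in a tree", "frames/0001.jpg"),
    ("a car drives past", "frames/0002.jpg"),
]
store = [Record(c, p, toy_embed(c)) for c, p in captions]

query = "where is the cat"
hits = retrieve(query, store, k=1)
print(build_prompt(query, hits))
```

In the real pipeline, the nearest-neighbour search would be delegated to the vector database and the prompt plus the retrieved frame would be passed to the vision-language model; the sketch only shows how the pieces relate.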