Multimodal discussion QQ group: 237976286
- 2025.01 vikhyatk/moondream2
- 2025.01 VideoRAG: Retrieval-Augmented Generation over Video Corpus
- 2025.01 MiniCPM-o (upgraded version of MiniCPM-V)
- 2025.01 Imagine while Reasoning in Space: Multimodal Visualization-of-Thought
- 2025.01 Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
- 2025.01 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining (multimodal dataset open-sourced by DAMO Academy, built from 22,000 hours of instructional videos)
- 2024.12 MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
- 2024.12 Apollo: An Exploration of Video Understanding in Large Multimodal Models (a Video-LLM from Meta)
- 2024.12 DeepSeek-VL2
- 2024.12 FastVLM: Efficient Vision Encoding for Vision Language Models
- 2024.12 POINTS1.5: Building a Vision-Language Model towards Real World Applications (from WeChat)
- 2024.12 InternVL 2.5 (available in sizes from 1B to 78B)
- 2024.12 Qwen2-VL-72B Model
- 2024.12 Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
- 2024.12 NVILA: Efficient Frontier Visual Language Models (from NVIDIA; a VLM optimized for both efficiency and accuracy)
- 2024.12 PaliGemma 2: A Family of Versatile VLMs for Transfer
- 2024.11 Multimodal Autoregressive Pre-training of Large Vision Encoders (Apple proposes a new training method for vision encoders, with multimodal support)
- 2024.11 Pixtral Large (Mistral's 124B multimodal model)
- 2024.11 OmniVision-968M: World's Smallest Vision Language Model
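- 2024.11 LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation (from Microsoft; replaces CLIP's text encoder with an LLM, supporting longer context and more complex text, with better top-k retrieval)
- 2024.11 HourVideo: 1-Hour Video-Language Understanding (long-video understanding benchmark from Fei-Fei Li's team)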
- 2024.11 Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models
- 2024.11 MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs (NVIDIA proposes universal multimodal retrieval based on MLLMs)
- 2024.11 Attacking Vision-Language Computer Agents via Pop-ups
- 2024.11 Know Where You're Uncertain When Planning with Multimodal Foundation Models: A Formal Framework (improves how multimodal foundation models handle uncertainty, making robot task planning more reliable)
- 2024.10 Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities
- 2024.10 LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding (long-video understanding method from Meta)
- 2024.10 Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data (BAAI open-sources 40 million multimodal instruction samples)
- 2024.10 Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance
- 2024.10 VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation (unified understanding and generation model from the VILA team)
- 2024.10 Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation (DeepSeek's first unified multimodal understanding and generation model)
- 2024.10 ARIA: An Open Multimodal Native Mixture-of-Experts Model (3.9B model; claims to outperform Pixtral-12B and Llama3.2-11B)
- 2024.10 Baichuan-Omni Technical Report (Baichuan's first 7B multimodal model)
- 2024.10 Pixtral 12B (from Mistral)
- 2024.10 Movie Gen: A Cast of Media Foundation Models (from Meta)
- 2024.10 LEOPARD: A Vision Language Model for Text-Rich Multi-Image Tasks
- 2024.10 Video Instruction Tuning with Synthetic Data (open-source video instruction data from a collaboration between LLaVA and ByteDance)
- 2024.09 MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning (upgraded version of Apple's MM1)
- 2024.09 Emu3: Next-Token Prediction is All You Need (from BAAI)
- 2024.09 Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models (from the Allen Institute for AI; releases both models and data)
- 2024.09 MIO: A Foundation Model on Multimodal Tokens
- 2024.09 Phantom of Latent for Large Language and Vision Models
- 2024.09 Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution
- 2024.09 Llama 3.2: Revolutionizing edge AI and vision with open, customizable models
- 2024.09 NVLM: Open Frontier-Class Multimodal LLMs (from NVIDIA)
- 2024.09 Viper: Open Mamba-based Vision-Language Models (first Mamba-based VLM series)
- 2024.09 MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts
- 2024.09 General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
- 2024.09 VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation
- 2024.08 Law of Vision Representation in MLLMs (proposes the AC score metric; the higher the AC score, the better the visual representation)
- 2024.08 CogVLM2: Visual Language Models for Image and Video Understanding
- 2024.08 EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
- 2024.08 A Practitioner's Guide to Continual Multimodal Pretraining
- 2024.08 Building and better understanding vision-language models: insights and future directions
- 2024.08 LongVILA: Scaling Long-Context Visual Language Models for Long Videos
- 2024.08 UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling
- 2024.08 xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
- 2024.08 LLaVA-OneVision: Easy Visual Task Transfer (culmination of the LLaVA-NeXT series)
- 2024.08 MiniCPM-V: A GPT-4V Level MLLM on Your Phone (a small but very capable MLLM)
- 2024.08 SAM 2: Segment Anything in Images and Videos
- 2021.02 Learning Transferable Visual Models From Natural Language Supervision (CLIP)
- 2022.04 Flamingo: a Visual Language Model for Few-Shot Learning (from DeepMind; an MLLM pioneer)
- 2023.01 BLIP-2 (introduces the Q-Former)
- 2023.03 Sigmoid Loss for Language Image Pre-Training (a variant of and alternative to CLIP, trained with a sigmoid loss)
- 2023.04 MiniGPT-4 (attracted a lot of attention)
- 2023.04 Visual Instruction Tuning (first paper in the LLaVA series)
- 2023.05 InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
- 2023.05 Segment Anything (SAM)
- 2023.12 Gemini: A Family of Highly Capable Multimodal Models
- 2024.01 Agent AI: Surveying the Horizons of Multimodal Interaction (from Fei-Fei Li's team)
- 2024.04 MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training (from Apple)
- 2024.05 An Introduction to Vision-Language Modeling (from Meta; short and to the point)
- 2024.05 DeepSeek-VL: Towards Real-World Vision-Language Understanding
- 2024.06 Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs (proposes the vision-centric benchmark CV-Bench, experimentally studies how various factors affect VLM performance, and trains the Cambrian-1 model)
- 2024.09 Llama 3.2: Revolutionizing edge AI and vision with open, customizable models
- 2024.09 Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution
- 2024.09 Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models (from the Allen Institute for AI; releases both models and data)