No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance |
|
Sec. 1, Sec. 3.1, Sec. 3.1.1, Sec. 3.1.3, Sec. 3.2, Sec. 3.2.4, Sec. 6.2, Sec. 8.2.1, Table 2 |
What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning |
|
Sec. 5.1 |
Med-MMHL: A Multi-Modal Dataset for Detecting Human- and LLM-Generated Misinformation in the Medical Domain |
|
Sec. 4.3.1 |
Probing Heterogeneous Pretraining Datasets with Small Curated Datasets |
|
|
ChartLlama: A Multimodal LLM for Chart Understanding and Generation |
|
Sec. 5.1, Sec. 6.3, Sec. 6.4 |
VideoChat: Chat-Centric Video Understanding |
|
Sec. 5.1, Sec. 5.2 |
Aligned with LLM: a new multi-modal training paradigm for encoding fMRI activity in visual cortex |
|
Sec. 5.2 |
3DMIT: 3D Multi-modal Instruction Tuning for Scene Understanding |
|
Sec. 5.1 |
GPT4MTS: Prompt-based Large Language Model for Multimodal Time-series Forecasting |
|
Sec. 3.1.1, Sec. 5.2 |
Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation |
|
Sec. 3.1.1 |
Audio Retrieval with WavText5K and CLAP Training |
|
Sec. 3.1.1, Sec. 3.1.3, Sec. 4.4.3 |
The Devil is in the Details: A Deep Dive into the Rabbit Hole of Data Filtering |
|
Sec. 3.2.1, Sec. 5.3, Sec. 8.3.3 |
| Demystifying CLIP Data | Sec. 3.2.2 |
| Learning Transferable Visual Models From Natural Language Supervision | Sec. 2.1, Sec. 3.1.1, Sec. 3.2.2 |
| DataComp: In search of the next generation of multimodal datasets | Sec. 1, Sec. 3.1.1, Sec. 3.1.3, Sec. 3.2, Sec. 3.2.1, Sec. 3.2.4, Sec. 4.4.1, Sec. 5.3, Sec. 8.1, Sec. 8.3.3, Table 2 |
| Beyond neural scaling laws: beating power law scaling via data pruning | Sec. 3.2.1 |
| Flamingo: a visual language model for few-shot learning | Sec. 3.1.3, Sec. 3.2.2 |
| Quality not quantity: On the interaction between dataset design and robustness of CLIP | Sec. 3.2.2 |
| VBench: Comprehensive Benchmark Suite for Video Generative Models | Sec. 4.4.2 |
| EvalCrafter: Benchmarking and Evaluating Large Video Generation Models | Sec. 4.4.2 |
| Training Compute-Optimal Large Language Models | Sec. 3.1 |
| NExT-GPT: Any-to-Any Multimodal LLM | Sec. 1, Sec. 2.1, Sec. 3.1.1 |
| ChartThinker: A Contextual Chain-of-Thought Approach to Optimized Chart Summarization | Sec. 3.1.1, Sec. 3.2.4 |
| ChartReformer: Natural Language-Driven Chart Image Editing | Sec. 3.1.1, Sec. 6.4 |
| GroundingGPT: Language Enhanced Multi-modal Grounding Model | Sec. 4.1.2 |
| Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic | Sec. 4.1.1 |
| Kosmos-2: Grounding Multimodal Large Language Models to the World | Sec. 4.1.1 |
| Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters | Sec. 3.2.1, Sec. 5.1, Sec. 5.3, Sec. 8.3.3 |
| Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training | Sec. 3.2.1, Sec. 8.3.3 |
| Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation | Sec. 3.1.1, Sec. 3.1.3, Sec. 4.1.3, Sec. 5.1, Sec. 5.4, Sec. 8.2.3, Sec. 8.3.3, Sec. 8.3.4 |
| 3DBench: A Scalable 3D Benchmark and Instruction-Tuning Dataset | Sec. 4.4.1 |
| Structured Packing in LLM Training Improves Long Context Utilization | Sec. 3.2.3 |
| Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models | Sec. 3.2.3 |
| MoDE: CLIP Data Experts via Clustering | Sec. 3.2.3 |
| Efficient Multimodal Learning from Data-centric Perspective | Sec. 1, Sec. 2.1, Sec. 3.2.1 |
| Improved Baselines for Data-efficient Perceptual Augmentation of LLMs | Sec. 3.1.2 |
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | Sec. 4.4.1 |
| SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | Sec. 4.4.1 |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Sec. 3.1.1 |
| Perception Test: A Diagnostic Benchmark for Multimodal Video Models | Sec. 4.4.2 |
| FunQA: Towards Surprising Video Comprehension | Sec. 4.2.1, Sec. 4.4.4 |
| OneChart: Purify the Chart Structural Extraction via One Auxiliary Token | Sec. 4.4.1, Sec. 5.1, Sec. 6.3 |
| ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning | Sec. 4.4.4, Sec. 6.3 |
| StructChart: Perception, Structuring, Reasoning for Visual Chart Understanding | Sec. 3.1.1, Sec. 4.2.1, Sec. 6.3 |
| MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning | Sec. 3.1.1, Sec. 4.4.1 |
| ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning | Sec. 3.1.3, Sec. 4.4.4, Sec. 5.1, Sec. 6.3 |
| WorldGPT: Empowering LLM as Multimodal World Model | Sec. 4.4.2 |
| List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs | Sec. 3.1.1, Sec. 3.2.2, Sec. 4.1.2 |
| TextSquare: Scaling up Text-Centric Visual Instruction Tuning | Sec. 3.1.1, Sec. 5.1, Sec. 5.3, Sec. 5.4, Sec. 8.3.3, Table 2 |
| ImplicitAVE: An Open-Source Dataset and Multimodal LLMs Benchmark for Implicit Attribute Value Extraction | Sec. 3.1.1, Sec. 4.4.1 |
| How Does the Textual Information Affect the Retrieval of Multimodal In-Context Learning? | Sec. 6.1 |
| Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want | Sec. 4.1.1 |
| Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution | Sec. 3.2.3 |
| Fewer Truncations Improve Language Modeling | Sec. 3.2.3 |
| MedThink: Explaining Medical Visual Question Answering via Multimodal Decision-Making Rationale | Sec. 4.2.2, Sec. 5.2 |
| AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics Perception | Sec. 5.2 |
| UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark | Sec. 4.4.1, Sec. 5.1 |
| Improving Composed Image Retrieval via Contrastive Learning with Scaling Positives and Negatives | Sec. 3.1.2, Sec. 5.1 |
| Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation | Sec. 4.1.1, Sec. 4.3.1, Sec. 5.4 |
| TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models | Sec. 3.1.1 |
| The Wolf Within: Covert Injection of Malice into MLLM Societies via an MLLM Operative | Sec. 4.3.1 |
| BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs | Sec. 3.1.1, Sec. 5.2 |
| MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria | Sec. 4.1.3, Sec. 4.4.2, Sec. 5.4, Sec. 8.2.3, Table 2 |
| MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models | Sec. 4.3.1, Sec. 4.4.2 |
| Retrieval-augmented Multi-modal Chain-of-Thoughts Reasoning for Large Language Models | Sec. 3.1.3, Sec. 4.1.2, Sec. 4.2.2 |
| M3DBench: Let’s Instruct Large Models with Multi-modal 3D Prompts | Sec. 4.4.1 |
| MoqaGPT: Zero-Shot Multi-modal Open-domain Question Answering with Large Language Model | Sec. 5.2 |
| mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | Sec. 3.1.2, Sec. 6.3 |
| mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding | Sec. 6.3 |
| mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | Sec. 3.1.2 |
| mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model | Sec. 6.3 |
| Open-TransMind: A New Baseline and Benchmark for 1st Foundation Model Challenge of Intelligent Transportation | Sec. 4.4.1, Sec. 4.4.3 |
| On the Adversarial Robustness of Multi-Modal Foundation Models | Sec. 4.3.1 |
| What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models | Sec. 4.2.1, Sec. 5.1, Sec. 5.3 |
| ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | Sec. 3.1.1 |
| PaLM-E: An Embodied Multimodal Language Model | Sec. 3.1.3 |
| Multimodal Data Curation via Object Detection and Filter Ensembles | Sec. 3.2.1, Sec. 3.2.4, Sec. 8.3.3 |
| Sieve: Multimodal Dataset Pruning Using Image Captioning Models | Sec. 3.2.1, Sec. 3.2.4, Sec. 8.3.1, Sec. 8.3.3 |
| Towards a statistical theory of data selection under weak supervision | Sec. 3.2.1, Sec. 5.3 |
| D2 Pruning: Message Passing for Balancing Diversity & Difficulty in Data Pruning | Sec. 3.3 |
| UIClip: A Data-driven Model for Assessing User Interface Design | Sec. 3.1.1 |
| CapsFusion: Rethinking Image-Text Data at Scale | Sec. 3.1.2 |
| Improving CLIP Training with Language Rewrites | Sec. 1, Sec. 3.1.2, Sec. 5.2 |
| OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation | Sec. 4.4.2 |
| A Decade's Battle on Dataset Bias: Are We There Yet? | Sec. 3.2.2 |
| Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets | Sec. 3.2.4 |
| Data Filtering Networks | Sec. 3.2.1, Sec. 3.2.4, Sec. 8.3.3 |
| T-MARS: Improving Visual Representations by Circumventing Text Feature Learning | Sec. 3.2.1, Sec. 3.2.4, Sec. 8.3.3 |
| InstructionGPT-4: A 200-Instruction Paradigm for Fine-Tuning MiniGPT-4 | Sec. 3.2.1 |
| Align and Attend: Multimodal Summarization with Dual Contrastive Losses | Table 2 |
| MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? | Table 2 |
| Text-centric Alignment for Multi-Modality Learning | Sec. 3.2.4 |
| Noisy Correspondence Learning with Meta Similarity Correction | Sec. 3.2.4 |
| Grounding-Prompter: Prompting LLM with Multimodal Information for Temporal Sentence Grounding in Long Videos | Sec. 4.2.2 |
| Language-Image Models with 3D Understanding | Sec. 4.2.2 |
| Scaling Laws for Generative Mixed-Modal Language Models | Sec. 1 |
| BLINK: Multimodal Large Language Models Can See but Not Perceive | Sec. 4.4.1, Table 2 |
| Visual Hallucinations of Multi-modal Large Language Models | Sec. 4.4.2, Sec. 5.3 |
| DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models | Sec. 4.2.2 |
| EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | Sec. 3.1.1, Sec. 4.2.2, Sec. 5.1 |
| Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | Sec. 3.1.1, Sec. 4.2.2, Table 2 |
| Visual Instruction Tuning | Sec. 3.1.1 |
| ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model | Sec. 2.1, Sec. 3.1.1, Sec. 3.2.4, Sec. 4.1, Sec. 4.1.1, Sec. 4.1.3, Sec. 8.3.1, Table 2 |
| Time-LLM: Time Series Forecasting by Reprogramming Large Language Models | Sec. 4.1.1 |
| On the De-duplication of LAION-2B | Sec. 3.2.1 |
| Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding | Sec. 3.1.1, Sec. 3.2.2 |
| LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | Sec. 4.1.3, Sec. 4.4.1, Table 2 |
| LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition | Sec. 6.2 |
| Data Augmentation for Text-based Person Retrieval Using Large Language Models | Sec. 3.1.2, Sec. 5.2 |
| Aligning Actions and Walking to LLM-Generated Textual Descriptions | Sec. 3.1.2, Sec. 5.2 |
| GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | Sec. 3.1.2 |
| SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models | Sec. 3.1.3 |
| AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability | Sec. 3.2.4 |
| AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling | Sec. 5.1 |
| Probing Multimodal LLMs as World Models for Driving | Sec. 3.1.1, Sec. 4.4.4 |
| Unified Hallucination Detection for Multimodal Large Language Models | Sec. 4.4.2, Sec. 5.2, Sec. 6.2, Table 2 |
| SemDeDup: Data-efficient learning at web-scale through semantic deduplication | Sec. 3.2.1, Sec. 3.2.4, Sec. 8.3.3 |
| Automated Multi-level Preference for MLLMs | Sec. 4.1.3 |
| Silkie: Preference distillation for large visual language models | Sec. 4.1.3 |
| Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | Sec. 4.1.3, Table 2 |
| M3IT: A large-scale dataset towards multi-modal multilingual instruction tuning | Table 2 |
| Aligning Large Multimodal Models with Factually Augmented RLHF | Sec. 4.1.3 |
| DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback | Sec. 4.1.3 |
| RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | Sec. 4.1.3 |
| MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark | Sec. 4.4.2, Sec. 5.4, Sec. 8.3.3, Sec. 8.3.4, Table 2 |
| MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI | Sec. 4.4.3, Table 2 |
| M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought | Sec. 4.4.4, Table 2 |
| ImgTrojan: Jailbreaking Vision-Language Models with ONE Image | Sec. 4.3.1, Sec. 5.4 |
| VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models | Sec. 4.3.1 |
| Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts | Sec. 4.3.1 |
| Improving Multimodal Datasets with Image Captioning | Sec. 3.2.1, Sec. 3.2.4, Sec. 8.2.2, Sec. 8.3.3 |
| Bridging Research and Readers: A Multi-Modal Automated Academic Papers Interpretation System | Sec. 6.3 |
| PDFChatAnnotator: A Human-LLM Collaborative Multi-Modal Data Annotation Tool for PDF-Format Catalogs | Sec. 5.2, Sec. 6.2 |
| CiT: Curation in Training for Effective Vision-Language Data | Sec. 2.1, Sec. 8.3.3 |
| InstructPix2Pix: Learning to Follow Image Editing Instructions | Sec. 5.1 |
| Automated Data Visualization from Natural Language via Large Language Models: An Exploratory Study | Sec. 6.4 |
| ModelGo: A Practical Tool for Machine Learning License Analysis | Sec. 4.3.2, Sec. 8.2.1 |
| Scaling Laws of Synthetic Images for Model Training ... for Now | Sec. 4.1.1 |
| Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs | Sec. 3.1.3 |
| Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V | Sec. 4.1.1 |
| Segment Anything | Sec. 1, Sec. 8.3.1 |
| AIM: Let Any Multi-modal Large Language Models Embrace Efficient In-Context Learning | Sec. 4.1.2 |
| MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | Sec. 4.1.2 |
| All in an Aggregated Image for In-Image Learning | Sec. 4.1.2 |
| Panda-70M: Captioning 70M videos with multiple cross-modality teachers | Table 2 |
| Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved With Text | Table 2 |
| ChartAssistant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning | Table 2 |