Models | MLLM Architecture | GitHub Stars | Hugging Face Downloads |
---|---|---|---|
LLaVA-v1.5-13B | Pretrained Vision Encoder + Projector + LLM | 15.4K | 333.7K |
LVIS-Instruct4V-LLaVA-7B | Pretrained Vision Encoder + Projector + LLM | 122 | 5 |
MiniGPT-v2 | Pretrained Vision Encoder + Projector + LLM | 24.7K | / |
LLaVA-v1.5-7B | Pretrained Vision Encoder + Projector + LLM | 15.4K | 703K |
LLaVA-v1.6-Vicuna-7B | Pretrained Vision Encoder + Projector + LLM | 15.4K | 1.2M |
LLaVA-v1.6-Vicuna-13B | Pretrained Vision Encoder + Projector + LLM | 15.4K | 100.1K |
LLaVA-v1.6-34B | Pretrained Vision Encoder + Projector + LLM | 15.4K | 592.8K |
Yi-VL-6B | Pretrained Vision Encoder + Projector + LLM | 7K | 17.2K |
ALLaVA | Pretrained Vision Encoder + Projector + LLM | 134 | 93 |
Kosmos-2 | Pretrained Vision Encoder + Grounded LLM | 18.1K | 29.2K |
LWM | Pretrained Vision Encoder + Projector + Long-Context LLM | 6.6K | / |
BLIP-2-Flan-T5-XL | Query tokens + LLM | 8.5K | 35.4K |
Qwen-VL-Chat | Query tokens + LLM | 3.4K | 289.9K |
InstructBLIP-Vicuna-13B | Query tokens + LLM | 8.5K | 5.4K |
mPLUG-Owl2 | Query tokens + LLM with Modality-Adaptive Module | 1.9K | 9.7K |
Cheetor | Query tokens + VPG-C + LLM | 308 | / |
Fuyu-8B | Linear Projection of Image Patches + LLM (no separate vision encoder) | / | 17.9K |
SEED-LLaMA | VQ-based Vision Encoder + LLM | 445 | / |
OpenFlamingo | Perceiver Resampler + LLM with Gated Cross-Attention Layers | 3.4K | / |
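
The two wirings that dominate the table are "Pretrained Vision Encoder + Projector + LLM" (the LLaVA family) and "Query tokens + LLM" (BLIP-2, InstructBLIP, Qwen-VL-Chat, mPLUG-Owl2). The sketch below shows both in a minimal, runnable form: the projector dimensions and two-layer MLP follow LLaVA-v1.5's published design, but the encoder and LLM are stubbed with random tensors, and the single cross-attention layer stands in for a full Q-Former, so everything beyond those published details is illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative dimensions: CLIP ViT-L/14-336 yields 576 patch features of
# width 1024; a Vicuna-7B-class LLM has a 4096-d embedding space.
vision_dim, llm_dim, num_patches, text_len = 1024, 4096, 576, 32

# 1) Frozen pretrained vision encoder output, stubbed as random patch features.
patch_feats = torch.randn(1, num_patches, vision_dim)

# 2) Projector: LLaVA-v1.5 uses a two-layer MLP to map patch features into
#    the LLM's token-embedding space.
projector = nn.Sequential(
    nn.Linear(vision_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)
visual_tokens = projector(patch_feats)                      # (1, 576, 4096)

# 3) Visual tokens are concatenated with the text embeddings, and the
#    decoder-only LLM runs over the combined sequence unchanged.
text_embeds = torch.randn(1, text_len, llm_dim)
llm_inputs = torch.cat([visual_tokens, text_embeds], dim=1)
print(llm_inputs.shape)                                     # torch.Size([1, 608, 4096])

# By contrast, "Query tokens + LLM" architectures first compress the patch
# features into a small, fixed set of learned queries via cross-attention
# (e.g., 32 queries in BLIP-2), so the LLM sees far fewer visual tokens than
# one per patch. A single cross-attention layer stands in for the Q-Former here.
num_queries = 32
queries = torch.randn(1, num_queries, vision_dim)
cross_attn = nn.MultiheadAttention(vision_dim, num_heads=8, batch_first=True)
compressed, _ = cross_attn(queries, patch_feats, patch_feats)
visual_tokens_q = nn.Linear(vision_dim, llm_dim)(compressed)
print(visual_tokens_q.shape)                                # torch.Size([1, 32, 4096])
```

The trade-off between the two is visible in the shapes: the projector route keeps one token per image patch (576 here), preserving spatial detail at the cost of LLM context, while the query-token route fixes the visual token budget regardless of image resolution.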