Stars
OpenMMLab Detection Toolbox and Benchmark
[ECCV 2024] official code for "Long-CLIP: Unlocking the Long-Text Capability of CLIP"
SEED-Story: Multimodal Long Story Generation with Large Language Model
Anole: An Open, Autoregressive, and Native Multimodal Model for Interleaved Image-Text Generation
🦜🔗 Build context-aware reasoning applications
[ICML'24 Oral] "MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions"
Autoregressive Model Beats Diffusion: 🦙 Llama for Scalable Image Generation
High-Resolution Image Synthesis with Latent Diffusion Models
Repository for Meta Chameleon, a mixed-modal early-fusion foundation model from FAIR.
[NeurIPS 2024 Best Paper][GPT beats diffusion🔥] [scaling laws in visual generation📈] Official impl. of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". An *ult…
An open source implementation of "Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning", an all-new multimodal AI that uses just a decoder to generate both text and images
Chinese version of CLIP, supporting Chinese cross-modal retrieval and representation generation
Enriching MS-COCO with Chinese sentences and tags for cross-lingual multimedia tasks
Densely Captioned Images (DCI) dataset repository.
(CVPR 2024) MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
Anserini is a Lucene toolkit for reproducible information retrieval research
Open source implementation of "Vision Transformers Need Registers"
COLA: Evaluate how well your vision-language model can Compose Objects Localized with Attributes!
Official code for paper "UniIR: Training and Benchmarking Universal Multimodal Information Retrievers" (ECCV 2024)
[EMNLP 2023 Demo] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding