[Feature] Support InstructBLIP #1685
base: dev

`configs/instructblip/README.md` (new file, +53 lines):

# InstructBLIP

> [InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning](https://arxiv.org/abs/2305.06500)

<!-- [ALGORITHM] -->

## Abstract

Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence. However, building general-purpose vision-language models is challenging due to the rich input distributions and task diversity resulting from the additional visual input. Although vision-language pretraining has been widely studied, vision-language instruction tuning remains under-explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pretrained BLIP-2 models. We gather 26 publicly available datasets, covering a wide variety of tasks and capabilities, and transform them into instruction tuning format. Additionally, we introduce an instruction-aware Query Transformer, which extracts informative features tailored to the given instruction. Trained on 13 held-in datasets, InstructBLIP attains state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and larger Flamingo models. Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA questions with image contexts). Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models. All InstructBLIP models are open-sourced.

<div align=center>
<img src="https://github.com/open-mmlab/mmpretrain/assets/48375204/4211e0d8-951f-48d0-b81d-34be2e777390" width="80%"/>
</div>
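
The key architectural idea from the abstract, the instruction-aware Query Transformer, can be summarized with a toy sketch. The code below is illustrative PyTorch only, not the mmpretrain implementation, and all class and tensor names are made up: learnable query tokens and embedded instruction tokens interact through self-attention, and the instruction-conditioned queries then cross-attend to the frozen image features before being handed to the LLM.

```python
import torch
import torch.nn as nn


class InstructionAwareQFormer(nn.Module):
    """Toy sketch of the instruction-aware Q-Former mechanism (illustrative only)."""

    def __init__(self, dim=768, num_queries=32, num_heads=12):
        super().__init__()
        self.query_tokens = nn.Parameter(torch.zeros(1, num_queries, dim))
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.num_queries = num_queries

    def forward(self, image_feats, instruction_embeds):
        b = image_feats.size(0)
        queries = self.query_tokens.expand(b, -1, -1)
        # Queries and instruction tokens interact via self-attention, so the
        # queries become conditioned on the instruction.
        x = torch.cat([queries, instruction_embeds], dim=1)
        x, _ = self.self_attn(x, x, x)
        queries = x[:, :self.num_queries]
        # The instruction-conditioned queries cross-attend to the image features.
        out, _ = self.cross_attn(queries, image_feats, image_feats)
        return out  # (B, num_queries, dim), fed to the LLM as soft prompts


# Smoke test with random tensors.
qformer = InstructionAwareQFormer()
img = torch.randn(2, 257, 768)    # e.g. ViT patch features
instr = torch.randn(2, 16, 768)   # embedded instruction tokens
print(qformer(img, instr).shape)  # torch.Size([2, 32, 768])
```

In the real model the Q-Former is a BERT-style module with cross-attention inserted every couple of layers (`cross_attention_freq=2` in the config later in this PR).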

## How to use it?

<!-- [TABS-BEGIN] -->

**Use the model**

```python
from mmpretrain import inference_model

result = inference_model('instructblip-vicuna7b_3rdparty-zeroshot_caption', 'demo/cat-dog.png')
print(result)
# {'pred_caption': 'a blanket next to each other in the grass\na cute puppy and kitten wallpapers'}
```

<!-- [TABS-END] -->
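
If you want the model object itself rather than a one-shot call (for example to inspect its Q-Former or plug it into a custom loop), `get_model` should also accept the registered name; this is a sketch under that assumption, with `pretrained=True` pulling the converted weights listed below.

```python
from mmpretrain import get_model

# Build the captioner from its registered name and load the released weights.
# Note: the Vicuna weights referenced in the config must be available locally.
model = get_model('instructblip-vicuna7b_3rdparty-zeroshot_caption',
                  pretrained=True)
model.eval()
print(type(model).__name__)  # expected: InstructBlipCaption
```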

## Models and results

For the Vicuna model, please refer to the [MiniGPT-4 page](https://github.com/Vision-CAIR/MiniGPT-4) for preparation guidelines.

> **Review comment:** please change to the InstructBLIP page: https://github.com/salesforce/LAVIS/tree/main/projects/instructblip

### Pretrained models

| Model                                               | Params (M) | Flops (G) | Config                                           | Download                                                                                                    |
| :-------------------------------------------------- | :--------: | :-------: | :----------------------------------------------: | :---------------------------------------------------------------------------------------------------------: |
| `instructblip-vicuna7b_3rdparty-zeroshot_caption`\* |  8121.32   |    N/A    | [config](instructblip-vicuna7b_8xb32_caption.py) | [model](https://download.openmmlab.com/mmclassification/v1/instructblip/instruct-blip_vicuna7b_trimmed.pth) |

*Models with \* are converted from the [official repo](https://github.com/salesforce/LAVIS/tree/main/projects/instructblip). The config files of these models are only for inference. We haven't reproduced the training results.*
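
Because the config is inference-only, a practical way to use the converted checkpoint is to point an image-caption inferencer at the config file and the weight URL from the table above. The snippet assumes `ImageCaptionInferencer` accepts a config path plus a `pretrained` checkpoint like other mmpretrain inferencers, and that the Vicuna weights referenced in the config are available locally; treat it as a sketch rather than the verified API.

```python
from mmpretrain import ImageCaptionInferencer

inferencer = ImageCaptionInferencer(
    model='configs/instructblip/instructblip-vicuna7b_8xb32_caption.py',
    pretrained='https://download.openmmlab.com/mmclassification/v1/'
               'instructblip/instruct-blip_vicuna7b_trimmed.pth')
result = inferencer('demo/cat-dog.png')[0]
print(result['pred_caption'])
```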

## Citation

```bibtex
@article{dai2023instructblip,
  title={InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning},
  author={Dai, Wenliang and Li, Junnan and Li, Dongxu and Tiong, Anthony Meng Huat and Zhao, Junqi and Wang, Weisheng and Li, Boyang and Fung, Pascale and Hoi, Steven},
  journal={arXiv preprint arXiv:2305.06500},
  year={2023}
}
```

`configs/instructblip/instructblip-vicuna7b_8xb32_caption.py` (new file, +77 lines):

_base_ = [
    '../_base_/datasets/coco_caption.py',
    '../_base_/default_runtime.py',
]

# model settings
model = dict(
    type='InstructBlipCaption',
    llm_tokenizer=dict(
        type='LlamaTokenizer',
        name_or_path=
        '/mnt/petrelfs/share_data/liuyuan/llm_weights/vicuna_weights_7b'),
    # Review comment: don't use our path
    vision_encoder=dict(
        type='BEiTViT',
        # eva-g without the final layer
        arch=dict(
            embed_dims=1408,
            num_layers=39,
            num_heads=16,
            feedforward_channels=6144,
        ),
        img_size=224,
        patch_size=14,
        out_indices=-2,
        layer_scale_init_value=0.0,
        use_abs_pos_emb=True,
        use_rel_pos_bias=False,
        frozen_stages=39,
        final_norm=False,
        use_shared_rel_pos_bias=False,
        out_type='raw',
        pretrained=  # noqa
        'https://download.openmmlab.com/mmpretrain/v1.0/minigpt4/minigpt-4_eva-g-p14_20230615-e908c021.pth'  # noqa
    ),
    text_backbone=dict(
        type='AutoModelForCausalLM',
        name_or_path=
        '/mnt/petrelfs/share_data/liuyuan/llm_weights/vicuna_weights_7b'),
    # Review comment: the same comment as above (don't use our path)
    Qformer=dict(
        type='Qformer',
        model_style='bert-base-uncased',
        vision_model_width=1408,
        add_cross_attention=True,
        cross_attention_freq=2,
        num_query_token=32),
    prompt='Write a short description for the image.',
    max_txt_len=30)

# schedule settings
optim_wrapper = dict(optimizer=dict(type='AdamW', lr=1e-5, weight_decay=0.05))

param_scheduler = [
    dict(
        type='CosineAnnealingLR',
        by_epoch=True,
        begin=0,
        end=10,
    )
]

train_cfg = dict(by_epoch=True, max_epochs=10)
val_cfg = dict()
test_cfg = dict()

# dataset settings
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='Resize',
        scale=(224, 224),
        interpolation='bicubic',
        backend='pillow'),
    dict(type='PackInputs', meta_keys=['image_id']),
]

val_dataloader = dict(dataset=dict(pipeline=test_pipeline))
test_dataloader = val_dataloader
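
As a quick sanity check that the config above parses, it can be loaded with mmengine's `Config`. Building the model additionally requires the Vicuna tokenizer/weights referenced by `name_or_path` to exist locally, so the last step only works once those paths are adjusted; this is a sketch, not part of the PR.

```python
from mmengine.config import Config

import mmpretrain.models  # noqa: F401  (registers the model classes)
from mmpretrain.registry import MODELS

cfg = Config.fromfile(
    'configs/instructblip/instructblip-vicuna7b_8xb32_caption.py')
print(cfg.model.type)                     # 'InstructBlipCaption'
print(cfg.model.Qformer.num_query_token)  # 32

# Building the model needs the local Vicuna weights referenced in the config.
model = MODELS.build(cfg.model)
```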

Model metafile (new file, +33 lines):

Collections:
  - Name: InstructBLIP
    Metadata:
      Training Data:
        - COCO
        - VG
        - CC3M
        - CC12M
        - SBU
        - LAION-400M
      Architecture:
        - Transformer
        - Q-Former
    Paper:
      Title: 'InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning'
      URL: https://arxiv.org/abs/2305.06500
    README: configs/instructblip/README.md

Models:
  - Name: instructblip-vicuna7b_3rdparty-zeroshot_caption
    Metadata:
      FLOPs: null
      Parameters: xxx
    In Collection: InstructBLIP
    Results:
      - Task: Image Caption
        Dataset: COCO
        Metrics: null
    Weights: https://download.openmmlab.com/mmclassification/v1/instructblip/instruct-blip_vicuna7b_trimmed.pth
    Config: configs/instructblip/instructblip-vicuna7b_8xb32_caption.py
    Converted From:
      Weights: https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/InstructBLIP/instruct_blip_vicuna7b_trimmed.pth
      Code: https://github.com/salesforce/LAVIS
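
Once this metafile is picked up by the model index, the new entry should be discoverable from the Python API; a quick check (assuming mmpretrain is installed from this branch) could look like:

```python
from mmpretrain import list_models

# Glob over registered model names defined by the metafiles.
print(list_models('*instructblip*'))
# expected: ['instructblip-vicuna7b_3rdparty-zeroshot_caption']
```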

Package `__init__.py` (new file, +4 lines):

# Copyright (c) OpenMMLab. All rights reserved.
from .instructblip_caption import InstructBlipCaption

__all__ = ['InstructBlipCaption']
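
The imported `instructblip_caption.py` itself is not part of this diff. For readers unfamiliar with the OpenMMLab registry pattern, a hypothetical skeleton of what it must at minimum contain (so that the config's `type='InstructBlipCaption'` resolves) is sketched below; the names and signature are assumptions, not the actual file.

```python
# Hypothetical skeleton only; the real instructblip_caption.py in this PR
# is not shown in the diff above.
from mmengine.model import BaseModel

from mmpretrain.registry import MODELS


@MODELS.register_module()
class InstructBlipCaption(BaseModel):
    """Caption model wrapping a frozen ViT, a Q-Former and a Vicuna LLM."""

    def __init__(self, vision_encoder, Qformer, text_backbone, llm_tokenizer,
                 prompt='', max_txt_len=30, data_preprocessor=None):
        super().__init__(data_preprocessor=data_preprocessor)
        # The real implementation builds the sub-modules from these config
        # dicts, e.g. self.vision_encoder = MODELS.build(vision_encoder).
        ...
```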