
Viewpoint Integration and Registration with Vision Language Foundation Model for Image Change Understanding

Xiaonan Lu, Jianlong Yuan*, Ruigang Niu, Yuan Hu, Fan Wang

DAMO Academy, Alibaba Group

Paper: Viewpoint Integration and Registration with Vision Language Foundation Model for Image Change Understanding (arXiv:2309.08585)

Overview

  • VIR-VLFM is the first attempt to enhance the multi-image understanding ability of vision language foundation models, enabling them to be applied to image change understanding (ICU).
  • In VIR-VLFM, a fused adapter image encoder is devised to bridge the gap between image encoder pre-training and ICU. In addition, a viewpoint registration flow and a semantic emphasizing module are designed to reduce the severe performance degradation caused by viewpoint variations.
  • Extensive experiments on CLEVR-Change and Spot-the-Diff demonstrate that our method achieves state-of-the-art performance on image change captioning across all metrics and shows promising results on change question answering.

(Figure: overview of the VIR-VLFM framework)

Getting Started

Installation

1. Prepare the code and the environment

Clone our repository and create a Python environment:

git clone https://github.com/lxn96/VIR-VLFM
cd VIR-VLFM
conda env create -f environment.yml
conda activate vir_vlfm

2. Prepare Vicuna weights

Please refer to the instructions to prepare the Vicuna-7B weights. The weight directory structure is as follows:

vicuna_weights_7b
├── config.json
├── generation_config.json
├── pytorch_model.bin.index.json
├── pytorch_model-00001-of-00002.bin
...   

Then, set the path to the Vicuna weights in the model config file at Line 15.
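
For reference, the relevant entry typically looks like the snippet below. This is a minimal sketch assuming a MiniGPT-4-style model config; the exact file and key name (llama_model here) are assumptions and may differ in this repository.

# model config file, around Line 15 (key name assumed)
model:
  llama_model: "/path/to/vicuna_weights_7b"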

3. Prepare the MiniGPT-4 pretrained weights for the LLaMA projection layer

Download the pretrained checkpoint. Then, set the path to this checkpoint in the training config file at Line 13.
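
The corresponding entry in the training config might look like the sketch below; the key name (ckpt) is an assumption based on MiniGPT-4-style configs.

# train_configs/vir_vlfm_clevr_change.yaml, around Line 13 (key name assumed)
model:
  ckpt: "/path/to/pretrained_minigpt4_checkpoint.pth"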

4. Prepare the checkpoint for VIR-VLFM on CLEVR-Change

Download the checkpoint for VIR-VLFM. Then, set the path to this checkpoint in the evaluation config file at Line 14.
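
Similarly, the evaluation config entry might look like the following sketch (key name assumed):

# eval_configs/vir_vlfm_eval.yaml, around Line 14 (key name assumed)
model:
  ckpt: "/path/to/vir_vlfm_clevr_change_checkpoint.pth"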

Evaluation

For evaluation, run test.py as follows:

python test.py --img1 test_images/image1_1.png --img2 test_images/image1_2.png --cfg-path eval_configs/vir_vlfm_eval.yaml  --gpu-id 0

Training

For training, first prepare the CLEVR-Change dataset. Then, set the path to the dataset in the dataset config file (a config sketch follows the directory listing below). The structure of the CLEVR-Change dataset is as follows:

dataset
├── clevr_change
    ├── images
    ├── nsc_images
    ├── sc_images
    ├── splits.json
    ├── change_captions.json
    ├── no_change_captions.json
    ...   
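
As a reference, the dataset entry in the config might look like the sketch below; the key names (clevr_change, data_root) are hypothetical and may differ from the actual dataset config in this repository.

# dataset config file (key names assumed)
datasets:
  clevr_change:
    data_root: "/path/to/dataset/clevr_change"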

Then, run train.py with the training config file as follows:

torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/vir_vlfm_clevr_change.yaml

Acknowledgement

  • BLIP-2: The model architecture of VIR-VLFM follows BLIP-2.
  • Vicuna: The large language model used in VIR-VLFM is Vicuna-7B.
  • MiniGPT-4: This repository is built upon MiniGPT-4.

Citation

If you're using VIR-VLFM in your research or applications, please cite using this BibTeX:

@article{lu2023viewpoint,
  title={Viewpoint Integration and Registration with Vision Language Foundation Model for Image Change Understanding},
  author={Lu, Xiaonan and Yuan, Jianlong and Niu, Ruigang and Hu, Yuan and Wang, Fan},
  journal={arXiv preprint arXiv:2309.08585},
  year={2023}
}
