Viewpoint Integration and Registration with Vision Language Foundation Model for Image Change Understanding

Xiaonan Lu, Jianlong Yuan*, Ruigang Niu, Yuan Hu, Fan Wang

DAMO Academy, Alibaba Group

Paper: Viewpoint Integration and Registration with Vision Language Foundation Model for Image Change Understanding (arXiv:2309.08585)

Overview

  • VIR-VLFM is the first attempt to enhance the multi-image understanding ability of vision-language foundation models, enabling them to be applied to image change understanding (ICU).
  • In VIR-VLFM, a fused adapter image encoder is devised to bridge the gap between image-encoder pre-training and ICU. In addition, a viewpoint registration flow and a semantic emphasizing module are designed to reduce the severe performance degradation caused by viewpoint variations (a conceptual sketch of how these components compose is given after the overview figure).
  • Extensive experiments on CLEVR-Change and Spot-the-Diff show that our method achieves state-of-the-art performance in image change captioning on all metrics and promising results in change question answering.

Overview figure of VIR-VLFM.
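
The README describes the pipeline only at this high level, so here is a purely conceptual Python sketch of how the named components might compose at inference time. Every name below is a hypothetical placeholder rather than the released API, and the exact composition in the code may differ; see the paper for the precise formulation.

# Conceptual sketch only; all names are hypothetical placeholders.
# It illustrates the order in which the components named above could be applied
# to a "before"/"after" image pair for change captioning.
def describe_change(img_before, img_after, encoder, registration, emphasis, llm):
    # Fused adapter image encoder: extract ICU-adapted features for each image.
    feat_before, feat_after = encoder(img_before), encoder(img_after)
    # Viewpoint registration flow: align the two feature maps across viewpoints.
    reg_before, reg_after = registration(feat_before, feat_after)
    # Semantic emphasizing module: highlight the features relevant to real changes.
    visual_tokens = emphasis(reg_before, reg_after)
    # Vicuna-7B generates the change description conditioned on the visual tokens.
    return llm.generate(visual_tokens, prompt="Describe the difference between the two images.")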

Getting Started

Installation

1. Prepare the code and the environment

Git clone our repository and create a Python environment:

git clone https://github.com/lxn96/VIR-VLFM
cd VIR-VLFM
conda env create -f environment.yml
conda activate vir_vlfm
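
As a quick sanity check after activating the environment (this assumes environment.yml installs PyTorch, which is not verified here), you can confirm that PyTorch sees a GPU:

# Sanity check for the freshly created vir_vlfm environment.
# Assumes environment.yml installs PyTorch; adjust if your setup differs.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available: ", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))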

2. Prepare Vicuna weights

Please refer to the instructions to prepare the Vicuna-7B weights. The weight directory should be structured as follows:

vicuna_weights_7b
├── config.json
├── generation_config.json
├── pytorch_model.bin.index.json
├── pytorch_model-00001-of-00002.bin
...   

Then, set the path to the Vicuna weights in the model config file at Line 15.
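
As an optional check that the weights are complete and readable (vicuna_weights_7b below is just the example directory name from above, and Hugging Face transformers is assumed to be available in the environment):

# Optional check of the Vicuna weight folder; use your own path.
from pathlib import Path
from transformers import AutoConfig

weights_dir = Path("vicuna_weights_7b")
for name in ["config.json", "generation_config.json", "pytorch_model.bin.index.json"]:
    assert (weights_dir / name).exists(), f"missing {name}"

config = AutoConfig.from_pretrained(weights_dir)
print("Model type:", config.model_type)    # expected: "llama"
print("Hidden size:", config.hidden_size)  # 4096 for the 7B model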

3. Prepare the MiniGPT-4 pretrained weights for the LLaMA projection layer

Download the pretrained checkpoint. Then, set its path in the training config file at Line 13.

4. Prepare the checkpoint for VIR-VLFM on CLEVR-Change

Download the VIR-VLFM checkpoint. Then, set its path in the evaluation config file at Line 14.
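
If you want to confirm the download before pointing the config at it, a minimal inspection with torch.load is enough (the filename below is a placeholder for whatever the downloaded checkpoint is named):

# Minimal inspection of the downloaded checkpoint; the filename is a placeholder.
import torch

ckpt = torch.load("vir_vlfm_clevr_change.pth", map_location="cpu")
print("Top-level keys:", list(ckpt.keys()))
# MiniGPT-4-style checkpoints usually store the weights under a "model" key.
if "model" in ckpt:
    print("Number of parameter tensors:", len(ckpt["model"]))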

Evaluation

For evaluation, run test.py as follows:

python test.py --img1 test_images/image1_1.png --img2 test_images/image1_2.png --cfg-path eval_configs/vir_vlfm_eval.yaml  --gpu-id 0
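
Here --img1 and --img2 take the "before" and "after" images of a pair (the paths above point to the example images in test_images/), and the script generates a description of the change between them using the checkpoint set in the evaluation config file.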

Training

For training, first prepare the CLEVR-Change dataset and set its path in the dataset config file. The CLEVR-Change dataset is structured as follows:

dataset
├── clevr_change
    ├── images
    ├── nsc_images
    ├── sc_images
    ├── splits.json
    ├── change_captions.json
    ├── no_change_captions.json
    ...   
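
Before launching training, a short check that the expected files are in place can save a failed run (this assumes the layout above under a dataset/ folder and that splits.json maps split names to lists of image ids, as in the original CLEVR-Change release; adjust the root path to wherever you placed the data):

# Verify the CLEVR-Change layout matches the tree above; adjust root if needed.
import json
from pathlib import Path

root = Path("dataset/clevr_change")
for sub in ["images", "nsc_images", "sc_images"]:
    assert (root / sub).is_dir(), f"missing directory: {sub}"
for name in ["splits.json", "change_captions.json", "no_change_captions.json"]:
    assert (root / name).is_file(), f"missing file: {name}"

with open(root / "splits.json") as f:
    splits = json.load(f)
print("Split sizes:", {k: len(v) for k, v in splits.items()})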

Then run train.py with the training config file, setting NUM_GPU to the number of GPUs to use:

torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/vir_vlfm_clevr_change.yaml

Acknowledgement

  • BLIP-2: the model architecture of VIR-VLFM follows BLIP-2.
  • Vicuna: the large language model used in VIR-VLFM is Vicuna-7B.
  • MiniGPT-4: this repository is built upon MiniGPT-4.

Citation

If you're using VIR-VLFM in your research or applications, please cite using this BibTeX:

@article{lu2023viewpoint,
  title={Viewpoint Integration and Registration with Vision Language Foundation Model for Image Change Understanding},
  author={Lu, Xiaonan and Yuan, Jianlong and Niu, Ruigang and Hu, Yuan and Wang, Fan},
  journal={arXiv preprint arXiv:2309.08585},
  year={2023}
}