Viewpoint Integration and Registration with Vision Language Foundation Model for Image Change Understanding
Xiaonan Lu, Jianlong Yuan*, Ruigang Niu, Yuan Hu, Fan Wang
DAMO Academy, Alibaba Group
- VIR-VLFM is the first attempt to enhance the multi-image understanding ability of vision language foundation models, enabling them to be applied to image change understanding (ICU).
- In VIR-VLFM, a fused adapter image encoder is devised to bridge the gap between image encoder pre-training and ICU. In addition, a viewpoint registration flow and a semantic emphasizing module are designed to mitigate the severe performance degradation caused by viewpoint variations.
- Extensive experiments on CLEVR-Change and Spot-the-Diff show that our method achieves state-of-the-art performance on all image change captioning metrics and yields promising results in change question answering.
1. Prepare the code and the environment
Git clone our repository and create a Python environment:
git clone https://github.com/lxn96/VIR-VLFM
cd VIR-VLFM
conda env create -f environment.yml
conda activate vir_vlfm
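Optionally, sanity-check the new environment before moving on (this assumes the environment file installs PyTorch with CUDA support):
# quick check that PyTorch imports and a GPU is visible
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"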
2. Prepare Vicuna weights
Please refer to the instruction to prepare the Vicuna-7B weights. The directory structure is as follows:
vicuna_weights_7b
├── config.json
├── generation_config.json
├── pytorch_model.bin.index.json
├── pytorch_model-00001-of-00002.bin
...
Then, set the path to the Vicuna weights in the model config file at Line 15.
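As a rough sketch, the corresponding entry in the model config might look like the following; the llama_model key name follows the MiniGPT-4 convention this repository builds on, so check the actual field name in the config file:
model:
  # hypothetical key name; point it at the vicuna_weights_7b directory prepared above
  llama_model: "/path/to/vicuna_weights_7b"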
3. Prepare the MiniGPT-4 pretrained weights for the LLaMA projection layer
Download the pretrained checkpoint. Then, set the path to it in the training config file at Line 13.
4. Prepare the checkpoint for VIR-VLFM on CLEVR-Change
Download the checkpoint for VIR-VLFM. Then, set the path to this checkpoint in the evaluation config file at Line 14.
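In both the training config (Line 13) and the evaluation config (Line 14), the checkpoint path is typically a single field. A minimal sketch, assuming a MiniGPT-4-style ckpt key that may be named differently in the actual config files:
model:
  # hypothetical key name; use the MiniGPT-4 projection-layer weights for training
  # and the VIR-VLFM CLEVR-Change checkpoint for evaluation
  ckpt: "/path/to/checkpoint.pth"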
For evaluation, run test.py as follows:
python test.py --img1 test_images/image1_1.png --img2 test_images/image1_2.png --cfg-path eval_configs/vir_vlfm_eval.yaml --gpu-id 0
For training, first prepare the CLEVR-Change dataset. Then, set the path to the dataset in the dataset config file (see the sketch after the listing below). The structure of the CLEVR-Change dataset is as follows:
dataset
├── clevr_change
│   ├── images
│   ├── nsc_images
│   ├── sc_images
│   ├── splits.json
│   ├── change_captions.json
│   ├── no_change_captions.json
...
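A minimal sketch of the dataset path entry, assuming a simple root-path field (the actual key names in the dataset config file may differ):
datasets:
  clevr_change:
    # hypothetical keys; point them at the directory layout shown above
    data_root: "dataset/clevr_change"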
Then run train.py with the training config file as follows:
torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/vir_vlfm_clevr_change.yaml
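For example, on a single node with 4 GPUs:
torchrun --nproc-per-node 4 train.py --cfg-path train_configs/vir_vlfm_clevr_change.yaml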
- BLIP-2: The model architecture of VIR-VLFM follows BLIP-2.
- Vicuna: The large language model used in VIR-VLFM is Vicuna-7B.
- MiniGPT-4: This repository is built upon MiniGPT-4.
If you're using VIR-VLFM in your research or applications, please cite using this BibTeX:
@article{lu2023viewpoint,
title={Viewpoint Integration and Registration with Vision Language Foundation Model for Image Change Understanding},
author={Lu, Xiaonan and Yuan, Jianlong and Niu, Ruigang and Hu, Yuan and Wang, Fan},
journal={arXiv preprint arXiv:2309.08585},
year={2023}
}