Official PyTorch implementation of TRIS, from the following paper:
Referring Image Segmentation Using Text Supervision. ICCV 2023.
Fang Liu*, Yuhao Liu*, Yuqiu Kong, Ke Xu, Lihe Zhang, Baocai Yin, Gerhard Hancke, Rynson Lau
We recommend running the code with PyTorch 1.13.1 or later.
```
├── data/
│   ├── train2014
│   ├── refer
│   │   ├── refcocog
│   │   │   ├── instances.json
│   │   │   ├── refs(google).p
│   │   │   ├── refs(umd).p
│   │   ├── refcoco
```
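If the files are in place, a quick sanity check like the one below should run from the repo root without errors. This is a minimal sketch, not part of the repo: it assumes `refs(umd).p` follows the standard `refer`-toolkit format (a pickled list of ref dicts) and that `instances.json` is in COCO format.

```python
import json
import pickle

# Load the G-Ref (UMD) referring expressions (standard `refer`-toolkit pickle).
with open('data/refer/refcocog/refs(umd).p', 'rb') as f:
    refs = pickle.load(f)
print(f'{len(refs)} refs loaded')
print(refs[0]['sentences'][0]['sent'])  # first expression of the first ref

# Load the COCO-format instance annotations.
with open('data/refer/refcocog/instances.json') as f:
    instances = json.load(f)
print(f"{len(instances['images'])} images, {len(instances['annotations'])} annotations")
```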
```
├── data/
│   ├── referit
│   │   ├── annotations
│   │   │   ├── train.pickle
│   │   │   ├── test.pickle
│   │   ├── images
│   │   ├── masks
```
If you want to generate the ReferIt annotations yourself, refer to MG for more details.
Note that we use mIoU to evaluate the accuracy of the generated masks.
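For reference, mIoU here is the per-sample intersection-over-union between predicted and ground-truth masks, averaged over the dataset. A minimal NumPy sketch of that definition (an illustration only, not the repo's evaluation code):

```python
import numpy as np

def mean_iou(preds, gts):
    """Mean IoU over paired lists of binary (H, W) masks."""
    ious = []
    for pred, gt in zip(preds, gts):
        pred, gt = pred.astype(bool), gt.astype(bool)
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        ious.append(inter / union if union > 0 else 1.0)
    return float(np.mean(ious))
```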
- Create the `./weights` directory:
```shell
mkdir ./weights
```
- Download the model weights via the GitHub links below and put them in `./weights`.
| | ReferIt | RefCOCO | RefCOCO+ | G-Ref (Google) | G-Ref (UMD) |
|---|---|---|---|---|---|
| Step-1 | weight | weight | weight | weight | weight |
| Step-2 | weight | weight | weight | weight | weight |
- Shell script for G-Ref (UMD) evaluation. For RefCOCO evaluation, replace `refcocog` with `refcoco` and `umd` with `unc` (i.e., `--dataset refcoco --splitBy unc`).
```shell
bash scripts/validate_stage1.sh
```
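For example, the underlying `validate.py` call for RefCOCO would mirror the Step-1 validation command shown later in this README, with the dataset flags swapped. A sketch (the checkpoint filename is a placeholder; see `scripts/validate_stage1.sh` for the exact arguments):

```shell
python validate.py --batch_size 1 --size 320 \
    --dataset refcoco --splitBy unc --test_split val \
    --max_query_len 20 --output ./weights/ --resume \
    --pretrain stage1_refcoco_unc.pth --eval   # placeholder checkpoint name
```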
The output of the demo is saved in `./figs/`.
```shell
python demo.py --img figs/demo.png --text 'man on the right'
```
- Train the Step-1 network on the G-Ref (UMD) dataset.
```shell
bash scripts/train_stage1.sh
```
- Validate and generate response maps on the G-Ref (UMD) `train` set using the proposed PRMS strategy (`--prms`). The response maps are saved in `./output/refcocog_umd/cam/`, as specified by the `--cam_save_dir` argument.
```shell
## path to save response maps and pseudo labels
dir=./output

python validate.py --batch_size 1 --size 320 \
    --dataset refcocog --splitBy umd --test_split train \
    --max_query_len 20 --output ./weights/ --resume \
    --pretrain stage1_refcocog_umd.pth --cam_save_dir $dir/refcocog_umd/cam/ \
    --name_save_dir $dir/refcocog_umd --eval --prms --save_cam
```
- Train IRNet and generate pseudo masks. IRNet refines the response maps from the previous step into pseudo instance masks, written to `$dir/refcocog_umd/ins_seg`.
```shell
cd IRNet
dir=../output

## single GPU
CUDA_VISIBLE_DEVICES=0 python run_sample_refer.py \
    --voc12_root ../../../work/datasets/train2014 \
    --cam_out_dir $dir/refcocog_umd/cam \
    --ir_label_out_dir $dir/refcocog_umd/ir_label \
    --ins_seg_out_dir $dir/refcocog_umd/ins_seg \
    --cam_eval_thres 0.15 \
    --work_space output_refer/refcocog_umd \
    --train_list $dir/refcocog_umd/refcocog_train_names.json \
    --num_workers 2 \
    --irn_batch_size 24 \
    --cam_to_ir_label_pass True \
    --train_irn_pass True \
    --make_ins_seg_pass True

## the code runs faster if more GPUs are available:
# CUDA_VISIBLE_DEVICES=0,1,2,3 python run_sample_refer.py --cam_out_dir $dir/refcocog_umd/cam --ir_label_out_dir $dir/refcocog_umd/ir_label --ins_seg_out_dir $dir/refcocog_umd/ins_seg --train_list $dir/refcocog_umd/refcocog_train_names.json --cam_eval_thres 0.15 --work_space output_refer/refcocog_umd --num_workers 8 --irn_batch_size 96 --cam_to_ir_label_pass True --train_irn_pass True --make_ins_seg_pass True
```
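Optionally, spot-check the generated pseudo masks before moving on. A minimal sketch, not part of the repo, run from the repo root; it assumes the files in `ins_seg` are readable as single-channel image files, which this README does not specify:

```python
import glob

import numpy as np
from PIL import Image

# List the generated pseudo masks and inspect one (hypothetical spot check).
paths = sorted(glob.glob('output/refcocog_umd/ins_seg/*'))
print(f'{len(paths)} files in ins_seg')

mask = np.array(Image.open(paths[0]))  # assumes image-format masks
print(mask.shape, mask.dtype, np.unique(mask)[:10])
```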
- Train the Step-2 network using the pseudo masks generated in `output/refcocog_umd/ins_seg`, as specified by the `--pseudo_path` argument.
```shell
cd ../
bash scripts/train_stage2.sh
## python train_stage2.py --batch_size 48 --size 320 --dataset refcocog --splitBy umd --test_split val --bert_tokenizer clip --backbone clip-RN50 --max_query_len 20 --epoch 15 --pseudo_path output/refcocog_umd/ins_seg --output ./weights/stage2/pseudo_refcocog_umd
```
This repository is built on LAVT, WWbL, CLIMS, and IRNet.
If you find this repository helpful, please consider citing:
```bibtex
@inproceedings{liu2023referring,
  title={Referring Image Segmentation Using Text Supervision},
  author={Liu, Fang and Liu, Yuhao and Kong, Yuqiu and Xu, Ke and Zhang, Lihe and Yin, Baocai and Hancke, Gerhard and Lau, Rynson},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={22124--22134},
  year={2023}
}
```
If you have any questions, please feel free to reach out at [email protected].