Some code in this repo is copied or modified from open-source implementations made available by UNITER, PyTorch, HuggingFace, OpenNMT, and Nvidia. The image features are extracted using BUTD.
The requirements below follow UNITER. We provide a Docker image for easier reproduction. Please install the following:
- NVIDIA driver (418+),
- Docker (19.03+),
- nvidia-container-toolkit.
Our scripts require the user to be a member of the docker group so that docker commands can be run without sudo. We only support Linux with NVIDIA GPUs. We test on Ubuntu 18.04 and V100 cards. We use mixed-precision training, so GPUs with Tensor Cores are recommended.
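The requirements above can be checked before running any script. The following is a minimal sketch and is not part of this repo: `version_ge` and `check_prereqs` are hypothetical helper names, the version comparison assumes GNU `sort -V` (available on Ubuntu), and the presence of nvidia-container-toolkit is not checked.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check; these helpers are not part of this repo.

# version_ge A B: succeed when version A >= version B (relies on GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# check_prereqs: print every unmet requirement and return non-zero if any is found.
check_prereqs() {
    local status=0 docker_ver driver_ver

    # Docker group membership is what allows docker to run without sudo.
    if ! id -nG | tr ' ' '\n' | grep -qx docker; then
        echo "user is not in the docker group" >&2
        status=1
    fi

    # Docker client version, e.g. 19.03.12 (empty if docker is not installed).
    docker_ver="$(docker version --format '{{.Client.Version}}' 2>/dev/null || true)"
    if ! version_ge "${docker_ver:-0}" 19.03; then
        echo "Docker 19.03+ required, found: ${docker_ver:-none}" >&2
        status=1
    fi

    # Driver version reported for the first GPU, e.g. 450.80.02.
    driver_ver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n1 || true)"
    if ! version_ge "${driver_ver:-0}" 418; then
        echo "NVIDIA driver 418+ required, found: ${driver_ver:-none}" >&2
        status=1
    fi

    return "$status"
}
```

After sourcing the file, `check_prereqs && echo "all good"` prints each unmet requirement on stderr, or `all good` when none is found.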
bash scripts/download_itm.sh $PATH_TO_STORAGE
The new txt_db file is available at https://drive.google.com/drive/folders/1ZOK3jlcgGRifz8iw2-5vL89rIJcoYU3D?usp=sharing. Please download it and use it to replace the original txt_db file.
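Before launching the container, it can help to confirm that the storage folder has the layout the launch command expects. This is a small sketch under assumptions: `check_storage` is a hypothetical helper, and the four folder names are simply the ones passed to launch_container.sh in this README.

```bash
# check_storage DIR: report which of the expected sub-folders are missing under DIR.
# Hypothetical helper; the folder names mirror the launch_container.sh arguments.
check_storage() {
    local root="$1" missing=0 d
    for d in txt_db img_db finetune pretrained; do
        if [ ! -d "$root/$d" ]; then
            echo "missing: $root/$d" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_storage "$PATH_TO_STORAGE" && echo "storage looks complete"`.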
# docker image should be automatically pulled
source launch_container.sh $PATH_TO_STORAGE/txt_db $PATH_TO_STORAGE/img_db \
$PATH_TO_STORAGE/finetune $PATH_TO_STORAGE/pretrained
In case you would like to reproduce the whole preprocessing pipeline.
The launch script respects the $CUDA_VISIBLE_DEVICES environment variable.
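For instance, to expose only a subset of GPUs, the variable can be exported before sourcing the launch script. The snippet below is an illustrative sketch: the device indices are arbitrary, and `num_visible_gpus` is a hypothetical helper that only parses the variable (it does not query the driver).

```bash
# Expose only GPUs 0 and 1 to the container (the indices are just an example).
export CUDA_VISIBLE_DEVICES=0,1

# num_visible_gpus: count the comma-separated entries in $CUDA_VISIBLE_DEVICES.
# An unset or empty variable is reported as 0 here.
num_visible_gpus() {
    local list="${CUDA_VISIBLE_DEVICES:-}"
    if [ -z "$list" ]; then
        echo 0
        return
    fi
    local IFS=','
    # Split the list on commas into the function's positional parameters.
    set -- $list
    echo "$#"
}
```

With the export in place, the `source launch_container.sh ...` command shown earlier is run unchanged.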
Note that the source code is mounted into the container under /src instead of
built into the image, so that user modifications are reflected without
rebuilding the image. (Data folders are mounted into the container separately
for flexibility in folder structure.)
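To make the mounting scheme concrete, here is a dry-run sketch that only prints a `docker run` command with bind mounts. It is not the content of launch_container.sh: `build_run_cmd` is a hypothetical helper, only the /src target comes from the text above, and the /txt and /img targets and the image name are placeholders.

```bash
# build_run_cmd REPO_DIR TXT_DB IMG_DB IMAGE
# Print (do not execute) a docker run command that uses bind mounts.
# Only /src is taken from this README; /txt, /img and IMAGE are placeholders.
build_run_cmd() {
    local repo_dir="$1" txt_db="$2" img_db="$3" image="$4"
    printf '%s ' docker run --rm -it --gpus all
    # Source code: bind-mounted read-write, so host edits are visible in the container.
    printf '%s ' --mount "type=bind,src=${repo_dir},dst=/src"
    # Data folders: separate read-only mounts, so they can live anywhere on the host.
    printf '%s ' --mount "type=bind,src=${txt_db},dst=/txt,readonly"
    printf '%s ' --mount "type=bind,src=${img_db},dst=/img,readonly"
    # Start in the mounted source folder.
    printf '%s %s %s\n' -w /src "$image"
}
```

Calling `build_run_cmd "$(pwd)" "$PATH_TO_STORAGE/txt_db" "$PATH_TO_STORAGE/img_db" IMAGE` prints the command for inspection. Because the repo is bind-mounted rather than copied, no image rebuild is needed after editing code.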
# Train with the base setting
bash run_cmds/tran_pnsgd_base_flickr.sh
bash run_cmds/tran_pnsgd2_base_flickr.sh
# Train with the large setting
bash run_cmds/tran_pnsgd_large_flickr.sh
bash run_cmds/tran_pnsgd2_large_flickr.sh
# Train with the base setting
bash run_cmds/tran_pnsgd_base_coco.sh
bash run_cmds/tran_pnsgd2_base_coco.sh
# Train with the large setting
bash run_cmds/tran_pnsgd_large_coco.sh
bash run_cmds/tran_pnsgd2_large_coco.sh
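If you want to run several of the training scripts above back to back, a small wrapper that stops at the first failure avoids starting a later script after an earlier one has crashed. This is a sketch only: `run_all` is a hypothetical helper, and running the scripts in the listed order is an assumption, not something the scripts enforce.

```bash
# run_all SCRIPT...: run each script with bash, in order, and stop at the first failure.
# Hypothetical helper; not part of this repo.
run_all() {
    local script
    for script in "$@"; do
        echo "=== $script ==="
        if ! bash "$script"; then
            echo "failed: $script" >&2
            return 1
        fi
    done
}
```

For example, `run_all run_cmds/tran_pnsgd_base_flickr.sh run_cmds/tran_pnsgd2_base_flickr.sh`.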
bash run_cmds/inf_nsgd.sh
Our models achieve the following performance.
Model | Image-to-Text R@1 | Image-to-Text R@5 | Image-to-Text R@10 | Text-to-Image R@1 | Text-to-Image R@5 | Text-to-Image R@10 |
---|---|---|---|---|---|---|
NSGDC-Base | 66.6 | 88.6 | 94.0 | 51.6 | 79.1 | 87.5 |
NSGDC-Large | 67.8 | 89.6 | 94.2 | 53.3 | 80.0 | 88.0 |

Model | Image-to-Text R@1 | Image-to-Text R@5 | Image-to-Text R@10 | Text-to-Image R@1 | Text-to-Image R@5 | Text-to-Image R@10 |
---|---|---|---|---|---|---|
NSGDC-Base | 87.9 | 98.1 | 99.3 | 74.5 | 93.3 | 96.3 |
NSGDC-Large | 90.6 | 98.8 | 99.1 | 77.3 | 94.3 | 97.3 |
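For readers unfamiliar with the metric, R@K (recall at K) is the percentage of queries whose ground-truth match appears among the top K retrieved items. The sketch below illustrates the computation on made-up ranks; `recall_at_k` is a hypothetical helper, not this repo's evaluation code, and it assumes one rank per input line, with 1 as the best rank (when a query has several correct items, the rank of the best-placed one).

```bash
# recall_at_k K: read one rank per line (1 = best) from stdin and
# print the percentage of queries whose rank is <= K, with one decimal.
recall_at_k() {
    awk -v k="$1" '$1 <= k { hit++ } END { if (NR) printf "%.1f\n", 100 * hit / NR }'
}

# Four toy queries whose ground truth was ranked 1st, 3rd, 12th and 7th.
printf '1\n3\n12\n7\n' | recall_at_k 5    # prints 50.0
```

With the same toy ranks, `recall_at_k 1` gives 25.0 and `recall_at_k 10` gives 75.0.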