Informative Attention Supervision for Grounded Video Description

This repo hosts the source code for our paper Informative Attention Supervision for Grounded Video Description. It supports the ActivityNet-Entities dataset.

Quick Start

Preparations

Follow steps 1 to 3 in the Requirements section to install the required packages.

Download everything

Simply run the following command to download all the data and pre-trained models (total 216GB):

bash tools/download_all.sh

Starter code

Run the following eval code to test whether your environment is set up:

python main.py --batch_size 20 --cuda --checkpoint_path save/gvd_starter --id gvd_starter --language_eval

You can now skip to the Training and Validation section!

Requirements (Recommended)

  1. Clone the repo recursively and make sure the submodules densevid_eval and coco-caption are included (an example clone command is given after this list).

  2. Rebuild the environment via Anaconda:

conda env create -f environment.yaml

  3. (Optional) If you choose not to use download_all.sh, be sure to install Java and download Stanford CoreNLP for SPICE (see here). Also, download the reference file and place it under coco-caption/annotations. Download Stanford CoreNLP 3.9.1 for grounding evaluation and place the uncompressed folder under the tools directory.
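A minimal sketch of steps 1 and 3, assuming the repository URL matches this GitHub project page and the standard archive name for Stanford CoreNLP 3.9.1; adjust URLs and paths to your setup:

# step 1: clone with submodules (repo URL assumed from the project page)
git clone --recursive https://github.com/wanboyang/IASGVD_ICASSP2022.git
cd IASGVD_ICASSP2022

# step 3 (optional): Stanford CoreNLP 3.9.1 for grounding evaluation (archive name assumed)
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-02-27.zip
unzip stanford-corenlp-full-2018-02-27.zip -d tools/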

Data Preparation

Updates on 04/15/2020: Feature files for the hidden test set, used in the ANet-Entities Object Localization Challenge 2020, are available for download (region features and frame-wise features). Make sure you move the additional *.npy files into your fc6_feat_100rois and rgb_motion_1d folders, respectively. The following files have been updated to include the hidden test set data or video IDs: anet_detection_vg_fc6_feat_100rois.h5, anet_entities_prep.tar.gz, and anet_entities_captions.tar.gz.

Download the preprocessed annotation files from here, uncompress them, and place them under data/anet. Alternatively, you can reproduce them using the data from the ActivityNet-Entities repo and the preprocessing script prepro_dic_anet.py under prepro. Then, download the ground-truth caption annotations (under our val/test splits) from here and place them under data/anet as well.
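As a hedged example, assuming the archive names from the update note above and that the commands are run from the repository root:

# archive names taken from the update note above; adjust if your downloads differ
mkdir -p data/anet
tar -xzf anet_entities_prep.tar.gz -C data/anet
tar -xzf anet_entities_captions.tar.gz -C data/anet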

The region features and detections are available for download (feature and detection). The region feature file should be decompressed and placed under your feature directory; we refer to this region feature directory as feature_root in the code. The H5 region detection (proposal) file is referred to as proposal_h5 in the code. To extract features for a customized dataset (or, for the brave, for ANet-Entities as well), refer to the feature extraction tool here.

The frame-wise appearance (with suffix _resnet.npy) and motion (with suffix _bn.npy) feature files are available here. We refer to this directory as seg_feature_root.

Other auxiliary files, such as the weights from the Detectron fc7 layer, are available here. Uncompress them and place them under the data directory.
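For orientation, one possible layout after all downloads finish. The exact placement of the feature directories is an assumption (only fc6_feat_100rois, rgb_motion_1d, and data/anet are named above), and the config file lets you point feature_root, proposal_h5, and seg_feature_root anywhere:

data/
  anet/                                  # preprocessed annotations and ground-truth captions
  ...                                    # auxiliary files (e.g. Detectron fc7 weights) uncompressed here
fc6_feat_100rois/                        # region features (feature_root)
anet_detection_vg_fc6_feat_100rois.h5    # region proposals (proposal_h5; file name assumed from the update note)
rgb_motion_1d/                           # frame-wise *_resnet.npy and *_bn.npy files (seg_feature_root)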

Training and Validation

Modify the config file cfgs/anet_res101_vg_feat_10x100prop_ip.yml with the correct dataset and feature paths (or through symlinks). Link tools/anet_entities to your ANet-Entities dataset root location. Create new directories log and results under the root directory to save log and result files.
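A minimal sketch of the symlink and directory setup described above, assuming the ANet-Entities annotation repo has been cloned to ~/anet_entities (that source path is an assumption) and that you are in the repository root:

# point tools/anet_entities at your ANet-Entities dataset root (source path assumed)
ln -sfn ~/anet_entities tools/anet_entities
# directories for training logs and result files
mkdir -p log results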

CUDA_VISIBLE_DEVICES=1,0 python main.py --path_opt cfgs/anet_res101_vg_feat_10x100prop_ip.yml  --batch_size 20 --cuda --checkpoint_path save/topdown_iou_iop_cls_attn_both3loss_w_att2_0.1_cuda11_accnum2e4 --id topdown_iou_iop_cls_attn_both3loss_w_att2_0.1_cuda11_accnum2e4 --language_eval --w_att2 0.1 --w_grd 0 --w_cls 0.1 --obj_interact --overlap_type Both --att_model topdown --learning_rate 2e-4 --densecap_verbose --loss_type both3 --acc_num 4 --iou_thresh 0.5 --iop_thresh 0.9 --mGPUs | tee log/topdown_iou_iop_cls_attn_both3loss_w_att2_0.1_cuda11_accnum2e4

(Optional) Remove --mGPUs to run in single-GPU mode.

Inference and Testing

For supervised models (ID=topdown_iou_iop_cls_attn_both3loss_w_att2_0.1_cuda11_accnum2e4):

CUDA_VISIBLE_DEVICES=1 python main.py --path_opt cfgs/anet_res101_vg_feat_10x100prop_ip.yml --batch_size 20 --cuda --num_workers 6 --max_epoch 50 --inference_only --start_from ./save/topdown_iou_iop_cls_attn_both3loss_w_att2_0.1_cuda11_accnum2e4 --id topdown_iou_iop_cls_attn_both3loss_w_att2_0.1_cuda11_accnum2e4 --val_split validation --densecap_verbose --seq_length 20 --language_eval --obj_interact --eval_obj_grounding --grd_reference ./tools/anet_entities/data/anet_entities_cleaned_class_thresh50_test_skeleton.json --eval_obj_grounding_gt | tee log/eval-validation_split-topdown_iou_iop_cls_attn_both3loss_w_att2_0.1_cuda11_accnum2e4-beam1-standard-inference
CUDA_VISIBLE_DEVICES=1 python main.py --path_opt cfgs/anet_res101_vg_feat_10x100prop_ip.yml --batch_size 20 --cuda --num_workers 6 --max_epoch 50 --inference_only --start_from ./save/topdown_iou_iop_cls_attn_both3loss_w_att2_0.1_cuda11_accnum2e4 --id topdown_iou_iop_cls_attn_both3loss_w_att2_0.1_cuda11_accnum2e4 --val_split testing --densecap_verbose --seq_length 20 --language_eval --obj_interact --eval_obj_grounding --grd_reference ./tools/anet_entities/data/anet_entities_cleaned_class_thresh50_test_skeleton.json --eval_obj_grounding_gt | tee log/eval-testing_split-topdown_iou_iop_cls_attn_both3loss_w_att2_0.1_cuda11_accnum2e4-beam1-standard-inference

For evaluation on the validation split, use dc_references='./data/anet/anet_entities_val_1.json ./data/anet/anet_entities_val_2.json', grd_reference='tools/anet_entities/data/anet_entities_cleaned_class_thresh50_trainval.json', and val_split='validation'.

You need at least 9GB of free GPU memory for the evaluation.

Reference

Please acknowledge the following paper if you use the code:

@inproceedings{wan2022informative,
  title={Informative Attention Supervision for Grounded Video Description},
  author={Wan, Boyang and Jiang, Wenhui and Fang, Yuming},
  booktitle={ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1955--1959},
  year={2022},
  organization={IEEE}
}

Acknowledgement

We thank the Grounded Video Description project.

License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree.

Portions of the source code are based on Grounded Video Description.
