Informative Attention Supervision for Grounded Video Description

This repo hosts the source code for our paper Informative Attention Supervision for Grounded Video Description. It supports the ActivityNet-Entities dataset.

Quick Start

Preparations

Follow steps 1 to 3 in the Requirements section to install the required packages.

Download everything

Simply run the following command to download all the data and pre-trained models (total 216GB):

bash tools/download_all.sh

Starter code

Run the following eval code to test whether your environment is set up:

python main.py --batch_size 20 --cuda --checkpoint_path save/gvd_starter --id gvd_starter --language_eval

You can now skip to the Training and Validation section!

Requirements (Recommended)

  1. Clone the repo recursively and make sure the submodules densevid_eval and coco-caption are included (an example clone command is given after this list).

  2. Rebuild the environment via Anaconda:

conda env create -f environment.yaml

  3. (Optional) If you choose not to use download_all.sh, be sure to install Java and download Stanford CoreNLP for SPICE (see here). Also, download the reference file and place it under coco-caption/annotations. Download Stanford CoreNLP 3.9.1 for grounding evaluation and place the uncompressed folder under the tools directory.
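A minimal sketch of steps 1 and 3, assuming the repository URL matches this GitHub project page and the standard archive name for Stanford CoreNLP 3.9.1; adjust URLs and paths to your setup:

# step 1: clone with submodules (repo URL assumed from the project page)
git clone --recursive https://github.com/wanboyang/IASGVD_ICASSP2022.git
cd IASGVD_ICASSP2022

# step 3 (optional): Stanford CoreNLP 3.9.1 for grounding evaluation (archive name assumed)
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-02-27.zip
unzip stanford-corenlp-full-2018-02-27.zip -d tools/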

Data Preparation

Updates on 04/15/2020: Feature files for the hidden test set, used in the ANet-Entities Object Localization Challenge 2020, are available for download (region features and frame-wise features). Make sure you move the additional *.npy files into your fc6_feat_100rois and rgb_motion_1d folders, respectively. The following files have been updated to include the hidden test set data or video IDs: anet_detection_vg_fc6_feat_100rois.h5, anet_entities_prep.tar.gz, and anet_entities_captions.tar.gz.

Download the preprocessed annotation files from here, uncompress them, and place them under data/anet. Alternatively, you can reproduce them using the data from the ActivityNet-Entities repo and the preprocessing script prepro_dic_anet.py under prepro. Then, download the ground-truth caption annotations (under our val/test splits) from here and place them under data/anet as well.
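As a hedged example, assuming the archive names from the update note above and that the commands are run from the repository root:

# archive names taken from the update note above; adjust if your downloads differ
mkdir -p data/anet
tar -xzf anet_entities_prep.tar.gz -C data/anet
tar -xzf anet_entities_captions.tar.gz -C data/anet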

The region features and detections are available for download (feature and detection). The region feature file should be decompressed and placed under your feature directory; we refer to this region feature directory as feature_root in the code. The H5 region detection (proposal) file is referred to as proposal_h5 in the code. To extract features for a customized dataset (or, for the brave, for ANet-Entities as well), refer to the feature extraction tool here.

The frame-wise appearance (with suffix _resnet.npy) and motion (with suffix _bn.npy) feature files are available here. We refer to this directory as seg_feature_root.

Other auxiliary files, such as the weights from the Detectron fc7 layer, are available here. Uncompress them and place them under the data directory.
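For orientation, one possible layout after all downloads finish. The exact placement of the feature directories is an assumption (only fc6_feat_100rois, rgb_motion_1d, and data/anet are named above), and the config file lets you point feature_root, proposal_h5, and seg_feature_root anywhere:

data/
  anet/                                  # preprocessed annotations and ground-truth captions
  ...                                    # auxiliary files (e.g. Detectron fc7 weights) uncompressed here
fc6_feat_100rois/                        # region features (feature_root)
anet_detection_vg_fc6_feat_100rois.h5    # region proposals (proposal_h5; file name assumed from the update note)
rgb_motion_1d/                           # frame-wise *_resnet.npy and *_bn.npy files (seg_feature_root)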

Training and Validation

Modify the config file cfgs/anet_res101_vg_feat_10x100prop_ip.yml with the correct dataset and feature paths (or through symlinks). Link tools/anet_entities to your ANet-Entities dataset root location. Create new directories log and results under the root directory to save log and result files.
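A minimal sketch of the symlink and directory setup described above, assuming the ANet-Entities annotation repo has been cloned to ~/anet_entities (that source path is an assumption) and that you are in the repository root:

# point tools/anet_entities at your ANet-Entities dataset root (source path assumed)
ln -sfn ~/anet_entities tools/anet_entities
# directories for training logs and result files
mkdir -p log results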

CUDA_VISIBLE_DEVICES=1,0 python main.py --path_opt cfgs/anet_res101_vg_feat_10x100prop_ip.yml  --batch_size 20 --cuda --checkpoint_path save/topdown_iou_iop_cls_attn_both3loss_w_att2_0.1_cuda11_accnum2e4 --id topdown_iou_iop_cls_attn_both3loss_w_att2_0.1_cuda11_accnum2e4 --language_eval --w_att2 0.1 --w_grd 0 --w_cls 0.1 --obj_interact --overlap_type Both --att_model topdown --learning_rate 2e-4 --densecap_verbose --loss_type both3 --acc_num 4 --iou_thresh 0.5 --iop_thresh 0.9 --mGPUs | tee log/topdown_iou_iop_cls_attn_both3loss_w_att2_0.1_cuda11_accnum2e4

(Optional) Remove --mGPUs to run in single-GPU mode.

Inference and Testing

For supervised models (ID=topdown_iou_iop_cls_attn_both3loss_w_att2_0.1_cuda11_accnum2e4):

CUDA_VISIBLE_DEVICES=1 python main.py --path_opt cfgs/anet_res101_vg_feat_10x100prop_ip.yml --batch_size 20 --cuda --num_workers 6 --max_epoch 50 --inference_only --start_from ./save/topdown_iou_iop_cls_attn_both3loss_w_att2_0.1_cuda11_accnum2e4 --id topdown_iou_iop_cls_attn_both3loss_w_att2_0.1_cuda11_accnum2e4 --val_split validation --densecap_verbose --seq_length 20 --language_eval --obj_interact --eval_obj_grounding --grd_reference ./tools/anet_entities/data/anet_entities_cleaned_class_thresh50_test_skeleton.json --eval_obj_grounding_gt | tee log/eval-validation_split-topdown_iou_iop_cls_attn_both3loss_w_att2_0.1_cuda11_accnum2e4-beam1-standard-inference
CUDA_VISIBLE_DEVICES=1 python main.py --path_opt cfgs/anet_res101_vg_feat_10x100prop_ip.yml --batch_size 20 --cuda --num_workers 6 --max_epoch 50 --inference_only --start_from ./save/topdown_iou_iop_cls_attn_both3loss_w_att2_0.1_cuda11_accnum2e4 --id topdown_iou_iop_cls_attn_both3loss_w_att2_0.1_cuda11_accnum2e4 --val_split testing --densecap_verbose --seq_length 20 --language_eval --obj_interact --eval_obj_grounding --grd_reference ./tools/anet_entities/data/anet_entities_cleaned_class_thresh50_test_skeleton.json --eval_obj_grounding_gt | tee log/eval-testing_split-topdown_iou_iop_cls_attn_both3loss_w_att2_0.1_cuda11_accnum2e4-beam1-standard-inference

For evaluation on the validation split, use dc_references='./data/anet/anet_entities_val_1.json ./data/anet/anet_entities_val_2.json', grd_reference='tools/anet_entities/data/anet_entities_cleaned_class_thresh50_trainval.json', and val_split='validation'.

You need at least 9GB of free GPU memory for the evaluation.

Reference

Please acknowledge the following paper if you use the code:

@inproceedings{wan2022informative,
  title={Informative Attention Supervision for Grounded Video Description},
  author={Wan, Boyang and Jiang, Wenhui and Fang, Yuming},
  booktitle={ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1955--1959},
  year={2022},
  organization={IEEE}
}

Acknowledgement

We thank the Grounded Video Description project.

License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree.

Portions of the source code are based on Grounded Video Description.
