This repository implements Self-Distillation for Few-Shot Image Captioning (SD-FSIC).
If you use our code or data, please cite our paper:
@InProceedings{xianyu:2021:sd-fsic,
author = {Xianyu Chen and Ming Jiang and Qi Zhao},
title = {Self-Distillation for Few-Shot Image Captioning},
booktitle = {Winter Conference on Applications of Computer Vision (WACV)},
year = {2021}
}
We adopt the PyTorch implementation of Self-critical Sequence Training for Image Captioning (self-critical.pytorch) as the baseline model for the few-shot image captioner, and we use the features provided in that repository. Please refer to its README for further information.
- Python 2 or 3 (coco-caption supports Python 3)
- PyTorch 1.3 (along with torchvision)
- cider (download it into the current folder SD-FSIC/)
- coco-caption (download it into the current folder SD-FSIC/; remember to follow the initialization steps in coco-caption/README.md)
- yacs
- I also provide the conda environment file sc_rtl.yml; you can directly run
$ conda env create -f sc_rtl.yml
to create the same environment in which I successfully ran the code.
You can follow the instructions in data/README.md to create the corresponding data. More specifically, download the preprocessed files or pre-extracted features from the provided link.
You need to download at least the following files, unzip the archives, and put them in the data folder.
- coco-train-idxs.p
- coco-train-words.p
- cocotalk_label.h5
- cocotalk.json
- dataset_coco.json
- cocotalk_fc.zip
- cocotalk_att.zip
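As a quick sanity check before training, you can verify that the required files are in place. This is a minimal sketch, not part of the repository; the extracted feature directory names cocotalk_fc/ and cocotalk_att/ are assumptions based on the zip names:

```python
import os

# Preprocessed files listed above, expected inside data/.
required_files = [
    "coco-train-idxs.p",
    "coco-train-words.p",
    "cocotalk_label.h5",
    "cocotalk.json",
    "dataset_coco.json",
]
# Assumed directory names left behind after unzipping the feature archives.
required_dirs = ["cocotalk_fc", "cocotalk_att"]

missing = [f for f in required_files if not os.path.isfile(os.path.join("data", f))]
missing += [d for d in required_dirs if not os.path.isdir(os.path.join("data", d))]

if missing:
    print("Missing from data/: " + ", ".join(missing))
else:
    print("All required data files found.")
```

Once the data is in place, start training: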
$ python train.py --cfg configs/fc.yml --id sd-fsic
The training script dumps checkpoints into the folder specified by --checkpoint_path (default: log_$id/). You can set the corresponding hyper-parameters in configs/fc.yml.
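Since yacs is listed as a dependency, configs/fc.yml is presumably loaded as a yacs config. Below is a minimal sketch of that pattern; the default values are placeholders for illustration, not the repository's actual defaults:

```python
from yacs.config import CfgNode as CN

# Illustrative defaults only; the real values live in configs/fc.yml.
_C = CN()
_C.paired_percentage = 1.0
_C.language_pretrain_epoch = 10
_C.paired_train_epoch = 30
_C.random_seed = 42
_C.alpha = 0.999
_C.hyper_parameter_lambda_x = 1.0
_C.hyper_parameter_lambda_y = 1.0
_C.std_pseudo_visual_feature = 0.1
_C.number_of_models = 4
_C.inner_iteration = 100

cfg = _C.clone()
cfg.merge_from_file("configs/fc.yml")  # values in fc.yml override the defaults
cfg.freeze()
```

The individual options are described below.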
- --paired_percentage: the percentage of the training set for which images and sentences are paired.
- --language_pretrain_epoch: the number of epochs used to pretrain the model.
- --paired_train_epoch: the number of epochs used to train the model with image-caption pairs.
- --random_seed: the seed used to select the image-caption pairs.
- --alpha: the smoothing coefficient for the Mean Teacher (see the sketch after this list).
- --hyper_parameter_lambda_x: a hyper-parameter that balances the unsupervised terms in the total loss (see the sketch after this list).
- --hyper_parameter_lambda_y: a hyper-parameter that balances the unsupervised terms in the total loss.
- --std_pseudo_visual_feature: the standard deviation of the pseudo visual feature.
- --number_of_models: the number of base models in the model ensemble.
- --inner_iteration: the total number of inner-optimization iterations used to generate the pseudo latent feature.
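To make the roles of --alpha, --hyper_parameter_lambda_x, and --hyper_parameter_lambda_y concrete, here is a minimal sketch of the standard Mean Teacher EMA update and a weighted total loss. The function and variable names are illustrative assumptions, not the repository's actual code:

```python
import torch

def update_teacher(teacher, student, alpha):
    """Mean Teacher update: the teacher's weights track an exponential
    moving average of the student's weights, controlled by alpha."""
    with torch.no_grad():
        for t_p, s_p in zip(teacher.parameters(), student.parameters()):
            t_p.mul_(alpha).add_(s_p, alpha=1.0 - alpha)

def total_loss(paired_loss, unsup_loss_x, unsup_loss_y, lambda_x, lambda_y):
    # Hypothetical combination: supervised loss on paired data plus two
    # unsupervised terms weighted by lambda_x and lambda_y.
    return paired_loss + lambda_x * unsup_loss_x + lambda_y * unsup_loss_y
```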
We provide the corresponding results on the COCO test set in sd-fsic.json.
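If you want to inspect individual captions, the results file can be loaded directly. A minimal sketch, assuming sd-fsic.json follows the standard COCO captioning results format (a list of image_id/caption records):

```python
import json

with open("sd-fsic.json") as f:
    results = json.load(f)

# Assumed standard COCO results format: [{"image_id": ..., "caption": ...}, ...]
for entry in results[:5]:
    print(entry["image_id"], entry["caption"])
```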
Furthermore, we also provide the pretrained model that produced the above results. Download log_sd-fsic.zip and unzip it into the current folder SD-FSIC/. Then run the following script to evaluate our model on the Karpathy test split:
$ python multi_eval_ensemble.py \
--dump_images 0 \
--num_images 5000 \
--beam_size 5 \
--language_eval 1 \
--model log_sd-fsic/model-best.pth \
--infos_path log_sd-fsic/infos_sd-fsic-best.pkl
The corresponding results are listed below:

BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE_L | CIDEr | SPICE | WMD
---|---|---|---|---|---|---|---|---
64.5 | 45.9 | 32.1 | 22.5 | 20.0 | 46.7 | 62.4 | 12.7 | 14.7