This paper has been published in the journal "Knowledge-Based Systems".
- Title: Image Paragraph Captioning with Topic Clustering and Topic Shift Prediction
- Authors: Ting Tang, Jiansheng Chen, Yiqing Huang, Huimin Ma, Yudong Zhang, Hongwei Yu
- Publication Date: 2024/1/18
- Journal: Knowledge-Based Systems
The paper can be accessed and downloaded via the following link: Download Paper
Image paragraph captioning involves generating a semantically coherent paragraph describing an image’s visual content. The selection and shifting of sentence topics are critical when a human describes an image. However, previous hierarchical image paragraph captioning methods have not fully explored or utilized sentence topics. In particular, the continuous and implicit modeling of topics in these methods makes it difficult to supervise the topic prediction process explicitly. We propose a new method called topic clustering and topic shift prediction (TCTSP) to solve this problem. Topic clustering (TC) in the sentence embedding space generates semantically explicit and discrete topic labels that can be directly used to supervise topic prediction. By introducing a topic shift probability matrix that characterizes human topic shift patterns, topic shift prediction (TSP) predicts subsequent topics that are both logical and consistent with human habits based on visual features and language context. TCTSP can be combined with various image paragraph captioning model structures to improve performance. Extensive experiments were conducted on the Stanford image paragraph dataset, and superior results were reported compared with previous state-of-the-art approaches. In particular, TCTSP improved the consensus-based image description evaluation (CIDEr) performance of image paragraph captioning to 41.67%. The codes are available at
For citing this paper, please use the following format:
title = {Image paragraph captioning with topic clustering and topic shift prediction},
journal = {Knowledge-Based Systems},
volume = {286},
pages = {111401},
year = {2024},
issn = {0950-7051},
doi = {},
url = {},
author = {Ting Tang and Jiansheng Chen and Yiqing Huang and Huimin Ma and Yudong Zhang and Hongwei Yu}
The codebase is tested under the following environment settings:
- cuda: 10.1
- numpy: 1.19.5
- python: 3.6.13
- pytorch: 1.4.0
- torchvision: 0.5.0
- coco-caption (put pycocoevalcap under path TCTSP/)
For more detailed environment settings, please refer to TCTSP/environment.yml:
conda env create -f environment.yml
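Once the environment is created, a quick check can confirm that the interpreter and packages match the versions listed above. The following is a minimal sketch and not part of the TCTSP codebase; the helper names `installed_versions` and `version_mismatches` are hypothetical, and it assumes each package exposes a `__version__` attribute (`torch` is the import name for pytorch).

```python
import importlib
import sys

# Versions listed in this README; "torch" is the import name for pytorch.
EXPECTED = {
    "python": "3.6.13",
    "numpy": "1.19.5",
    "torch": "1.4.0",
    "torchvision": "0.5.0",
}


def installed_versions(names):
    """Return {name: version string or None} for the interpreter and packages."""
    found = {}
    for name in names:
        if name == "python":
            found[name] = "{}.{}.{}".format(*sys.version_info[:3])
            continue
        try:
            module = importlib.import_module(name)
            found[name] = getattr(module, "__version__", None)
        except ImportError:
            # Package is not installed in the active environment.
            found[name] = None
    return found


def version_mismatches(expected, found):
    """List (name, expected, found) for every entry whose version differs.

    Local build suffixes such as "+cu101" are ignored when comparing.
    """
    mismatches = []
    for name, want in expected.items():
        have = found.get(name)
        base = have.split("+")[0] if have else None
        if base != want:
            mismatches.append((name, want, have))
    return mismatches


if __name__ == "__main__":
    found = installed_versions(EXPECTED)
    for name, want, have in version_mismatches(EXPECTED, found):
        print("{}: expected {}, found {}".format(name, want, have))
```

Run inside the activated conda environment, the script prints nothing when all listed versions match.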
We extracted features for the images in the Stanford image paragraph dataset using Faster R-CNN and uploaded them. To obtain them:
Download res101_10_100_ray.tar.gz from:
Extract to the TCTSP/ directory using the following command:
tar -xzvf res101_10_100_ray.tar.gz
The remaining data needed for the experiments is packaged in data_vg.tar.gz. To obtain it:
Download data_vg.tar.gz from:
Extract to the TCTSP/ directory using the following command:
tar -xzvf data_vg.tar.gz
To obtain our pre-trained model:
Download caption_model_57.pth from:
Make a snapshot folder:
mkdir ./experiments/Xlan_SAP_V6_kmeans_wt03_RL_wt05_CIDEr_25_test/snapshot/
- Put caption_model_57.pth under path TCTSP/experiments/Xlan_SAP_V6_kmeans_wt03_RL_wt05_CIDEr_25_test/snapshot/
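Before running evaluation, it can help to confirm that the files ended up where the commands in this README expect them. The sketch below is not part of the TCTSP codebase; `missing_files` is a hypothetical helper, and it checks only two paths that appear literally in this README: the checkpoint location above and the file passed to `--markov_mat_path` in the evaluation command. The directory produced by extracting the feature archive is not named here, so it is not checked.

```python
from pathlib import Path

EXPERIMENT = "experiments/Xlan_SAP_V6_kmeans_wt03_RL_wt05_CIDEr_25_test"

# Paths copied from the commands in this README, relative to TCTSP/.
REQUIRED = [
    Path(EXPERIMENT) / "snapshot" / "caption_model_57.pth",
    Path("data") / "markov_mat_kmeans.npy",
]


def missing_files(root, required=REQUIRED):
    """Return the required relative paths that are not regular files under root."""
    root = Path(root)
    return [str(path) for path in required if not (root / path).is_file()]


if __name__ == "__main__":
    # Run from the TCTSP/ directory.
    for path in missing_files("."):
        print("missing:", path)
```

An empty output means both files were found.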
In the image paragraph captioning task, we only compute BLEU, METEOR, and CIDEr, so the other metrics in line 47 of TCTSP/pycocoevalcap/ need to be deleted.
To evaluate the pre-trained model, run the following command:
CUDA_VISIBLE_DEVICES=0 python --folder ./experiments/Xlan_SAP_V6_kmeans_wt03_RL_wt05_CIDEr_25_test --resume 57 --markov_mat_path ./data/markov_mat_kmeans.npy
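The `--markov_mat_path` argument points to the topic shift probability matrix mentioned in the abstract. This README does not state how `markov_mat_kmeans.npy` was built, so the following is only an illustrative sketch of one plausible construction: count first-order transitions between the discrete topic labels of consecutive sentences and normalize each row. The function name, the input format, and the uniform fallback for unseen topics are all assumptions.

```python
def estimate_topic_shift_matrix(topic_sequences, num_topics):
    """Estimate P(next topic | current topic) from per-paragraph topic labels.

    topic_sequences: list of paragraphs, each a list of integer topic labels
    (one label per sentence, in order). This input format is an assumption.
    """
    counts = [[0] * num_topics for _ in range(num_topics)]
    for sequence in topic_sequences:
        # Count each transition between consecutive sentences.
        for current, following in zip(sequence, sequence[1:]):
            counts[current][following] += 1

    matrix = []
    for row in counts:
        total = sum(row)
        if total == 0:
            # Topic never followed by another sentence: uniform row (assumption).
            matrix.append([1.0 / num_topics] * num_topics)
        else:
            matrix.append([count / total for count in row])
    return matrix


if __name__ == "__main__":
    # Two toy paragraphs with three topic clusters.
    paragraphs = [[0, 1, 2], [0, 1, 1]]
    for row in estimate_topic_shift_matrix(paragraphs, num_topics=3):
        print(row)
```

Each row of the result sums to 1, so it can be read as a distribution over the next sentence's topic given the current one.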
Part of the code is borrowed from image-captioning. We thank the authors for releasing their code.