This code implements individualized visual scanpath prediction for three different tasks (four datasets) with two different architectures:
- Free-viewing: predicting scanpaths as observers look at salient or important objects in a given image. (OSIE, OSIE-ASD)
- Visual question answering: predicting scanpaths while humans perform general tasks, e.g., visual question answering, reflecting their attention and reasoning processes. (AiR-D)
- Visual search: predicting scanpaths while observers search for a given target object, reflecting goal-directed behavior. (COCO-Search18)
Understanding how attention varies across individuals has significant scientific and societal impacts. However, existing visual scanpath models treat attention uniformly, neglecting individual differences. To bridge this gap, this paper focuses on individualized scanpath prediction (ISP), a new attention modeling task that aims to accurately predict how different individuals shift their attention in diverse visual tasks. It proposes an ISP method featuring three novel technical components: (1) an observer encoder to characterize and integrate an observer's unique attention traits, (2) an observer-centric feature integration approach that holistically combines visual features, task guidance, and observer-specific characteristics, and (3) an adaptive fixation prioritization mechanism that refines scanpath predictions by dynamically prioritizing semantic feature maps based on individual observers' attention traits. These novel components allow scanpath models to effectively address the attention variations across different observers. Our method is generally applicable to different datasets, model architectures, and visual tasks, offering a comprehensive tool for transforming general scanpath models into individualized ones. Comprehensive evaluations using value-based and ranking-based metrics verify the method's effectiveness and generalizability.
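The paragraph above describes the three components only at a high level; the minimal PyTorch sketch below illustrates one way such observer conditioning could be wired up. The module name, tensor shapes, and FiLM-style fusion are assumptions made for illustration, not the released architecture; please refer to the dataset folders for the actual models.

```python
import torch
import torch.nn as nn

class ObserverConditionedHead(nn.Module):
    """Illustrative sketch of observer-conditioned feature processing.

    Layer sizes and the scale/shift fusion are assumptions, not the paper's exact design.
    """
    def __init__(self, num_observers, feat_dim=256, obs_dim=64):
        super().__init__()
        # (1) observer encoder: one learnable embedding per observer ID
        self.observer_encoder = nn.Embedding(num_observers, obs_dim)
        # (2) observer-centric feature integration: per-channel scale and shift
        self.to_scale = nn.Linear(obs_dim, feat_dim)
        self.to_shift = nn.Linear(obs_dim, feat_dim)
        # (3) adaptive fixation prioritization: reweight semantic feature maps
        self.priority = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.Sigmoid())

    def forward(self, visual_feats, observer_ids):
        # visual_feats: (B, feat_dim, H, W) semantic feature maps from a backbone
        obs = self.observer_encoder(observer_ids)        # (B, obs_dim)
        scale = self.to_scale(obs)[:, :, None, None]     # (B, feat_dim, 1, 1)
        shift = self.to_shift(obs)[:, :, None, None]
        fused = visual_feats * (1 + scale) + shift       # observer-centric integration
        weights = self.priority(obs)[:, :, None, None]   # per-observer channel priorities
        return fused * weights                           # prioritized features for decoding
```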
For the ScanMatch evaluation metric, we adopt part of the GazeParser package.
We adopt the implementations of SED and STDE from VAME as two of our evaluation metrics, as described in Visual Attention Models.
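As a reference for what SED measures, the following sketch computes a string edit distance between two scanpaths after quantizing fixations onto a coarse grid. It is a simplified, self-contained illustration with arbitrary grid size and helper names; the reported results use the adopted VAME implementation.

```python
import numpy as np

def to_string(fixations, width, height, nx=8, ny=6):
    """Map (x, y) fixations to grid-cell labels, producing a scanpath 'string'."""
    fix = np.asarray(fixations, dtype=float)
    xs = np.clip((fix[:, 0] / width * nx).astype(int), 0, nx - 1)
    ys = np.clip((fix[:, 1] / height * ny).astype(int), 0, ny - 1)
    return list(ys * nx + xs)

def string_edit_distance(s1, s2):
    """Levenshtein distance between two label sequences."""
    d = np.zeros((len(s1) + 1, len(s2) + 1), dtype=int)
    d[:, 0] = np.arange(len(s1) + 1)
    d[0, :] = np.arange(len(s2) + 1)
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1, d[i - 1, j - 1] + cost)
    return d[len(s1), len(s2)]

# Example: compare a predicted and a ground-truth scanpath on an 800x600 image
pred = [(100, 120), (400, 300), (700, 500)]
gt = [(110, 130), (420, 280), (200, 450)]
sed = string_edit_distance(to_string(pred, 800, 600), to_string(gt, 800, 600))
```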
More specifically, we adopt the evaluation metrics provided in Scanpath.
For ChenLSTM and Gazeformer, we adopt the code released in Scanpath and Gazeformer, respectively.
Based on the checkpoint implementation from updown-baseline, we slightly modify it to accommodate our pipeline.
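For context, adapting an external checkpoint to a modified pipeline typically looks like the hedged sketch below; the checkpoint path and the key-renaming rule are hypothetical placeholders, not the exact modifications applied in this repository.

```python
import torch

def load_adapted_checkpoint(model, ckpt_path="checkpoints/updown_baseline.pth"):
    """Illustrative sketch of loading an external checkpoint into a modified model."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state_dict = ckpt.get("state_dict", ckpt)
    # Example adaptation: strip a DataParallel-style "module." prefix if present.
    state_dict = {
        (k[len("module."):] if k.startswith("module.") else k): v
        for k, v in state_dict.items()
    }
    # strict=False lets layers added for the new pipeline keep their fresh initialization.
    missing, unexpected = model.load_state_dict(state_dict, strict=False)
    return missing, unexpected
```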
- Python 3.9
- PyTorch 1.12.1 (along with torchvision)
- We also provide the conda environment file `user_scanpath.yml`; you can directly run

  `$ conda env create -f user_scanpath.yml`

  to create the same environment in which we successfully ran our code.
We provide the corresponding code and pretrained models for the four aforementioned datasets:
- OSIE
- OSIE-ASD
- COCOSearch
- AiR-D
More details of these tasks are provided in their corresponding folders.
If you use our code or data, please cite our paper:
@InProceedings{xianyu:2024:individualscanpath,
  author    = {Xianyu Chen and Ming Jiang and Qi Zhao},
  title     = {Beyond Average: Individualized Visual Scanpath Prediction},
  booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2024}
}