Xiangyu Zeng, Kunchang Li, Chenting Wang, Xinhao Li, Tianxiang Jiang, Ziang Yan, Songze Li, Yansong Shi, Zhengrong Yue, Yi Wang, Yali Wang, Yu Qiao, and Limin Wang
This paper proposes TimeSuite, a collection of new designs that adapt existing short-form video MLLMs for long video understanding, including a simple yet efficient framework for processing long video sequences, a high-quality video dataset for the grounded tuning of MLLMs, and a carefully designed instruction-tuning task that explicitly incorporates grounding supervision into the traditional QA format.
- State-of-the-art performance: VideoChat-T demonstrates strong performance on both long-form video question answering and temporal grounding.
- Highly efficient model architecture with exceptional inference speed: each video frame is encoded into just 3 tokens, so the FLOPs of our VideoChat-T are only 5.1% of those of LLaVA-OneVision.
- High-quality data:
  - We introduced the comprehensive dataset TimePro, which covers 9 task types with video sources from 15 different datasets.
  - We designed a novel Temporal Grounded Caption fine-tuning task to effectively mitigate hallucinations in MLLMs.
- 2025.02.12 The initial version of TimeSuite is now open-sourced. We welcome everyone to try it out!
- 2025.01.23 TimeSuite has been accepted by ICLR 2025.
- 2024.10.25 The paper of TimeSuite has been uploaded to arXiv.
- Create a new environment and run the following commands to install the necessary dependencies:
```shell
conda create --name TimeSuite
conda activate TimeSuite
pip install -r requirements.txt
```
- Download the model and code of TimeSuite from https://huggingface.co/Lanxingxuan/TimeSuite to the `./download` folder. (Please note that you also need to download Mistral-7B-Instruct-v0.2 to `./download/parameters`.)
- Search for all instances of `/path_to_the_timesuite_root_folder` and replace them with the directory of your TimeSuite root folder.
- Please search for all video dataset paths containing `s3://` and replace them with the corresponding video dataset paths on your server. (A minimal search-and-replace sketch follows this list.)
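If you prefer to script the placeholder replacement, the snippet below is a minimal sketch (our own suggestion, not a script shipped with the repo): it rewrites `/path_to_the_timesuite_root_folder` to a hypothetical `/your/timesuite/root` across common text file types, and the same approach can be adapted for the `s3://` dataset prefixes. A project-wide replace in your editor or `sed` works equally well.

```python
# Minimal sketch (not part of the TimeSuite repo): replace the placeholder root
# path in plain-text project files. "/your/timesuite/root" is a hypothetical
# target -- substitute your actual TimeSuite root folder.
from pathlib import Path

PLACEHOLDER = "/path_to_the_timesuite_root_folder"
NEW_ROOT = "/your/timesuite/root"  # <- adjust
TEXT_SUFFIXES = {".py", ".sh", ".json", ".yaml", ".yml", ".txt"}

for path in Path(".").rglob("*"):
    if not path.is_file() or path.suffix not in TEXT_SUFFIXES:
        continue
    text = path.read_text(encoding="utf-8", errors="ignore")
    if PLACEHOLDER in text:
        path.write_text(text.replace(PLACEHOLDER, NEW_ROOT), encoding="utf-8")
        print(f"patched {path}")
```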
- Run `demo/demo.ipynb` to see the demo provided in the paper, or try out videos and questions of your choice.
- Run `eval/eval_qa_tasks.ipynb` to test the general QA performance of the model.
- To test the temporal grounding capability of TimeSuite, run the following two steps in order (an illustrative note on the reported metric follows this list):
```shell
bash eval/test_grounding.sh
bash eval/get_grounding_result.sh
```
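For orientation only: temporal grounding is usually scored by the temporal IoU between a predicted segment and the ground-truth segment, reported as Recall@1 at IoU thresholds such as 0.5 or 0.7. The sketch below is a hypothetical illustration of that metric, not the code inside `eval/get_grounding_result.sh`.

```python
# Hypothetical illustration of a standard temporal-grounding metric (R@1 at IoU >= 0.5).
# This is NOT the repo's evaluation code; it only shows what such a score measures.
def temporal_iou(pred, gt):
    """IoU between two (start, end) segments, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_1(predictions, ground_truths, threshold=0.5):
    """Fraction of queries whose top-1 predicted segment reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)

# Toy usage: one hit (IoU = 0.8) and one miss (IoU ~= 0.11) -> R@1 = 0.5
print(recall_at_1([(10.0, 24.0), (3.0, 8.0)], [(12.0, 25.0), (6.0, 21.0)]))
```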
- Please configure the video dataset paths in `configs/instruction_data.py`.
- Modify `scripts/videochat_mistral/config_LinearP.py` and `scripts/videochat_mistral/config_LinearProAda.py` to adjust the model training parameter settings (a hypothetical illustration follows this list).
- Run `bash scripts/videochat_mistral/run_7b_stage4.sh` to initiate the fine-tuning of the model.
- To reproduce the fine-tuning results presented in the paper, you need to train the model in two stages. For detailed parameter settings, please refer to Appendix D of the paper.
- All data used for fine-tuning is now open-sourced. Please visit here to download.
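The actual field names inside `config_LinearP.py` and `config_LinearProAda.py` are defined by the repository; the snippet below is only a hypothetical illustration of the kind of settings such training configs typically expose (names like `batch_size`, `max_epoch`, and `lr` are assumptions, not the repo's fields). Check the config files themselves and Appendix D of the paper before editing.

```python
# Hypothetical example only: the real keys in config_LinearP.py / config_LinearProAda.py
# may differ -- inspect those files before changing anything.
train_overrides = dict(
    batch_size=8,                                # per-GPU batch size
    max_epoch=1,                                 # number of fine-tuning epochs
    optimizer=dict(lr=2e-5, weight_decay=0.02),  # learning rate and regularization
    num_frames=128,                              # frames sampled per long video
)
print(train_overrides)
```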
If you find this project useful in your research, please consider citing:
@misc{zeng2024timesuite,
title={TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning},
author={Xiangyu Zeng and Kunchang Li and Chenting Wang and Xinhao Li and Tianxiang Jiang and Ziang Yan and Songze Li and Yansong Shi and Zhengrong Yue and Yi Wang and Yali Wang and Yu Qiao and Limin Wang},
year={2024},
eprint={2410.19702},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2410.19702},
}
Thanks to the following open-source projects: