This repository is the official implementation of ActionLLM. In this work, we introduce ActionLLM, which leverages Large Language Models (LLMs) to anticipate long-term actions by treating video sequences as successive tokens. ActionLLM simplifies the model architecture and incorporates a Cross-Modality Interaction Block to enhance multimodal semantic understanding, achieving superior performance on benchmark datasets. Paper: https://arxiv.org/abs/2501.00795.
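At a high level, the observed video sequence is mapped into the LLM's token space so that long-term anticipation becomes sequential token prediction. The sketch below only illustrates this idea under assumed feature and embedding dimensions; it is not the released ActionLLM implementation, and the module name, `feat_dim`, and `llm_dim` are placeholders.

```python
import torch
import torch.nn as nn

class VisualTokenProjector(nn.Module):
    """Illustrative sketch: project pre-extracted video segment features
    (dimension `feat_dim`, assumed here) into the LLM token-embedding space
    (`llm_dim`, 4096 for LLaMA-7B) so segments can be treated as successive tokens."""
    def __init__(self, feat_dim: int = 2048, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(feat_dim, llm_dim)

    def forward(self, video_feats: torch.Tensor) -> torch.Tensor:
        # video_feats: (batch, num_segments, feat_dim) -> (batch, num_segments, llm_dim)
        return self.proj(video_feats)

# Example: 8 observed segments become 8 "visual tokens" that could be
# concatenated with text-token embeddings before being fed to the LLM.
projector = VisualTokenProjector()
visual_tokens = projector(torch.randn(1, 8, 2048))
print(visual_tokens.shape)  # torch.Size([1, 8, 4096])
```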
- Conda environment settings:
conda env create -f actionllm.yaml
conda activate actionllm
- Download the datasets (Breakfast and 50 Salads) from the dataset link.
- Download LLaMA-7B from https://huggingface.co/nyanko7/LLaMA-7B/tree/main .
- Download the text features from https://pan.baidu.com/s/1nXMxt9-IrxGt-zvC1JV9XQ?pwd=iana .
Create a directory './data' for the two datasets, the text features, and LLaMA-7B. Please ensure the data structure is as below:
data/
├── 50_salads/
│   ├── groundTruth/
│   ├── features/
│   ├── mapping.txt
│   └── splits/
├── breakfast/
│   ├── groundTruth/
│   ├── features/
│   ├── mapping.txt
│   └── splits/
├── text_feature/
│   ├── breakfast/
│   └── 50_salads/
└── weights/
    └── 7B/
        ├── checklist.chk
        ├── consolidated.00.pth
        ├── params.json
        └── ...
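Before training, you can quickly verify this layout with the snippet below; it is a convenience sketch (not part of the repository) that only checks that the paths listed above exist.

```python
import os

# Expected layout under ./data, mirroring the tree above.
EXPECTED = [
    "50_salads/groundTruth", "50_salads/features", "50_salads/mapping.txt", "50_salads/splits",
    "breakfast/groundTruth", "breakfast/features", "breakfast/mapping.txt", "breakfast/splits",
    "text_feature/breakfast", "text_feature/50_salads",
    "weights/7B/checklist.chk", "weights/7B/consolidated.00.pth", "weights/7B/params.json",
]

missing = [p for p in EXPECTED if not os.path.exists(os.path.join("./data", p))]
print("All paths present." if not missing else f"Missing: {missing}")
```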
- Please modify the path information in the .sh files and opts.py according to your file locations; an illustrative sketch of such path options is shown below.
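Path settings of this kind are usually exposed as command-line defaults. The snippet below is a hypothetical illustration of what such entries might look like; the actual argument names in opts.py may differ.

```python
import argparse

parser = argparse.ArgumentParser()
# Hypothetical path arguments; check opts.py for the real names and defaults.
parser.add_argument("--data_root", type=str, default="./data", help="root of datasets and weights")
parser.add_argument("--llama_path", type=str, default="./data/weights/7B", help="LLaMA-7B checkpoint directory")
parser.add_argument("--text_feature_dir", type=str, default="./data/text_feature", help="pre-extracted text features")
args = parser.parse_args([])
print(args.data_root, args.llama_path, args.text_feature_dir)
```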
- Train on Breakfast or 50 Salads:
./scripts/bf/train_bf.sh
./scripts/50s/train_50s.sh
- Download the checkpoint from https://pan.baidu.com/s/1P41BeTtxTebJP0OHXHXUSw?pwd=iana and run the evaluation scripts:
./scripts/bf/eval_bf.sh
./scripts/50s/eval_50s.sh
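To confirm the downloaded checkpoint loads correctly before evaluation, a minimal inspection sketch is shown below; the filename is an assumption, so substitute the file obtained from the Baidu link.

```python
import torch

# Assumed filename; replace with the actual checkpoint file you downloaded.
ckpt = torch.load("./actionllm_checkpoint.pth", map_location="cpu")
state = ckpt["model"] if isinstance(ckpt, dict) and "model" in ckpt else ckpt
print(f"{len(state)} entries, e.g. {list(state)[:5]}")
```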
If you find our code or paper useful, please consider citing:
@article{wang2025actionllm,
  title={Multimodal Large Models Are Effective Action Anticipators},
  author={Wang, Binglu and Tian, Yao and Wang, Shunzhou and Yang, Le},
  journal={IEEE Transactions on Multimedia},
  year={2025},
  publisher={IEEE}
}
This repo borrows some data and code from LLaMA, FUTR, and LaVIN. Thanks for their great work.