This repository is the official implementation of ActionLLM. In this work, we introduce ActionLLM, which leverages Large Language Models (LLMs) to anticipate long-term actions by treating video sequences as successive tokens. ActionLLM simplifies the model architecture and incorporates a Cross-Modality Interaction Block to enhance multimodal semantic understanding, achieving superior performance on benchmark datasets. Paper: https://arxiv.org/abs/2501.00795.
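At a high level, the observed video sequence is mapped into the LLM's token space so that long-term anticipation becomes sequential token prediction. The sketch below only illustrates this idea under assumed feature and embedding dimensions; it is not the released ActionLLM implementation, and the module name, `feat_dim`, and `llm_dim` are placeholders.

```python
import torch
import torch.nn as nn

class VisualTokenProjector(nn.Module):
    """Illustrative sketch: project pre-extracted video segment features
    (dimension `feat_dim`, assumed here) into the LLM token-embedding space
    (`llm_dim`, 4096 for LLaMA-7B) so segments can be treated as successive tokens."""
    def __init__(self, feat_dim: int = 2048, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(feat_dim, llm_dim)

    def forward(self, video_feats: torch.Tensor) -> torch.Tensor:
        # video_feats: (batch, num_segments, feat_dim) -> (batch, num_segments, llm_dim)
        return self.proj(video_feats)

# Example: 8 observed segments become 8 "visual tokens" that could be
# concatenated with text-token embeddings before being fed to the LLM.
projector = VisualTokenProjector()
visual_tokens = projector(torch.randn(1, 8, 2048))
print(visual_tokens.shape)  # torch.Size([1, 8, 4096])
```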
- Conda environment settings:
conda env create -f actionllm.yaml
conda activate actionllm
- Download the datasets (Breakfast and 50 Salads) from the dataset link.
- Download LLaMA-7B from https://huggingface.co/nyanko7/LLaMA-7B/tree/main .
- Download the text features from https://pan.baidu.com/s/1nXMxt9-IrxGt-zvC1JV9XQ?pwd=iana .
Create a directory './data' for the two datasets, the text features, and LLaMA-7B. Please ensure the data structure is as below:
data/
├── 50_salads/
│   ├── groundTruth/
│   ├── features/
│   ├── mapping.txt
│   └── splits/
├── breakfast/
│   ├── groundTruth/
│   ├── features/
│   ├── mapping.txt
│   └── splits/
├── text_feature/
│   ├── breakfast/
│   └── 50_salads/
└── weights/
    └── 7B/
        ├── checklist.chk
        ├── consolidated.00.pth
        ├── params.json
        └── ...
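Before training, you can quickly verify this layout with the snippet below; it is a convenience sketch (not part of the repository) that only checks that the paths listed above exist.

```python
import os

# Expected layout under ./data, mirroring the tree above.
EXPECTED = [
    "50_salads/groundTruth", "50_salads/features", "50_salads/mapping.txt", "50_salads/splits",
    "breakfast/groundTruth", "breakfast/features", "breakfast/mapping.txt", "breakfast/splits",
    "text_feature/breakfast", "text_feature/50_salads",
    "weights/7B/checklist.chk", "weights/7B/consolidated.00.pth", "weights/7B/params.json",
]

missing = [p for p in EXPECTED if not os.path.exists(os.path.join("./data", p))]
print("All paths present." if not missing else f"Missing: {missing}")
```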
- Please modify the path information in the .sh files and opts.py according to your file locations; an illustrative sketch of such path options is shown below.
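Path settings of this kind are usually exposed as command-line defaults. The snippet below is a hypothetical illustration of what such entries might look like; the actual argument names in opts.py may differ.

```python
import argparse

parser = argparse.ArgumentParser()
# Hypothetical path arguments; check opts.py for the real names and defaults.
parser.add_argument("--data_root", type=str, default="./data", help="root of datasets and weights")
parser.add_argument("--llama_path", type=str, default="./data/weights/7B", help="LLaMA-7B checkpoint directory")
parser.add_argument("--text_feature_dir", type=str, default="./data/text_feature", help="pre-extracted text features")
args = parser.parse_args([])
print(args.data_root, args.llama_path, args.text_feature_dir)
```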
- Train on Breakfast or 50 Salads:
./scripts/bf/train_bf.sh
./scripts/50s/train_50s.sh
- Download the checkpoint from https://pan.baidu.com/s/1P41BeTtxTebJP0OHXHXUSw?pwd=iana and run the evaluation scripts:
./scripts/bf/eval_bf.sh
./scripts/50s/eval_50s.sh
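To confirm the downloaded checkpoint loads correctly before evaluation, a minimal inspection sketch is shown below; the filename is an assumption, so substitute the file obtained from the Baidu link.

```python
import torch

# Assumed filename; replace with the actual checkpoint file you downloaded.
ckpt = torch.load("./actionllm_checkpoint.pth", map_location="cpu")
state = ckpt["model"] if isinstance(ckpt, dict) and "model" in ckpt else ckpt
print(f"{len(state)} entries, e.g. {list(state)[:5]}")
```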
If you find our code or paper useful, please consider citing:
@article{wang2025actionllm,
  title={Multimodal Large Models Are Effective Action Anticipators},
  author={Wang, Binglu and Tian, Yao and Wang, Shunzhou and Yang, Le},
  journal={IEEE Transactions on Multimedia},
  year={2025},
  publisher={IEEE}
}
This repo borrows some data and code from LLaMA, FUTR, and LaVIN. Thanks for their great work.