Skip to content

MCG-NJU/VideoMAE-Action-Detection

Repository files navigation

VideoMAE for Action Detection (NeurIPS 2022 Spotlight) [Arxiv]

VideoMAE Framework

License: CC BY-NC 4.0
PWC

VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
Zhan Tong, Yibing Song, Jue Wang, Limin Wang
Nanjing University, Tencent AI Lab

This repo contains the supported code and scripts to reproduce action detection results of VideoMAE. The code of pre-training is available in original repo.

📰 News

[2023.1.16] Code and pre-trained models are available now!

🚀 Main Results

✨ AVA 2.2

Method Extra Data Extra Label Backbone #Frame x Sample Rate mAP
VideoMAE Kinetics-400 ViT-S 16x4 22.5
VideoMAE Kinetics-400 ViT-S 16x4 28.4
VideoMAE Kinetics-400 ViT-B 16x4 26.7
VideoMAE Kinetics-400 ViT-B 16x4 31.8
VideoMAE Kinetics-400 ViT-L 16x4 34.3
VideoMAE Kinetics-400 ViT-L 16x4 37.0
VideoMAE Kinetics-400 ViT-H 16x4 36.5
VideoMAE Kinetics-400 ViT-H 16x4 39.5
VideoMAE Kinetics-700 ViT-L 16x4 36.1
VideoMAE Kinetics-700 ViT-L 16x4 39.3

🔨 Installation

Please follow the instructions in INSTALL.md.

➡️ Data Preparation

Please follow the instructions in DATASET.md for data preparation.

⤴️ Fine-tuning with pre-trained models

The fine-tuning instruction is in FINETUNE.md.

📍Model Zoo

We provide pre-trained and fine-tuned models in MODEL_ZOO.md.

☎️ Contact

Zhan Tong: [email protected]

👍 Acknowledgements

Thanks to Lei Chen for support. This project is built upon MAE-pytorch, BEiT and AlphAction. Thanks to the contributors of these great codebases.

🔒 License

The majority of this project is released under the CC-BY-NC 4.0 license as found in the LICENSE file. Portions of the project are available under separate license terms: pytorch-image-models are licensed under the Apache 2.0 license. BEiT is licensed under the MIT license.

✏️ Citation

If you think this project is helpful, please feel free to leave a star⭐️ and cite our paper:

@inproceedings{tong2022videomae,
  title={Video{MAE}: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training},
  author={Zhan Tong and Yibing Song and Jue Wang and Limin Wang},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022}
}

@article{videomae,
  title={VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training},
  author={Tong, Zhan and Song, Yibing and Wang, Jue and Wang, Limin},
  journal={arXiv preprint arXiv:2203.12602},
  year={2022}
}