VideoMAE Installation

Environment Configuration

The codebase is mainly built with the following libraries:

  • Python 3.6 or higher
  • PyTorch and torchvision.
    We can successfully reproduce the main results in two settings:
    Tesla A100 (40G): CUDA 11.1 + PyTorch 1.8.0 + torchvision 0.9.0
    Tesla V100 (32G): CUDA 10.1 + PyTorch 1.6.0 + torchvision 0.7.0
    The PyTorch version has a significant impact on the results; we recommend configuring the environment with these settings or with newer versions.
  • timm==0.4.8/0.4.12
  • deepspeed==0.5.8
  • TensorboardX
  • decord
  • einops
  • av
  • tqdm


We recommend setting up the environment with Anaconda; a step-by-step installation script is shown below.

conda create -n VideoMAE_ava python=3.7
conda activate VideoMAE_ava

# install PyTorch with the same CUDA version as your environment
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113

conda install av -c conda-forge
conda install cython
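
The remaining libraries from the list above can then be installed with pip; the following is a possible continuation of the script (versions taken from the list above, adjust to your setup):

pip install timm==0.4.12 deepspeed==0.5.8 tensorboardX decord einops tqdm

# optional sanity check: the results are sensitive to the PyTorch version,
# so confirm the PyTorch / CUDA / torchvision versions actually in use
python -c "import torch, torchvision; print(torch.__version__, torch.version.cuda, torchvision.__version__)"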

Data Preparation

AVA2.2

The code combines VideoMAE and Alphaction; the AVA data preparation follows that of Alphaction. If you only need to train and test on the AVA dataset, you do not need to prepare the Kinetics dataset.

Kinetics

Other_files

video_map.npy: mapping from each video id to its corresponding video path

ak_val_gt.csv: ground truth for the validation set of AVA-Kinetics
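
A minimal way to inspect video_map.npy, assuming it stores a pickled Python dict mapping each video id to a path (adjust if the file uses a different layout):

python -c "import numpy as np; m = np.load('video_map.npy', allow_pickle=True).item(); print(len(m), 'entries'); print(list(m.items())[:2])"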

AVA-Kinetics file

To make downloading easier, we have collected the annotation files we used; they are available for download from OneDrive. Note that these files may differ slightly from the officially provided ones. In particular, for Kinetics, the annotation version we use may be older, and some videos may differ from the ones you would download now.

Train

Here is a script that trains on the AVA-Kinetics dataset and evaluates on the AVA dataset. The command below is for node 0 (--node_rank=0); see the multi-node note after the script.

MODEL_PATH='YOUR_PATH/PRETRAIN_MODEL.pth'
OUTPUT_DIR='YOUR_PATH/OUTPUT_DIR'
python -m torch.distributed.launch --nproc_per_node=8 \
      --master_port 12320 --nnodes=8 \
      --node_rank=0 --master_addr=$ip_node_0 \
      run_class_finetuning.py \
      --model vit_large_patch16_224 \
      --finetune ${MODEL_PATH} \
      --log_dir ${OUTPUT_DIR} \
      --output_dir ${OUTPUT_DIR} \
      --batch_size 8 \
      --update_freq 1 \
      --num_sample 1 \
      --input_size 224 \
      --save_ckpt_freq 1 \
      --num_frames 16 \
      --sampling_rate 4 \
      --opt adamw \
      --lr 0.00025 \
      --opt_betas 0.9 0.999 \
      --weight_decay 0.05 \
      --epochs 30 \
      --data_set "ava-kinetics" \
      --enable_deepspeed \
      --val_freq 30 \
      --drop_path 0.2
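
A note on multi-node use (not part of the original command): with --nnodes=8, torch.distributed.launch expects the same command to be run on every node, changing only --node_rank per node while --master_addr stays pointed at node 0. For example, on the second node (a sketch; the remaining flags are identical to the node-0 command above):

# python -m torch.distributed.launch --nproc_per_node=8 --master_port 12320 \
#        --nnodes=8 --node_rank=1 --master_addr=$ip_node_0 \
#        run_class_finetuning.py <same flags as the node-0 command>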
  • SLURM ENV
MODEL_PATH='YOUR_PATH/PRETRAIN_MODEL.pth'
OUTPUT_DIR='YOUR_PATH/OUTPUT_DIR'
PARTITION=${PARTITION:-"video"}
GPUS=${GPUS:-32}
GPUS_PER_NODE=${GPUS_PER_NODE:-8}
CPUS_PER_TASK=${CPUS_PER_TASK:-12}
SRUN_ARGS=${SRUN_ARGS:-""}
PY_ARGS=${@:2}
srun -p ${PARTITION} \
     --gres=gpu:${GPUS_PER_NODE} \
     --ntasks=${GPUS} \
     --ntasks-per-node=${GPUS_PER_NODE} \
     --cpus-per-task=${CPUS_PER_TASK} \
     ${SRUN_ARGS} \
     python -u run_class_finetuning.py \
      --model vit_large_patch16_224 \
      --finetune ${MODEL_PATH} \
      --log_dir ${OUTPUT_DIR} \
      --output_dir ${OUTPUT_DIR} \
      --batch_size 8 \
      --update_freq 1 \
      --num_sample 1 \
      --input_size 224 \
      --save_ckpt_freq 1 \
      --num_frames 16 \
      --sampling_rate 4 \
      --opt adamw \
      --lr 0.00025 \
      --opt_betas 0.9 0.999 \
      --weight_decay 0.05 \
      --epochs 30 \
      --data_set "ava" \
      --enable_deepspeed \
      --val_freq 30 \
      --drop_path 0.2 \
      ${PY_ARGS}

Eval

  • SLURM ENV
DATA_PATH='YOUR_PATH/list_kinetics-400'   # it can be any string in our task
MODEL_PATH='YOUR_PATH/PRETRAIN_MODEL.pth'
OUTPUT_DIR='YOUR_PATH/OUTPUT_DIR'
PARTITION=${PARTITION:-"video"}
GPUS=${GPUS:-32}
GPUS_PER_NODE=${GPUS_PER_NODE:-8}
CPUS_PER_TASK=${CPUS_PER_TASK:-12}
SRUN_ARGS=${SRUN_ARGS:-""}
PY_ARGS=${@:2}
srun -p ${PARTITION} \
     --gres=gpu:${GPUS_PER_NODE} \
     --ntasks=${GPUS} \
     --ntasks-per-node=${GPUS_PER_NODE} \
     --cpus-per-task=${CPUS_PER_TASK} \
     ${SRUN_ARGS} \
     python -u run_class_finetuning.py \
      --model vit_large_patch16_224 \
      --data_path ${DATA_PATH} \
      --finetune ${MODEL_PATH} \
      --log_dir ${OUTPUT_DIR} \
      --output_dir ${OUTPUT_DIR} \
      --batch_size 4 \
      --update_freq 1 \
      --num_sample 1 \
      --input_size 224 \
      --save_ckpt_freq 1 \
      --num_frames 16 \
      --sampling_rate 4 \
      --opt adamw \
      --lr 0.00025 \
      --opt_betas 0.9 0.999 \
      --weight_decay 0.05 \
      --epochs 30 \
      --data_set "ava" \
      --enable_deepspeed \
      --val_freq 30 \
      --drop_path 0.2 \
      --eval \
      ${PY_ARGS}