- python 3.7
- torch 1.11.0
- torchvision 0.12.0
- We uniformly sample 4/8/16 frames for `num_segments_L`, `num_segments_M` and `num_segments_H` during training, and use `num_segments_H` to specify the number of frames during inference (a sampling sketch follows this list).
- We enable Any-Frame-Inference for the 2D networks so that the model can be evaluated at frame counts that are not used in training.
- We use 1-clip 1-crop evaluation for the 2D networks at a resolution of 224x224 (see the preprocessing sketch after this list).
- `lambda_act` denotes the coefficient $\lambda$ in the loss function; we set it to 1 without further tuning the hyperparameter (a loss sketch follows this list).
- We train the 2D networks TSM and TEA on 2 NVIDIA Tesla V100 (32GB) cards, starting from ImageNet-pretrained models.
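For reference, a minimal sketch of uniform temporal sampling under the TSN-style center-of-segment convention (the convention and the helper name are assumptions, not taken from this repository); because the helper accepts any `num_segments`, the same routine also serves Any-Frame-Inference:

```python
import numpy as np

def uniform_sample_indices(num_frames, num_segments):
    # Split the video into `num_segments` equal-length segments and
    # take the center frame of each one (hypothetical helper).
    tick = num_frames / float(num_segments)
    return np.array([int(tick / 2.0 + tick * i) for i in range(num_segments)])

# Training uses three frame counts; inference uses num_segments_H.
num_segments_L, num_segments_M, num_segments_H = 4, 8, 16
print(uniform_sample_indices(60, num_segments_H))  # 16 indices in [0, 60)
```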
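A minimal preprocessing sketch for the 1-clip 1-crop protocol at 224x224; the shorter-side-to-256 rescale and the ImageNet normalization statistics are common defaults assumed here, not read from the code:

```python
import torchvision.transforms as T

eval_transform = T.Compose([
    T.Resize(256),      # scale the shorter side to 256 (assumed default)
    T.CenterCrop(224),  # single center crop -> 1-clip 1-crop evaluation
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet statistics
                std=[0.229, 0.224, 0.225]),
])
```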
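And a hedged sketch of where `lambda_act` enters the objective, assuming the total loss sums per-frame-count cross-entropy terms with a $\lambda$-weighted auxiliary term; the function and argument names are hypothetical, and the auxiliary term itself is defined by the paper, not this sketch:

```python
import torch.nn.functional as F

lambda_act = 1.0  # coefficient lambda; set to 1 without tuning

def total_loss(logits_L, logits_M, logits_H, aux_loss, target):
    # Cross-entropy for each frame-count branch plus the weighted
    # auxiliary term (hypothetical composition).
    ce = (F.cross_entropy(logits_L, target)
          + F.cross_entropy(logits_M, target)
          + F.cross_entropy(logits_H, target))
    return ce + lambda_act * aux_loss
```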
- Specify the directory of datasets with `ROOT_DATASET` in `ops/dataset_config.py` (a config sketch follows the commands below).
- Simply run the training scripts in `exp` as follows:
```bash
bash exp/tsm_sthv1/run.sh      ## baseline training
bash exp/tsm_sthv1_FFN/run.sh  ## FFN training
```
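For orientation, a sketch of the `ROOT_DATASET` edit, assuming `ops/dataset_config.py` follows the common TSM-style layout where per-dataset helpers join paths against a module-level root (the helper, file names, and paths below are placeholders):

```python
# ops/dataset_config.py (sketch; all paths are placeholders)
import os

ROOT_DATASET = '/path/to/your/datasets/'  # point this at your dataset root

def return_sthv1(modality):
    # Hypothetical per-dataset helper: builds annotation and frame
    # paths relative to ROOT_DATASET.
    filename_categories = os.path.join(ROOT_DATASET, 'sthv1/category.txt')
    root_data = os.path.join(ROOT_DATASET, 'sthv1/frames')
    return filename_categories, root_data
```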
- Specify the directory of datasets with `ROOT_DATASET` in `ops/dataset_config.py`.
- Please download the pretrained models from Google Drive.
- Specify the directory of the pretrained model with `resume` in `test.sh` (a checkpoint-loading sketch follows the commands below).
- Run the inference scripts in `exp` as follows:
```bash
bash exp/tsm_sthv1/test.sh      ## baseline inference
bash exp/tsm_sthv1_FFN/test.sh  ## FFN inference
```
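To illustrate what `resume` points at, a minimal checkpoint-loading sketch; the `state_dict` key, the checkpoint path, and the placeholder backbone are assumptions about a standard PyTorch checkpoint, not this repository's exact format:

```python
import torch
import torchvision

model = torchvision.models.resnet50()  # placeholder backbone (assumption)
# `resume` in test.sh would point at a file like this (placeholder path):
checkpoint = torch.load('/path/to/pretrained.pth.tar', map_location='cpu')
model.load_state_dict(checkpoint['state_dict'])  # key name is an assumption
model.eval()  # evaluation mode for inference
```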