This is a PyTorch module that extracts features in parallel on any number of GPUs. So far, I3D and VGGish features are supported.
This repository contains some changes that make it possible to run the dependencies for the BMT repository without a GPU with 8 GB of VRAM (which is otherwise needed for I3D feature extraction). For this purpose, we introduce the --nocuda flag, which runs everything on the CPU.
Let's first consider the pros and cons of running with the --nocuda flag.
Pros:
- Allows running the feature extraction without the need for a powerful GPU.
Cons:
- The implementation introduces a third-party dependency.
- The CrossCorrelationSampler implementation that is used produces slightly different outputs w.r.t. the original implementation.
- The CPU-based extraction is not particularly fast: on an NVIDIA K80 (the Google Colab GPU), a 20-second video is processed in 20 seconds with I3D, while the CPU counterpart takes roughly 167 seconds for an 18-second video.
Creating a virtual environment following the commands below should allow you to start extracting features using only the CPU.
python -m venv venv
source venv/bin/activate
pip install -U -r cpu_requirements.txt
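As an optional sanity check (not part of this repository), you can verify that PyTorch imports correctly in the CPU-only environment and that CUDA is indeed not being picked up:
python -c "import torch; print(torch.__version__, 'CUDA available:', torch.cuda.is_available())"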
You may need to install the third-party dependency by Clement Pinard manually. To do so, run the following commands, or check the repository for installation instructions.
git clone https://github.com/ClementPinard/Pytorch-Correlation-extension.git
cd Pytorch-Correlation-extension
python setup.py install
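A quick way to verify that the install succeeded (assuming the package installs under the module name spatial_correlation_sampler, as in the upstream repository) is:
python -c "import spatial_correlation_sampler; print('correlation sampler OK')"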
After that has completed, you can run the feature extraction as described below, but on the CPU only. Happy hacking!
I3D (optical flow)
python main.py --feature_type i3d --file_with_video_paths ./sample/sample_video_paths.txt --nocuda
VGGish (audio)
Make sure to set up the PyTorch network as described in the section Setup the Environment for VGGish (Pytorch) below.
python main.py --feature_type vggish --file_with_video_paths ./sample/sample_video_paths.txt --nocuda
Please note that this implementation uses PWC-Net instead of the TV-L1 algorithm, which was used in the original I3D paper, as PWC-Net is much faster. However, one may create a Pull Request implementing TV-L1 as an option for forming the optical flow frames.
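For anyone who wants to attempt such a Pull Request, below is a minimal sketch of computing TV-L1 optical flow with OpenCV. This is only an illustration and not part of this repository; it assumes opencv-contrib-python is installed (which provides the cv2.optflow module) and uses one of the sample videos as input.

import cv2

# Minimal TV-L1 optical flow sketch (assumes opencv-contrib-python is installed).
tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()

cap = cv2.VideoCapture('./sample/v_ZNVhz7ctTq0.mp4')
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

flows = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # flow has shape (H, W, 2): per-pixel horizontal and vertical displacement
    flow = tvl1.calc(prev_gray, gray, None)
    flows.append(flow)
    prev_gray = gray
cap.release()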
Set up the conda environment. Requirements are in the file conda_env_i3d.yml.
# it will create a new conda environment called 'i3d' on your machine
conda env create -f conda_env_i3d.yml
conda activate i3d
The following command will extract I3D features for the sample videos using the 0th and 2nd devices in parallel. The features are extracted with the default parameters. Check out python main.py --help for the available options.
python main.py --feature_type i3d --device_ids 0 2 --video_paths ./sample/v_ZNVhz7ctTq0.mp4 ./sample/v_GGSY1Qvo990.mp4
The video paths can also be specified in a .txt file:
python main.py --feature_type i3d --device_ids 0 2 --file_with_video_paths ./sample/sample_video_paths.txt
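For illustration, we assume the file contains one video path per line (check ./sample/sample_video_paths.txt for the exact expected format). A hypothetical example with the two sample videos:
printf "./sample/v_ZNVhz7ctTq0.mp4\n./sample/v_GGSY1Qvo990.mp4\n" > my_video_paths.txt
python main.py --feature_type i3d --device_ids 0 2 --file_with_video_paths ./my_video_paths.txt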
The features can be saved as numpy arrays by specifying --on_extraction save_numpy. By default, it will create a folder ./output and store the features there.
python main.py --feature_type i3d --device_ids 0 2 --on_extraction save_numpy --file_with_video_paths ./sample/sample_video_paths.txt
You can change the output folder using the --output_path argument.
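Once extraction has finished, the saved arrays can be inspected with numpy. The file names below are an assumption based on the sample video name; check the contents of ./output (or your --output_path) for the actual naming scheme.

import numpy as np

# Hypothetical file names; inspect ./output for the actual ones.
rgb = np.load('./output/v_ZNVhz7ctTq0_rgb.npy')
flow = np.load('./output/v_ZNVhz7ctTq0_flow.npy')
print(rgb.shape, flow.shape)  # I3D features are typically 1024-dimensional per stream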
Also, you may want to try changing the I3D window and step sizes:
python main.py --feature_type i3d --device_ids 0 2 --stack_size 24 --step_size 24 --file_with_video_paths ./sample/sample_video_paths.txt
By default, the frames are extracted at the original fps of the video. If you would like to extract frames at a certain fps, specify the --extraction_fps argument:
python main.py --feature_type i3d --device_ids 0 2 --extraction_fps 25 --stack_size 24 --step_size 24 --file_with_video_paths ./sample/sample_video_paths.txt
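As a rough back-of-the-envelope illustration (not part of main.py), the settings above translate into the following amount of video per feature:
# with --extraction_fps 25, --stack_size 24 and --step_size 24
fps, stack_size, step_size = 25, 24, 24
print(stack_size / fps)  # 0.96 s of video per I3D feature window
print(step_size / fps)   # 0.96 s between window starts, i.e. no overlap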
If --keep_frames is specified, the extracted frames are kept in --tmp_path, which is ./tmp by default. Be careful with the --keep_frames argument when playing with --extraction_fps, as it may mix up frames you extracted earlier in the same folder.
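If you want to be on the safe side, you can simply clear the default temporary folder between runs with different --extraction_fps values (assuming you have not changed --tmp_path):
rm -rf ./tmp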
- An implementation of PWC-Net in PyTorch: https://github.com/sniklaus/pytorch-pwc
- A port of I3D weights from TensorFlow to PyTorch: https://github.com/hassony2/kinetics_i3d_pytorch
- The I3D paper: Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset.
Everything is under MIT except for the PWC-Net implementation used in I3D. Please read the PWC-Net implementation's license (last time I checked, it was GPL-3.0).
VGGish features can be extracted either with the wrapped (original) TensorFlow version or with a PyTorch re-implementation that uses a ported version of the model.
Set up the conda environment. Requirements are in the file conda_env_vggish_pytorch.yml.
# it will create a new conda environment called 'vggish' on your machine
conda env create -f conda_env_vggish_pytorch.yml
conda activate vggish
To download the models, follow the instructions in the submodule's README, or generate your own version.
The only difference during execution w.r.t. the original implementation is that the --pytorch flag must be set; otherwise, the TensorFlow implementation is used.
python main.py --pytorch --feature_type vggish --device_ids 0 2 --video_paths ./sample/v_ZNVhz7ctTq0.mp4 ./sample/v_GGSY1Qvo990.mp4
See python main.py --help for more arguments and I3D examples.
Set up the conda environment. Requirements are in the file conda_env_vggish.yml.
# it will create a new conda environment called 'vggish' on your machine
conda env create -f conda_env_vggish.yml
conda activate vggish
# download the pre-trained VGGish model; wget will put the file in the checkpoints directory
wget https://storage.googleapis.com/audioset/vggish_model.ckpt -P ./models/vggish/checkpoints
python main.py --feature_type vggish --device_ids 0 2 --video_paths ./sample/v_ZNVhz7ctTq0.mp4 ./sample/v_GGSY1Qvo990.mp4
See python main.py --help for more arguments and I3D examples.
- The TensorFlow implementation.
- The VGGish paper: CNN Architectures for Large-Scale Audio Classification.
My code (this wrapping) is under MIT, but the TensorFlow implementation complies with the tensorflow license, which is Apache-2.0.