V2L-MSVD

Generating video descriptions using deep learning in Keras

Start with the AWS Ubuntu Deep Learning AMI on an EC2 p2.xlarge instance or better. (p2.xlarge costs $0.9/hour on-demand and ~$0.3/hour as a spot instance.)

source activate tensorflow_p27
conda install scikit-learn
conda install scikit-image

If you are not using AWS, make sure a recent version of Keras and TensorFlow is installed and working. To train tag prediction models, also install scikit-learn and scikit-image.

git clone https://github.com/rohit-gupta/V2L-MSVD.git
cd V2L-MSVD

Using a pre-trained video captioning model

Use a video from YouTube

bash fetch-pretrained-model.sh
sudo bash install-youtube-dl.sh
bash fetch-youtube-video.sh https://www.youtube.com/watch?v=cKWuNQAy2Sk
bash process-youtube-video.sh

Use a video from your local disk

bash fetch-pretrained-model.sh
bash fetch-from-localpath.sh /home/ubuntu/vid1.mp4
bash process-youtube-video.sh
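
The two workflows above differ only in the fetch step. The following is a minimal sketch in Python 3 that picks the command sequence for a given input. The helper name `caption_commands` and the rule "anything starting with http:// or https:// is a YouTube URL" are assumptions made for illustration; the script names are the ones listed above.

```python
import shlex


def caption_commands(source):
    """Return the shell commands listed above for a URL or a local path.

    Hypothetical helper: input starting with http:// or https:// is
    treated as a YouTube URL, anything else as a local file path.
    """
    commands = ["bash fetch-pretrained-model.sh"]
    # Quote the argument so characters such as "?" are passed through literally.
    quoted = shlex.quote(source)
    if source.startswith(("http://", "https://")):
        commands.append("sudo bash install-youtube-dl.sh")
        commands.append("bash fetch-youtube-video.sh " + quoted)
    else:
        commands.append("bash fetch-from-localpath.sh " + quoted)
    # Both workflows end with the same processing script.
    commands.append("bash process-youtube-video.sh")
    return commands
```

Printing the result of `caption_commands("/home/ubuntu/vid1.mp4")` line by line gives the three commands of the local-disk workflow.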

Training your own video captioning model

Download data: should take about 2 minutes

bash fetch-data.sh

Preprocess text data: ETA ~5 minutes

If you only want to use verified descriptions:

bash preprocess-data.sh CleanOnly

If you want to use both verified and unverified descriptions:

bash preprocess-data.sh

Extract frames from the Videos: ETA ~30 minutes

bash extract_frames.sh

Extract Video Features: ETA ~15 Minutes

bash run-feature-extractor.sh

Tag Model: ETA ~5 Minutes

bash train-simple-tag-prediction-model.sh

Train Language Model: ETA ~50 minutes (can be stopped after ~25 minutes, once 5 epochs have completed)

bash train-language-model.sh

Score Language Model: ETA ~5 minutes

bash score-language-model.sh
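
For planning an instance, the ETAs above can be added up and converted into a rough cost. This is a sketch only: it assumes the steps run back to back on a p2.xlarge at the prices quoted at the top of this README, and it ignores setup and idle time. The names `STEP_ETA_MINUTES`, `total_minutes` and `estimated_cost` are made up for this example.

```python
# Step ETAs in minutes, copied from the headings above.
STEP_ETA_MINUTES = [
    ("fetch-data.sh", 2),
    ("preprocess-data.sh", 5),
    ("extract_frames.sh", 30),
    ("run-feature-extractor.sh", 15),
    ("train-simple-tag-prediction-model.sh", 5),
    ("train-language-model.sh", 50),
    ("score-language-model.sh", 5),
]


def total_minutes(steps):
    """Sum the ETA of every (script, minutes) pair."""
    return sum(minutes for _, minutes in steps)


def estimated_cost(minutes, dollars_per_hour):
    """Convert a runtime in minutes into dollars, rounded to cents."""
    return round(minutes / 60.0 * dollars_per_hour, 2)


if __name__ == "__main__":
    minutes = total_minutes(STEP_ETA_MINUTES)
    print("total ETA: %d minutes" % minutes)
    print("on-demand: $%.2f" % estimated_cost(minutes, 0.9))
    print("spot:      $%.2f" % estimated_cost(minutes, 0.3))
```

With the ETAs as listed, the full run comes to about 112 minutes, so stopping language model training after 5 epochs saves roughly a quarter of the total.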

Known Issues

If at any stage you get an error that contains

/lib/libstdc++.so.6: version `CXXABI_1.3.x' not found

You can fix it with:

cd ~/anaconda3/envs/tensorflow_p27/lib && mkdir -p stdcpp_bkp && mv libstdc++.a stdcpp_bkp/ && mv libstdc++.so stdcpp_bkp/ && mv libstdc++.so.6 stdcpp_bkp/ && mv libstdc++.so.6.0.19 stdcpp_bkp/ && mv libstdc++.so.6.0.19-gdb.py stdcpp_bkp/ && mv libstdc++.so.6.0.21 stdcpp_bkp/ && mv libstdc++.so.6.0.24 stdcpp_bkp/ && cd -
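
The one-line command above stops at the first file that does not exist, which can happen if your environment ships a different set of libstdc++ versions. As a hedged alternative, the sketch below moves the same files into the same backup directory but skips any that are missing. The function name `backup_libstdcpp` is hypothetical; the file names and the `stdcpp_bkp` directory are taken from the command above.

```python
import os
import shutil

# File names copied from the shell command above.
LIBSTDCPP_FILES = [
    "libstdc++.a",
    "libstdc++.so",
    "libstdc++.so.6",
    "libstdc++.so.6.0.19",
    "libstdc++.so.6.0.19-gdb.py",
    "libstdc++.so.6.0.21",
    "libstdc++.so.6.0.24",
]


def backup_libstdcpp(lib_dir, names=LIBSTDCPP_FILES, backup_name="stdcpp_bkp"):
    """Move the listed files from lib_dir into lib_dir/backup_name.

    Files that are not present are skipped. Returns the names that
    were actually moved, in the order they were processed.
    """
    lib_dir = os.path.expanduser(lib_dir)
    backup_dir = os.path.join(lib_dir, backup_name)
    if not os.path.isdir(backup_dir):
        os.makedirs(backup_dir)
    moved = []
    for name in names:
        source = os.path.join(lib_dir, name)
        # lexists is also true for symlinks whose target was already moved.
        if os.path.lexists(source):
            shutil.move(source, os.path.join(backup_dir, name))
            moved.append(name)
    return moved
```

For the setup described in this README it would be called as `backup_libstdcpp("~/anaconda3/envs/tensorflow_p27/lib")`.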

TensorFlow 1.3 has a memory leak bug that might affect this code.

You can fix it by upgrading TensorFlow.

Reference for this problem: #3
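
To see whether your installation falls in the affected series, you can inspect the version string. The check below is a small sketch; the function name `is_tf_1_3` is made up, and it only flags the 1.3 series named above, since this README makes no claim about other versions.

```python
def is_tf_1_3(version):
    """Return True if a version string such as "1.3.0" is in the 1.3 series."""
    parts = version.split(".")
    # Compare major and minor as strings so "1.13.1" is not mistaken for 1.3.
    return parts[:2] == ["1", "3"]


# Typical use, assuming TensorFlow is importable:
#   import tensorflow as tf
#   if is_tf_1_3(tf.__version__):
#       print("Consider upgrading TensorFlow")
```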

Results

The video captioning model here uses Mean Pooled ResNet50 features of video frames along with Object, Action and Attribute tags predicted by a simple feedforward network.
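
Mean pooling reduces a variable number of per-frame feature vectors to one fixed-size vector per video by averaging each feature dimension over the frames. The sketch below illustrates the idea in plain Python with toy 2-dimensional vectors; it is not the code used in this repository, and the name `mean_pool` is chosen for the example.

```python
def mean_pool(frame_features):
    """Average per-frame feature vectors into one video-level vector.

    frame_features is a list with one equal-length list of floats per frame.
    """
    if not frame_features:
        raise ValueError("need at least one frame")
    dim = len(frame_features[0])
    if any(len(frame) != dim for frame in frame_features):
        raise ValueError("all frames must have the same feature dimension")
    num_frames = float(len(frame_features))
    # zip(*...) groups the values of each feature dimension across frames.
    return [sum(column) / num_frames for column in zip(*frame_features)]


if __name__ == "__main__":
    # Two frames with toy 2-dimensional features.
    print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))
```

The pooled vector has the same length regardless of how many frames the video contains, which is what lets a fixed-size model consume videos of different durations.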

The table below compares the performance of our model with other models that also rely on mean-pooled frame features. It is sourced from papers 1, 2 and 3.

Model	METEOR score on MSVD
Mean Pooled (AlexNet Features)	26.9
Mean Pooled (VGG Features)	27.7
Mean Pooled (GoogleNet Features)	28.7
Ours (Mean Pooled ResNet50 Features + Predicted Tags)	29.0

