Problem
ILSVRC (ImageNet Large Scale Visual Recognition Challenge) - classifying over a million images into 1000 object classes.
Key points
- Authors demonstrate the importance of depth in CNNs for effective classification and other image recognition tasks.
- Unlike AlexNet, no local response normalization (LRN) layers are used; the authors found normalization did not improve performance while adding memory and compute cost.
- Preference for deeper networks with smaller filters over shallower networks with larger filters. A stack of 3 layers with 3x3 filters has an effective receptive field of 7x7 while using roughly 45% fewer parameters than a single 7x7 layer (27C^2 vs 49C^2 weights for C channels), and adds two extra non-linearities.
- Incorporate 1x1 convolutional layers purely to increase non-linearity without changing the receptive field. These perform better than nets without such layers, but worse than nets using 3x3 filters in their place.
- Deeper configurations are pre-initialized with weights from trained shallower nets to stabilize training.
- Two approaches for training:
  - Single scale - Images are rescaled so that the shorter side (S) is 256 or 384, and 224x224 patches are then cropped from them.
  - Multi scale - Images are rescaled so that the shorter side (S) is sampled from [256, 512], to better capture the multi-scale nature of objects (scale jittering).
- Two approaches for testing:
  - Single scale - The shorter side of the test image (Q) is set to Q = S if S is fixed, and Q = 0.5 * (Smin + Smax) when multi-scale training is used.
  - Multi scale - Q = {S - 32, S, S + 32} if S is fixed, else Q = {Smin, 0.5 * (Smin + Smax), Smax}. Gives the best results.
- Dense and multi-crop evaluation:
  - Dense - The fully connected layers are converted to convolutional layers, making the net fully convolutional, and the uncropped image is passed through. Scores are then averaged over the image and its horizontal flip.
  - Multi-crop - Multiple crops of the test image are taken and their scores averaged.
  - Combining both works best.
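The 3x3-vs-7x7 parameter comparison above can be checked with quick arithmetic. This is a minimal sketch (function names and the channel count are illustrative, and biases are ignored):

```python
# Parameters of a stack of three 3x3 conv layers vs a single 7x7 layer.
# Both cover a 7x7 effective receptive field; C channels in and out.

def stacked_3x3_params(c, depth=3):
    # each layer contributes 3*3*C*C weights
    return depth * 3 * 3 * c * c

def single_7x7_params(c):
    return 7 * 7 * c * c

c = 256  # example channel count
stacked = stacked_3x3_params(c)  # 27 * C^2
single = single_7x7_params(c)    # 49 * C^2
saving = 1 - stacked / single
print(stacked, single, round(saving, 3))  # → 1769472 3211264 0.449
```

The ratio 27/49 is independent of C, so the ~45% saving holds at every layer width.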
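The test-scale rules listed above can be written as small helpers. A sketch under the stated rules; the function names and tuple return types are my own, not from the paper:

```python
# Choose the test scale(s) Q from the training scale setting.
# Pass only s_min for fixed-S training; pass s_min and s_max for
# scale-jittered training with S sampled from [s_min, s_max].

def single_scale_q(s_min, s_max=None):
    # fixed S -> Q = S; jittered S -> midpoint of the training range
    if s_max is None:
        return s_min
    return 0.5 * (s_min + s_max)

def multi_scale_q(s_min, s_max=None):
    # fixed S -> {S - 32, S, S + 32}; jittered -> {Smin, midpoint, Smax}
    if s_max is None:
        return (s_min - 32, s_min, s_min + 32)
    return (s_min, 0.5 * (s_min + s_max), s_max)

print(single_scale_q(384))       # → 384
print(single_scale_q(256, 512))  # → 384.0
print(multi_scale_q(256))        # → (224, 256, 288)
print(multi_scale_q(256, 512))   # → (256, 384.0, 512)
```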
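The score averaging used in dense and multi-crop evaluation can be sketched in pure Python. The toy score lists below stand in for real softmax outputs; in practice each list would come from a forward pass over one view (a crop, the uncropped image, or its flip):

```python
# Average class scores over multiple test views (crops, flips, scales).

def average_scores(score_lists):
    n = len(score_lists)
    return [sum(scores[i] for scores in score_lists) / n
            for i in range(len(score_lists[0]))]

# e.g. a dense pass on the full image plus its horizontal flip
image_scores = [0.75, 0.25, 0.0]
flip_scores = [0.25, 0.25, 0.5]
print(average_scores([image_scores, flip_scores]))  # → [0.5, 0.25, 0.25]
```

Combining dense and multi-crop evaluation simply means including both kinds of views in the list being averaged.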
Results
- 2nd in the ILSVRC 2014 classification task, behind GoogLeNet.
- Best performance among single networks: a single VGG net outperforms a single GoogLeNet when ensembles are excluded.
- 1st in the ILSVRC 2014 localization task.
- Learned VGG representations generalize well to other datasets and tasks.
Notes
- Simple to understand network architecture, with pre-trained models being able to generalize well on other datasets.
- Very heavy model, with around 138M parameters in VGG-16, which makes it costly for real-world deployment.