Problem
ILSVRC (ImageNet Large Scale Visual Recognition Challenge) - classifying over a million images into 1000 object classes.
Key points
- Authors demonstrate the importance of depth in CNNs for effective classification and other image recognition tasks.
- Unlike AlexNet, no local response normalization (LRN) layers are used; the authors found normalization did not improve performance while adding memory and compute cost.
- Preference for deeper networks with smaller filters over shallower networks with larger filters. A stack of 3 layers with 3x3 filters has an effective receptive field of 7x7 while using roughly 45% fewer parameters than a single 7x7 layer (27C^2 vs 49C^2 weights for C channels), and adds two extra non-linearities.
- Incorporate 1x1 convolutional layers purely to increase non-linearity without changing the receptive field. These perform better than nets without such layers, but worse than nets using 3x3 filters in their place.
- Deeper configurations are pre-initialized with weights from trained shallower nets to stabilize training.
- Two approaches for training:
  - Single scale - Images are rescaled so that the shorter side (S) is 256 or 384, and 224x224 patches are then cropped from them.
  - Multi scale - Images are rescaled so that the shorter side (S) is sampled from [256, 512], to better capture the multi-scale nature of objects (scale jittering).
- Two approaches for testing:
  - Single scale - The shorter side of the test image (Q) is set to Q = S if S is fixed, and Q = 0.5 * (Smin + Smax) when multi-scale training is used.
  - Multi scale - Q = {S - 32, S, S + 32} if S is fixed, else Q = {Smin, 0.5 * (Smin + Smax), Smax}. Gives the best results.
- Dense and multi-crop evaluation:
  - Dense - The fully connected layers are converted to convolutional layers, making the net fully convolutional, and the uncropped image is passed through. Scores are then averaged over the image and its horizontal flip.
  - Multi-crop - Multiple crops of the test image are taken and their scores averaged.
  - Combining both works best.
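The 3x3-vs-7x7 parameter comparison above can be checked with quick arithmetic. This is a minimal sketch (function names and the channel count are illustrative, and biases are ignored):

```python
# Parameters of a stack of three 3x3 conv layers vs a single 7x7 layer.
# Both cover a 7x7 effective receptive field; C channels in and out.

def stacked_3x3_params(c, depth=3):
    # each layer contributes 3*3*C*C weights
    return depth * 3 * 3 * c * c

def single_7x7_params(c):
    return 7 * 7 * c * c

c = 256  # example channel count
stacked = stacked_3x3_params(c)  # 27 * C^2
single = single_7x7_params(c)    # 49 * C^2
saving = 1 - stacked / single
print(stacked, single, round(saving, 3))  # → 1769472 3211264 0.449
```

The ratio 27/49 is independent of C, so the ~45% saving holds at every layer width.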
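The test-scale rules listed above can be written as small helpers. A sketch under the stated rules; the function names and tuple return types are my own, not from the paper:

```python
# Choose the test scale(s) Q from the training scale setting.
# Pass only s_min for fixed-S training; pass s_min and s_max for
# scale-jittered training with S sampled from [s_min, s_max].

def single_scale_q(s_min, s_max=None):
    # fixed S -> Q = S; jittered S -> midpoint of the training range
    if s_max is None:
        return s_min
    return 0.5 * (s_min + s_max)

def multi_scale_q(s_min, s_max=None):
    # fixed S -> {S - 32, S, S + 32}; jittered -> {Smin, midpoint, Smax}
    if s_max is None:
        return (s_min - 32, s_min, s_min + 32)
    return (s_min, 0.5 * (s_min + s_max), s_max)

print(single_scale_q(384))       # → 384
print(single_scale_q(256, 512))  # → 384.0
print(multi_scale_q(256))        # → (224, 256, 288)
print(multi_scale_q(256, 512))   # → (256, 384.0, 512)
```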
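The score averaging used in dense and multi-crop evaluation can be sketched in pure Python. The toy score lists below stand in for real softmax outputs; in practice each list would come from a forward pass over one view (a crop, the uncropped image, or its flip):

```python
# Average class scores over multiple test views (crops, flips, scales).

def average_scores(score_lists):
    n = len(score_lists)
    return [sum(scores[i] for scores in score_lists) / n
            for i in range(len(score_lists[0]))]

# e.g. a dense pass on the full image plus its horizontal flip
image_scores = [0.75, 0.25, 0.0]
flip_scores = [0.25, 0.25, 0.5]
print(average_scores([image_scores, flip_scores]))  # → [0.5, 0.25, 0.25]
```

Combining dense and multi-crop evaluation simply means including both kinds of views in the list being averaged.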
Results
- 2nd in the ILSVRC 2014 classification task, behind GoogLeNet.
- Best performance among single networks: a single VGG net outperforms a single GoogLeNet when ensembles are excluded.
- 1st in the ILSVRC 2014 localization task.
- Learned VGG representations generalize well to other datasets and tasks.
Notes
- Simple to understand network architecture, with pre-trained models being able to generalize well on other datasets.
- Very heavy model, with around 138M parameters in VGG-16, which makes it costly for real-world deployment.