Paper Summary: Very Deep Convolutional Networks for Large-Scale Image Recognition

Karan Uppal
8 min read · Jun 16, 2024


Simonyan, Karen, and Andrew Zisserman. arXiv preprint arXiv:1409.1556 (2014).

Link to original paper

The paper introduces the famous VGG architecture, whose deep features trained on ImageNet serve as a transfer-learning baseline for many applications. The key idea is the stacking of multiple 3x3 convolutions, which is shown to work better than single convolutions with larger windows.

1. Introduction

The paper explores the effect of network depth by fixing the other parameters of the architecture and steadily increasing depth through additional 3x3 convolutional layers. The authors tackle image classification as well as localisation (beating OverFeat).

2. ConvNet Configurations

The input to the model consists of 224x224 RGB images which undergo mean subtraction. Rather than using relatively large receptive fields in the convolutional layers, the authors use very small 3x3 receptive fields with stride 1 and padding 1, alongside 2x2 max pooling with stride 2. This stack of convolutional layers is followed by 3 fully connected layers to give the final classification.
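A minimal sketch of this layer pattern in PyTorch (vgg_block is a hypothetical helper, not code from the paper): 3x3 convolutions with stride 1 and padding 1 preserve spatial size, and the 2x2 max pool with stride 2 halves it.

```python
import torch
import torch.nn as nn

def vgg_block(in_channels: int, out_channels: int, num_convs: int) -> nn.Sequential:
    layers = []
    for i in range(num_convs):
        # 3x3 conv, stride 1, padding 1: spatial resolution is preserved
        layers.append(nn.Conv2d(in_channels if i == 0 else out_channels,
                                out_channels, kernel_size=3, stride=1, padding=1))
        layers.append(nn.ReLU(inplace=True))
    # 2x2 max pooling with stride 2: spatial resolution is halved
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

x = torch.randn(1, 3, 224, 224)       # mean-subtracted RGB input
print(vgg_block(3, 64, 2)(x).shape)   # torch.Size([1, 64, 112, 112])
```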

All the configurations used (A through E, ranging from 11 to 19 weight layers) are outlined in Table 1 of the paper. Local Response Normalisation from AlexNet is also used in one configuration to gauge the claim that it helps boost accuracy.

The authors also note that despite the large depth, the number of parameters in their networks is no greater than in shallower nets with larger convolutional widths and receptive fields, such as OverFeat.

The authors state that a stack of two 3x3 conv layers has an effective receptive field of 5x5, and three such layers have a 7x7 effective receptive field. Stacking injects more non-linearities into the network than a single layer with a larger receptive field, while using fewer parameters and fewer operations. In a way, it can be seen as imposing a regularisation on 7x7 convolutional filters, forcing them to decompose through 3x3 filters with non-linearities injected in between.
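A quick back-of-the-envelope check of the parameter argument, with C channels in and out and biases ignored as in the paper's own count:

```python
# n stacked 3x3 convs (stride 1) have a (2n+1) x (2n+1) effective receptive field,
# so three of them cover 7x7, yet use far fewer weights than a single 7x7 layer.
C = 256
stacked_3x3 = 3 * (3 * 3 * C * C)   # three 3x3 conv layers: 27 * C^2
single_7x7 = 7 * 7 * C * C          # one 7x7 conv layer:    49 * C^2
print(stacked_3x3, single_7x7)      # 1769472 3211264
print(single_7x7 / stacked_3x3)     # ~1.81, i.e. the 7x7 layer has 81% more parameters
```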

The incorporation of 1x1 convolutional layers is a way to increase non-linearity without affecting the receptive fields of the convolutional layers; such layers were also used previously in "Network in Network" and GoogLeNet.
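To see why, note that a 1x1 convolution is just a per-pixel linear projection across channels, so it leaves the spatial resolution (and hence the receptive field) untouched, while the ReLU after it adds an extra non-linearity. A tiny illustration:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 56, 56)
block = nn.Sequential(nn.Conv2d(256, 256, kernel_size=1), nn.ReLU(inplace=True))
print(block(x).shape)  # torch.Size([1, 256, 56, 56]): same spatial size as the input
```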

3. Classification Framework

The training procedure follows that of AlexNet, with one interesting observation: despite the larger depth and parameter count, the networks needed fewer epochs to converge, which the authors attribute to the implicit regularisation imposed by depth and small filters, and to the pre-initialisation of certain layers.

Network A was initialised randomly, and the deeper configurations were then initialised with the optimised weights of configuration A. The authors also note that after paper submission they found this pre-training unnecessary, since they could simply use Xavier (Glorot) initialisation.
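A sketch of that alternative in PyTorch (init_weights is a hypothetical helper): instead of warm-starting from net A, all conv and FC weights are drawn from a Xavier distribution.

```python
import torch.nn as nn

def init_weights(module: nn.Module) -> None:
    # Xavier/Glorot initialisation for conv and fully connected layers
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_normal_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model = nn.Sequential(nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU())
model.apply(init_weights)  # applies init_weights recursively to every submodule
```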

To obtain the fixed-size 224x224 input, each image was rescaled to training scale S and randomly cropped to the fixed size, after which the crop underwent further data augmentation. The authors consider two approaches for setting the training scale S.

  1. Fix S, which corresponds to single-scale training. The authors used S = 256 and S = 384 (the S = 384 network was initialised with the weights pretrained at S = 256 to speed up training).
  2. Multi-scale training, where each training image is rescaled by randomly sampling S from a range [S_min, S_max], with S_min = 256 and S_max = 512. Again, to speed up training, the model was initialised with the weights of the fixed S = 384 model. A minimal sketch of this scale jittering follows below.
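A sketch of the multi-scale pipeline using torchvision (jittered_crop is a hypothetical helper; single-scale training is the special case s_min == s_max):

```python
import random
import torchvision.transforms as T
import torchvision.transforms.functional as TF

def jittered_crop(img, s_min: int = 256, s_max: int = 512):
    S = random.randint(s_min, s_max)   # sample the training scale per image
    img = TF.resize(img, S)            # isotropic rescale: smallest side becomes S
    img = T.RandomHorizontalFlip()(img)
    return T.RandomCrop(224)(img)      # fixed 224x224 crop fed to the network
```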

At test time, the image is first rescaled to size Q (the test scale) and the network is then applied densely over the rescaled image: the fully connected layers are converted to convolutional layers, which yields a class score map with the number of channels equal to the number of classes. The class score map is then spatially averaged. Since the network is applied over the whole image, there is no need to sample multiple crops at test time, which would require a network pass per crop. However, the authors note that multi-crop evaluation is complementary to dense evaluation due to different convolution boundary conditions.
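A sketch of the FC-to-conv conversion under VGG's sizes (at a 224x224 input, the conv features before the first FC layer are 7x7x512, so that layer becomes a 7x7 convolution and the remaining two become 1x1 convolutions):

```python
import torch
import torch.nn as nn

fc_as_conv = nn.Sequential(
    nn.Conv2d(512, 4096, kernel_size=7), nn.ReLU(inplace=True),   # was FC-4096
    nn.Conv2d(4096, 4096, kernel_size=1), nn.ReLU(inplace=True),  # was FC-4096
    nn.Conv2d(4096, 1000, kernel_size=1),                         # was FC-1000
)

features = torch.randn(1, 512, 12, 12)      # conv features of a larger test image
score_map = fc_as_conv(features)            # (1, 1000, 6, 6) class score map
class_scores = score_map.mean(dim=(2, 3))   # spatial average -> (1, 1000)
```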

4. Classification Experiments

The authors present the image classification results achieved by their different configurations on the ImageNet dataset.

The authors begin by evaluating the performance of individual ConvNet models at a single scale. The first observation is that local response normalisation does not improve on model A without any normalisation layers, so the authors drop normalisation in the deeper architectures.

The second observation is that the classification error decreases with increased ConvNet depth. Interestingly, in spite of having the same depth, model C (which contains 1x1 conv layers) performs worse than model D (which uses 3x3 conv layers throughout). This indicates that while the additional non-linearity does help (C performs better than B), it is also important to capture spatial context (D is better than C). The error rate saturates at 19 layers; however, even deeper models might be beneficial for larger datasets.

Another interesting insight: a shallow net derived from B by replacing each pair of 3x3 conv layers with a single 5x5 conv layer had a top-1 error about 7% higher than B's, confirming the benefit of deep nets with small filters. The authors also note that scale jittering at training time leads to significantly better results than training on images with a fixed scale, even when a single scale is used at test time. This confirms that training-set augmentation by scale jittering is indeed helpful for capturing multi-scale image statistics.

The authors then experiment with scale jittering at test time. Their results indicate that combining scale jittering at training time with evaluation of the test image at multiple scales leads to the best performance.

Next, the authors compare dense ConvNet evaluation with multi-crop evaluation, and also assess their combination by averaging the softmax outputs. Multi-crop evaluation performs slightly better than dense evaluation, and the two approaches are indeed complementary, since their combination outperforms each of them.

The authors now combine the outputs of several models by averaging their softmax class posteriors, which boosts accuracy further. An ensemble of just their two best-performing models yields their best result.
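Both fusion steps reduce to the same operation, sketched here (fuse is a hypothetical helper): softmax outputs, whether from dense and multi-crop evaluation or from several models, are averaged before taking the top-5 predictions.

```python
import torch

def fuse(logits_list):
    # Average the softmax class posteriors of several predictors
    probs = [torch.softmax(l, dim=1) for l in logits_list]
    return torch.stack(probs).mean(dim=0)

logits_d = torch.randn(1, 1000)   # e.g. outputs of model D
logits_e = torch.randn(1, 1000)   # e.g. outputs of model E
top5 = fuse([logits_d, logits_e]).topk(5, dim=1).indices
```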

They also outperform the models that achieved the best results in ILSVRC-2012 and ILSVRC-2013, and achieve performance comparable to GoogLeNet.

5. Localisation

Localisation can be seen as a special case of object detection, where a single object bounding box should be predicted for each of the top-5 classes. The architecture used is configuration D, kept the same as before except that the last fully connected layer now predicts bounding box locations instead of class scores. This can be done in two ways (a sketch follows the list):

  1. Single-class regression, where the bounding box prediction is shared across all classes and the model outputs 4 numbers.
  2. Per-class regression, where the bounding box prediction is class-specific and the model outputs 4 x (number of classes) numbers.
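The difference is only in the size of the final layer, sketched here with the 4096-d penultimate features of configuration D:

```python
import torch.nn as nn

num_classes = 1000
scr_head = nn.Linear(4096, 4)                 # single-class regression: one shared box
pcr_head = nn.Linear(4096, 4 * num_classes)   # per-class regression: one box per class
```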

The model is trained similarly to the classification model and is tested both at a single scale and with dense application of the network. Dense application yields a set of bounding box predictions, which are merged using the post-processing technique of OverFeat: spatially close predictions are merged by averaging their coordinates and then rated based on the class scores from the classification network.

In initial experiments the authors determine that per-class regression works better, so all subsequent experiments use that scheme. They evaluate on the ILSVRC localisation dataset and win the 2014 challenge, using the above configuration with densely computed bounding box predictions.

6. Generalisation of Very Deep Features

Now the authors evaluate their ConvNets (Net-D and Net-E), pretrained on ImageNet, as feature extractors on other, smaller datasets. For transfer learning, the penultimate-layer features are used; they are aggregated across multiple locations and scales and then fed to a linear SVM classifier trained on the target dataset. As seen earlier, evaluation over multiple scales is beneficial, so the features are extracted over several scales Q. The resulting features can either be stacked (which allows a subsequent classifier to learn how to optimally combine image statistics over a range of scales) or pooled across scales (which is computationally cheaper).
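A sketch of the two aggregation options, with three hypothetical 4096-d descriptors standing in for features extracted at scales Q = 256, 384, 512:

```python
import numpy as np

feats = [np.random.randn(4096) for _ in (256, 384, 512)]  # one descriptor per scale Q

stacked = np.concatenate(feats)   # 12288-d: keeps scale-specific statistics
pooled = np.mean(feats, axis=0)   # 4096-d: cheaper, averages across scales
```

Either representation is then fed to a linear SVM trained on the target dataset, e.g. sklearn.svm.LinearSVC.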

The first experiment is on the image classification tasks of the PASCAL VOC 2007 and 2012 benchmarks. The authors find that aggregating image descriptors computed at multiple scales by averaging performs similarly to aggregation by stacking. They hypothesise that this is because objects in VOC appear over a variety of scales, so there are no scale-specific semantics for a classifier to exploit. Their method sets a new state of the art on the PASCAL VOC datasets.

Next, they experiment with the Caltech-101 and Caltech-256 image classification benchmarks. Unlike on VOC, stacking of descriptors computed over multiple scales performs better than averaging or max pooling. This can be explained by the fact that in Caltech, objects typically occupy the whole image, so features computed at different scales are semantically different, and stacking allows a classifier to exploit such scale-specific representations. Again, the model outperforms the state of the art.

In their last experiment, the authors evaluate their best-performing image representation (stacking of Net-D and Net-E features) on the PASCAL VOC 2012 action classification task, which consists of predicting an action class from a single image, given a bounding box of the person performing the action. The authors consider two training settings (sketched after the list):

  1. Computing the ConvNet features on the whole image, ignoring the provided bounding box
  2. Computing the features on the whole image and on the provided bounding box, and stacking them to obtain the final representation
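A sketch of the two settings; extract_features is a hypothetical placeholder for the multi-scale stacked Net-D/Net-E descriptor described above.

```python
import numpy as np

def extract_features(img: np.ndarray) -> np.ndarray:
    # Placeholder standing in for the multi-scale ConvNet descriptor
    return np.random.randn(4096)

image = np.zeros((384, 512, 3))
x0, y0, x1, y1 = 50, 50, 250, 300                # person bounding box corners

whole = extract_features(image)                  # setting 1: whole image only
box = extract_features(image[y0:y1, x0:x1])      # same extractor on the box crop
combined = np.concatenate([whole, box])          # setting 2: stacked representation
```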

Their representation achieves the state of the art on this dataset even without using the provided bounding boxes, and the results are further improved by the combination.

7. Conclusion

In this work, the authors apply deep convolutional networks and deep convolutional features to a variety of tasks and datasets, illustrating the importance of depth for learning good representations. They achieve state-of-the-art results on multiple datasets and introduce the idea of stacking multiple small convolutions.

8. Final Words

The paper is very well written and introduces one of the best-known and simplest architectures.

9. References

Simonyan, Karen, and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition.” arXiv preprint arXiv:1409.1556 (2014).
