Paper Summary: Visualizing and Understanding Convolutional Networks
Zeiler, Matthew D., and Rob Fergus. "Visualizing and Understanding Convolutional Networks." Computer Vision – ECCV 2014, Part I, Springer International Publishing, 2014.
This manuscript introduces a technique for visualizing intermediate activation maps, which the authors use as a diagnostic tool to develop a model architecture (now commonly known as ZFNet) that outperformed the then state-of-the-art on ImageNet. Published around the same time as the DeCAF paper, it also includes an ablation and feature-generalization study that demonstrates one of the first use cases of transfer learning.
1. Introduction
Visualizing a convolutional neural network beyond its first layer is difficult because the feature maps cannot be mapped directly back to pixel space. Previous visualization techniques therefore do not show which parts of an input image drive which feature maps. The authors introduce a visualization technique that reveals the input stimuli that excite individual feature maps at any given layer in the model, using a "Deconvolutional Network". Used as a diagnostic tool, this technique helps them develop better architectures. Lastly, they explore how well the learned convolutional features generalize to other datasets (Caltech-101, Caltech-256 and PASCAL VOC 2012).
2. Approach
Understanding the operation of a convolutional network requires interpreting the feature activity in its intermediate layers. The authors present a method to map these activities back to the input pixel space, showing what input pattern originally caused a given activation in the feature maps. This is done using a Deconvolutional Network, which can be thought of as a convolutional network run in reverse. It has three components:
1. Unpooling: The normal max pooling operation finds the maximum value within each pooling window and stores it as the output. This operation is non-invertible. However, we can record where each maximum was found (the "switches") and place the reconstructions from the layer above into those locations. This technique is often called "bed of nails" since all the other values are set to zero. A small code sketch covering this and the next two operations follows the list below.
2. Rectification: To obtain valid feature reconstructions at each layer, the reconstructed signals are passed through a ReLU function.
3. Filtering: The convolutional network uses learned filters to convolve the feature maps from the previous layers. To invert this, the deconvolutional network uses transposed versions of the same filters, applied to the rectified maps, to obtain the reconstruction.
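The snippet below is a minimal sketch of these three operations in PyTorch (not the authors' code); the tensor shapes, filter size, and padding are illustrative assumptions, but `max_pool2d(..., return_indices=True)` and `max_unpool2d` directly mirror the switch-based unpooling described above.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 64, 32, 32)        # feature maps from some layer (illustrative shape)
weight = torch.randn(64, 64, 3, 3)    # that layer's learned filters (illustrative shape)

# Forward pass: convolve, rectify, then max-pool while recording the "switch"
# locations of each maximum.
conv = F.relu(F.conv2d(x, weight, padding=1))
pooled, switches = F.max_pool2d(conv, kernel_size=2, return_indices=True)

# Deconvnet pass (top-down): unpool using the recorded switches ("bed of nails"),
# rectify, then apply the transposed version of the same filters.
unpooled = F.max_unpool2d(pooled, switches, kernel_size=2)
rectified = F.relu(unpooled)
reconstruction = F.conv_transpose2d(rectified, weight, padding=1)
```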
To start, an image is passed to the convolutional network and the features are computed through the layers. To examine a given activation, all the other activations in the layer are set to zero and the resulting feature maps are passed as input to the attached deconvolutional network. The operations above are then used to reconstruct the activity in the layer beneath that gave rise to the chosen activation, and this is repeated until pixel space is reached. A sketch of the masking step is shown below.
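A hedged sketch of that masking step, again assuming PyTorch tensors (the function name and shapes are mine, not the paper's):

```python
import torch

def isolate_strongest_activation(feats, channel):
    """Zero out everything except the single strongest activation in one feature map.
    The result is what gets fed to the deconvnet and projected back to pixel space.
    feats: (1, C, H, W) activations from some convnet layer; channel: index of the map."""
    masked = torch.zeros_like(feats)
    flat_idx = feats[0, channel].argmax().item()
    h, w = divmod(flat_idx, feats.shape[-1])
    masked[0, channel, h, w] = feats[0, channel, h, w]
    return masked
```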
Since the switch settings are specific to each input image, the reconstruction resembles a small piece of the original image, with structures weighted according to their contribution to the feature activation.
3. Training Details
The model architecture is based on AlexNet, with modifications made using the above visualization technique as a diagnostic tool. It is trained on ImageNet 2012 with stochastic gradient descent and several data augmentation techniques; the exact details can be found in the paper.
The authors state that visualizing the first layer filters during training revealed that a few of them dominate, as shown below. This leads to training instability and hinders balanced feature learning.
To combat this, they renormalize each filter in the convolutional layers whose RMS value exceeds a fixed radius back to that radius, which acts as a mild form of regularization.
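A minimal sketch of this renormalization for a PyTorch convolutional layer (the paper's fixed radius is 1e-1; everything else here, including the function name, is an illustrative assumption):

```python
import torch

def renormalize_filters(conv, radius=1e-1):
    """Rescale any filter whose RMS value exceeds the fixed radius back to that radius."""
    with torch.no_grad():
        w = conv.weight                                   # (out_channels, in_channels, kH, kW)
        rms = w.pow(2).mean(dim=(1, 2, 3), keepdim=True).sqrt()
        scale = torch.where(rms > radius, radius / rms, torch.ones_like(rms))
        w.mul_(scale)                                     # filters at or below the radius are untouched
```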
4. Convnet Visualization
Using the model described in the previous section, the authors perform feature visualization with the deconvolutional network. They visualize the top 9 activations of individual feature maps across the layers, project each one separately down to pixel space, and show the corresponding image patches alongside.
Here the maps seem to be mostly activated by texture (Row 4, Column 3), color (Row 3, Column 2) or edges (Row 4, Column 4).
Some structure is now being identified by the model, for example the text patterns in Row 2, Column 4.
Discriminative parts of objects can now be seen in many cases (for example, Layer 4, Row 1, Column 1). The technique's usefulness as a diagnostic tool is apparent in the case of Layer 5, Row 1, Column 2, where the patches don't seem to have anything in common, but the activation maps reveal that they are activated by the grass in the background.
The figure below visualizes the progression of the strongest activation within a given feature map, during 70 epochs of training. The lower layers of the model can be seen to converge within a few epochs. However, the upper layers only develop after a considerable number of epochs.
The figure below shows 5 sample images being translated, rotated and scaled by varying degrees while tracking the changes in the feature vectors from the top and bottom layers of the model, relative to the untransformed feature. The authors note that small transformations have a dramatic effect in the first layer of the model, but a lesser impact at the top feature layer. The figure also shows that the network output is stable to translation and scaling but is not invariant to rotation (except for objects with rotational symmetry).
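One way to reproduce this kind of analysis is sketched below, assuming a PyTorch feature extractor; the callable `model_layer`, the angle grid, and the Euclidean distance measure are illustrative choices, not the paper's exact protocol.

```python
import torch
import torchvision.transforms.functional as TF

def rotation_invariance_curve(model_layer, image, angles=range(0, 361, 30)):
    """Measure how far a layer's feature vector moves as the input is rotated.
    model_layer: callable mapping a (1, 3, H, W) tensor to a feature tensor.
    image: a (3, H, W) tensor, already preprocessed the way the model expects."""
    dists = []
    with torch.no_grad():
        base = model_layer(image.unsqueeze(0)).flatten()
        for angle in angles:
            rotated = TF.rotate(image, angle)
            feat = model_layer(rotated.unsqueeze(0)).flatten()
            dists.append(torch.dist(feat, base).item())   # distance to the untransformed feature
    return dists
```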
Using the visualization tool, the authors now improve the AlexNet model. They observe that the first layer filters are a mix of extremely high and low frequency information with little coverage of the mid frequencies, along with some "dead" filters. Additionally, the 2nd layer visualization shows aliasing artifacts caused by the large stride of the previous layer. To remedy this, they reduce the 1st layer filter size (from 11x11 to 7x7) and reduce the stride of the convolution (from 4 to 2). They later empirically demonstrate the improved performance of the modified model.
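In PyTorch terms, the change to the first layer looks roughly like the following; the filter sizes and strides are the paper's, while the padding values and channel count are assumptions for illustration.

```python
import torch.nn as nn

# AlexNet-style first layer: large 11x11 filters with a stride of 4.
alexnet_conv1 = nn.Conv2d(in_channels=3, out_channels=96, kernel_size=11, stride=4, padding=2)

# ZFNet-style first layer: smaller 7x7 filters and stride 2, which the authors found
# retains more mid-frequency information and reduces aliasing in the 2nd layer.
zfnet_conv1 = nn.Conv2d(in_channels=3, out_channels=96, kernel_size=7, stride=2, padding=1)
```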
With image classification models, a natural question is whether the model is truly localizing the object in the image or just using the surrounding context to predict the output (e.g., husky vs. wolf). The figure below addresses this by systematically occluding different portions of the input image with a grey square and monitoring the output of the classifier. The examples clearly demonstrate that the model is localizing the objects, as the probability of the correct class drops when the object is occluded.
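A minimal sketch of such an occlusion sweep for any PyTorch classifier (the patch size, stride, and grey value are illustrative assumptions, not the paper's exact settings):

```python
import torch
import torch.nn.functional as F

def occlusion_sensitivity(model, image, target_class, patch=50, stride=25, grey=0.5):
    """Slide a grey square over the image and record the correct-class probability.
    image: a (3, H, W) tensor already normalized the way `model` expects."""
    model.eval()
    _, H, W = image.shape
    heatmap = []
    with torch.no_grad():
        for top in range(0, H - patch + 1, stride):
            row = []
            for left in range(0, W - patch + 1, stride):
                occluded = image.clone()
                occluded[:, top:top + patch, left:left + patch] = grey
                probs = F.softmax(model(occluded.unsqueeze(0)), dim=1)
                row.append(probs[0, target_class].item())
            heatmap.append(row)
    return torch.tensor(heatmap)   # low values mark regions the classifier depends on
```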
Classical recognition approaches establish correspondences between specific object parts in the image (e.g., faces have a particular spatial configuration of the eyes and nose) to recognize objects. Because a neural network is a black box, it is difficult to tell whether such a correspondence is computed implicitly. The authors explore this by taking 5 random dog images with a frontal pose and systematically masking out the same part of the face in each image. They then measure how consistently the feature vectors change between the original and occluded versions across the images, as sketched below.
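The paper compares the sign patterns of the occlusion-induced feature changes across image pairs; the sketch below is a loose, assumption-laden version of that idea (the function name and the per-pair normalization are mine):

```python
import torch

def correspondence_score(feats_orig, feats_occluded):
    """Average pairwise disagreement between the sign patterns of the feature changes
    caused by occluding the same part in each image. Lower = more consistent response.
    feats_orig, feats_occluded: (N, D) tensors, one row per dog image."""
    eps = feats_orig - feats_occluded          # per-image change caused by the mask
    signs = torch.sign(eps)
    n = signs.shape[0]
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += (signs[i] != signs[j]).float().mean().item()
            pairs += 1
    return total / pairs
```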
The authors state that the lower (i.e., more consistent) score for the eyes and nose compared to random occlusions for layer 5 features shows that the model does establish some degree of correspondence. However, the values are comparable for layer 7, possibly because the upper layers try to discriminate between dog breeds and so rely on more than these specific facial parts.
5. Experiments
The authors evaluate both the original AlexNet architecture and their modified one on the ImageNet dataset. Their architecture, when combined with multiple models in an ensemble, achieved the best published performance on ImageNet at the time.
The authors then experiment with removing layers from their model. Removing the fully connected layers gives only a slight increase in error, which is surprising since they contain the majority of the model parameters. Removing two of the middle convolutional layers also makes a relatively small difference to the error rate. However, removing both leaves a model with only 4 layers, which performs far worse. The authors conclude that the overall depth of the model is important for obtaining good performance. Still, it could be that the 4-layer model simply needs more parameters: what if it were wider rather than deeper, or used more filters?
The authors then showcase the ability of their model to act as a feature extractor. They keep the model parameters frozen and train a softmax classifier on top using a different dataset (Caltech-101 and Caltech-256). This approach is compared to methods which utilize hand-crafted features, and it performs exceedingly well on both datasets. They even experiment with a "one-shot" learning paradigm, using just 6 Caltech-256 images per class to achieve performance comparable to methods that use 10 times as many images. These experiments showcase the power of the ImageNet-trained feature extractor.
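A modern sketch of this transfer-learning setup using torchvision (torchvision has no ZFNet, so AlexNet stands in for the paper's model here; the dataset and training loop are omitted):

```python
import torch.nn as nn
from torchvision import models

# Freeze an ImageNet-pretrained backbone and train only a new classifier on top.
backbone = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
for p in backbone.parameters():
    p.requires_grad = False

num_classes = 101  # e.g. Caltech-101
backbone.classifier[6] = nn.Linear(4096, num_classes)  # only this new layer is trained
```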
They perform the same experiment on PASCAL VOC 2012. Here they fall short of the leading method, the stated reason being that PASCAL images can contain multiple objects while their model provides only a single exclusive prediction, which may hamper the extracted features.
In another insightful experiment, the authors vary the number of layers retained from the ImageNet-trained model and place a linear SVM classifier on top. For both Caltech-101 and Caltech-256, performance improves steadily as they ascend the layers, indicating that the model learns increasingly powerful features as the feature hierarchy gets deeper.
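A sketch of such a layer-by-layer probe with scikit-learn (the feature-extraction step and variable names are assumed; only the linear-SVM-on-frozen-features idea comes from the paper):

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def probe_layers(features_by_layer, labels):
    """For each layer, fit a linear SVM on the frozen features and report
    cross-validated accuracy. features_by_layer maps layer name -> (n, d) array."""
    return {
        layer: cross_val_score(LinearSVC(max_iter=10000), X, labels, cv=5).mean()
        for layer, X in features_by_layer.items()
    }
```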
6. Discussion
The authors present a novel way to visualize activations within the model and show how these visualizations can be used to debug problems with the model and obtain better results. They perform several ablation studies which reveal insights regarding feature correspondence and invariance. The authors also show that the ImageNet-trained model generalizes well to other datasets, demonstrating one of the first examples of transfer learning.
7. Final Words
The paper is well-written and easy to follow. The many visualizations, though occasionally overwhelming, supplement the text well.
8. References
Zeiler, Matthew D., and Rob Fergus. "Visualizing and Understanding Convolutional Networks." Computer Vision – ECCV 2014, Part I, Springer International Publishing, 2014.