Paper Summary: DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition

Karan Uppal
7 min read · Jun 26, 2023


Donahue, Jeff, et al. "DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition." International Conference on Machine Learning. PMLR, 2014.

Link to original paper

DeCAF is arguably one of the first papers to demonstrate transfer learning with CNNs, showing how features from a pretrained network can be used effectively for a variety of visual recognition tasks. The paper investigates how well features extracted from a pretrained CNN transfer to novel generic tasks and achieves state-of-the-art performance on most of them.

1. Introduction

The paper starts off by stating that deep or layered compositional architectures should be able to capture the salient aspects of a given domain through learned feature discovery. Such models have been able to outperform traditional hand-engineered representations, like HOG, in many domains. It then discusses AlexNet and how such models perform extremely well in domains with large amounts of training data; with limited training data, however, such architectures will generally overfit dramatically. Thus, the authors plan to investigate the following:

Their goal is to verify whether visual features based on convolutional network weights trained on ImageNet outperform a host of conventional visual representations on standard benchmark tasks, such as object recognition, domain adaptation, subcategory recognition, and scene recognition.

2. Deep Convolutional Activation Features

The authors train AlexNet (in a similar fashion to the original paper), extract features from various layers of this network, and evaluate the efficacy of these features on generic vision tasks. Their main goal is to answer the following two questions:

  • Do features extracted from the CNN generalize to other datasets?
  • How does their performance vary with the depth of the layer they are extracted from?

They then visually compare the features obtained from DeCAF with current state-of-the-art methods. They run the t-SNE algorithm to find a 2-dimensional embedding of the high-dimensional feature space and plot the points, coloured according to their semantic category within a particular hierarchy.
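As a rough illustration (not the authors' code), here is how one might produce this kind of plot with scikit-learn's t-SNE; the features and labels below are random placeholders standing in for real CNN activations and their semantic categories:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholders: in the paper, `feats` would be high-dimensional CNN
# activations and `labels` the semantic category of each image.
rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 4096))
labels = rng.integers(0, 5, size=500)

# Embed into 2D and colour each point by its semantic category.
emb = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(feats)
plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=5, cmap="tab10")
plt.title("t-SNE embedding coloured by semantic category")
plt.show()
```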

They first visualize the semantic segregation of the model by plotting the embedding of labels for higher levels of the WordNet hierarchy, for example, indoor and outdoor instances, for the validation set of ImageNet.

They note that features extracted from the validation set using the first pooling layer show no clear semantic clustering, whereas features from the second-to-last fully connected layer do. This matches the now-common deep learning intuition that early layers learn low-level features while later layers learn higher-level, more semantic ones.

But this experiment used a dataset the model was already trained on. To analyze what would happen on a different dataset, the authors turn to the SUN-397 dataset.

Even in this case, they note that the features show very good clustering of semantic classes and state that these features are an excellent starting point for generalizing to unseen classes.

The authors then state that a detailed analysis of the computation time across the network's layers is still missing in the literature, and to fill that gap they perform a timing analysis of the AlexNet model. They observe that the convolutional and fully connected layers take most of the time to run, with the final fully connected layers requiring the most computation because they involve large transformation matrices.
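Per-layer timings of this kind can be gathered with PyTorch forward hooks; this is a hedged sketch using torchvision's AlexNet (an assumption on my part; the original work used a Caffe-era implementation):

```python
import time
import torch
from torchvision import models

model = models.alexnet(weights=None).eval()
timings = {}

def attach_timer(name, module):
    """Record wall-clock time spent in one module's forward pass."""
    def pre(mod, inputs):
        timings[name] = time.perf_counter()
    def post(mod, inputs, output):
        timings[name] = time.perf_counter() - timings[name]
    module.register_forward_pre_hook(pre)
    module.register_forward_hook(post)

for name, module in model.named_modules():
    if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):
        attach_timer(name, module)

with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))

# Print the most expensive layers first.
for name, t in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {t * 1e3:.2f} ms")
```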

3. Experiments

In each of the experiments, they take the activations of the nth hidden layer of the CNN as a feature, denoted DeCAFn. For example, DeCAF7 denotes features taken from the final hidden layer, DeCAF6 the activations of the layer before it, and so on. They evaluate only DeCAF5, DeCAF6 and DeCAF7, as the earlier layers are unlikely to contain a richer semantic representation than the later ones. They present results on multiple datasets to evaluate the strength of DeCAF for basic object recognition, domain adaptation, fine-grained recognition, and scene recognition.
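As a minimal sketch of this layer-indexed extraction: DeCAF5, DeCAF6 and DeCAF7 map roughly onto the last pooling layer and the two fully connected layers of torchvision's AlexNet (the exact correspondence is my assumption, not the paper's specification):

```python
import torch
from torchvision import models

model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()
acts = {}

def save_to(name):
    def hook(module, inputs, output):
        acts[name] = output.flatten(1).detach()
    return hook

# Rough mapping of the paper's names onto torchvision's AlexNet:
# DeCAF5 ~ last pooled conv activations, DeCAF6/7 ~ the two fc layers.
model.features[12].register_forward_hook(save_to("DeCAF5"))
model.classifier[1].register_forward_hook(save_to("DeCAF6"))
model.classifier[4].register_forward_hook(save_to("DeCAF7"))

with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))

for name, feat in acts.items():
    print(name, tuple(feat.shape))  # e.g. DeCAF6 -> (1, 4096)
```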

Making use of the Caltech-101 dataset, a logistic regression or support vector machine is trained on a random set of samples per class and tested on the rest of the data, with parameters cross-validated for each split. The top-performing method trains a linear SVM on DeCAF6 with dropout, while the DeCAF5 features perform substantially worse than both the DeCAF6 and DeCAF7 features. Notably, DeCAF7 features generally score about 1–2% lower in accuracy than DeCAF6 features on this task. This might be because the second-to-last layer encodes more general, discriminative representations of the input, while the last layer focuses on fine-grained details more specific to the network's original training task. By using the second-to-last layer, one may leverage more generalizable features that are helpful for other tasks or datasets.
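A minimal sketch of this evaluation protocol with scikit-learn, assuming precomputed DeCAF6 features (the arrays below are random placeholders, and the dropout the paper applies to the features is omitted):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV, train_test_split

# Placeholders: `decaf6` would be precomputed (N, 4096) DeCAF6 features
# and `y` the Caltech-101 class labels.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(101), 10)          # 10 dummy samples per class
decaf6 = rng.normal(size=(y.size, 4096))

X_tr, X_te, y_tr, y_te = train_test_split(
    decaf6, y, stratify=y, random_state=0)

# Cross-validate the regularization strength on each split,
# mirroring the paper's protocol.
clf = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1.0, 10.0]}, cv=3)
clf.fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```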

They also show how the performance of the two DeCAF6-with-dropout methods above varies with the number of training examples per category. They state that their one-shot learning results suggest that with a sufficiently strong representation like DeCAF, useful models of visual categories can often be learned from just a single positive example.
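A hedged sketch of this varying-training-set-size protocol: train on k labelled examples per class and test on the remainder (again with precomputed features assumed):

```python
import numpy as np
from sklearn.svm import LinearSVC

def accuracy_vs_samples(X, y, ks=(1, 5, 10, 20), seed=0):
    """Train on k examples per class, evaluate on the remainder."""
    rng = np.random.default_rng(seed)
    for k in ks:
        # Sample k training indices from every class.
        train_idx = np.hstack([
            rng.choice(np.flatnonzero(y == c), size=k, replace=False)
            for c in np.unique(y)
        ])
        test_mask = np.ones(len(y), dtype=bool)
        test_mask[train_idx] = False
        clf = LinearSVC().fit(X[train_idx], y[train_idx])
        print(f"{k:>2} per class: {clf.score(X[test_mask], y[test_mask]):.3f}")

# Usage with the placeholder arrays from the previous snippet:
# accuracy_vs_samples(decaf6, y)
```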

Next, they evaluate DeCAF on the task of domain adaptation using the Office dataset. This dataset contains three domains: Amazon, which consists of product images taken from amazon.com; and Webcam and DSLR, which consist of images taken in an office environment using a webcam and a DSLR camera, respectively. To analyze the ability of the model to adapt to different domains, they visually compare DeCAF6 and SURF features using t-SNE. They find that DeCAF not only provides better within-category clustering but also clusters same-category instances across domains, indicating qualitatively that DeCAF removes some of the domain bias between the Webcam and DSLR domains. In the paper's visualization, all images from the scissors class are well clustered and overlapping across both domains with DeCAF, while SURF clusters only a subset and places the others in disjoint parts of the space, closest to distinctly different categories such as chairs and mugs.

To validate this quantitatively, they perform another experiment on the domain shifts Amazon → Webcam and DSLR → Webcam. They compare SURF with DeCAF6 and DeCAF7 features, using an SVM and logistic regression trained in three ways: on source data only (S), target data only (T), and both source and target data (ST). They conclude that DeCAF dramatically outperforms the baseline SURF features shipped with the Office dataset.
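The S/T/ST setup is straightforward to replicate in outline; here is a sketch with logistic regression, where the feature arrays are assumed to be precomputed DeCAF or SURF descriptors:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def adaptation_baselines(Xs, ys, Xt_tr, yt_tr, Xt_te, yt_te):
    """Source-only (S), target-only (T) and combined (ST) baselines,
    all evaluated on held-out target-domain data."""
    splits = {
        "S":  (Xs, ys),
        "T":  (Xt_tr, yt_tr),
        "ST": (np.vstack([Xs, Xt_tr]), np.concatenate([ys, yt_tr])),
    }
    for name, (X, y) in splits.items():
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        print(f"{name}: {clf.score(Xt_te, yt_te):.3f}")
```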

Next, the authors test the performance of DeCAF on the task of subcategory recognition using the Caltech-UCSD Birds dataset. They report that DeCAF together with a simple logistic regression obtains a significant performance increase over existing approaches, indicating that such features, although not specifically designed to model subcategory-level differences, capture this information well.

Finally, they evaluate DeCAF on the SUN-397 large-scale scene recognition database. The authors believe that because they are applying DeCAF to a task for which it was not designed, it might be very challenging for the model unless the features are highly generic representations of the visual world. They train a model using the above strategies with DeCAF6 and DeCAF7, and see a performance improvement over the current state-of-the-art method.

4. Discussion

The paper analyzes the use of deep features applied in a semi-supervised multi-task framework. The authors quantitatively demonstrate that by using a large labelled object database to train a deep convolutional architecture, we can learn features that have sufficient generalization ability to perform a wide variety of tasks, including domain adaptation, fine-grained part-based recognition, and large-scale scene recognition.

5. Final Words

The paper is quite well-written and has exhaustive experiments. It is foundational to deep learning knowledge, as it is one of the first such thorough experimental demonstrations of transfer learning with CNNs.

6. References

Donahue, Jeff, et al. "DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition." International Conference on Machine Learning. PMLR, 2014.
