Paper Summary: ImageNet Classification with Deep Convolutional Neural Networks

Karan Uppal
6 min read · May 21, 2021


Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. NIPS 2012.

Link to original paper

Source: Original Paper

This is a revolutionary paper in the field of Deep Learning that introduces the AlexNet model, a deep convolutional neural network that absolutely demolished the competition in the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) 2012, beating the runner-up by more than 10 percentage points in top-5 error. This is the paper that pioneered the current trend towards convolutional networks and deep learning as a whole. It explores many techniques that are commonplace in deep learning today, such as ReLU activation functions instead of sigmoid/tanh, dropout for regularisation, and training on multiple GPUs.

Major Claims/Findings of the paper:

  • “Deep convolutional neural networks with ReLUs train several times faster than their equivalents with tanh units”
  • “Local normalization scheme aids generalization… We also verified the effectiveness of this scheme on the CIFAR-10 dataset”
  • “We generally observe during training that models with overlapping pooling find it slightly more difficult to overfit”
  • “This technique (dropout) reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons… Dropout roughly doubles the number of iterations required to converge.”
  • “We found that this small amount of weight decay was important for the model to learn. In other words, weight decay here is not merely a regularizer: it reduces the model’s training error.”

However, the most significant finding is that:

Our results show that a large, deep convolutional neural network is capable of achieving record-breaking results on a highly challenging dataset using purely supervised learning.

We’ll explore all these points in detail in the following sections.

Architecture

ReLU Nonlinearity: They demonstrate using a four-layer convolutional neural network (trained on the CIFAR-10 dataset) that networks with ReLUs consistently learn several times faster than equivalent networks with tanh or sigmoid neurons. The reason is that ReLU does not saturate for positive inputs, whereas tanh and sigmoid neurons saturate for inputs of large magnitude, where the derivative is close to zero and learning slows to a crawl.

A four-layer convolutional neural network with ReLUs (solid line) reaches a 25% training error rate on CIFAR-10 six times faster than an equivalent network with tanh neurons (dashed line).
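
To make the comparison concrete, here is a minimal PyTorch-style sketch (my own, not the authors' code) of a small four-layer CNN for CIFAR-10 in which the activation can be swapped between ReLU and tanh; the layer sizes are illustrative only.

```python
import torch.nn as nn

class SmallCNN(nn.Module):
    """A small four-layer CNN for 32x32 CIFAR-10 images with a swappable activation."""
    def __init__(self, activation=nn.ReLU):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), activation(),
            nn.MaxPool2d(2),                                  # 32x32 -> 16x16
            nn.Conv2d(32, 64, kernel_size=3, padding=1), activation(),
            nn.MaxPool2d(2),                                  # 16x16 -> 8x8
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 256), activation(),
            nn.Linear(256, 10),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

relu_net = SmallCNN(nn.ReLU)  # tends to converge noticeably faster in training
tanh_net = SmallCNN(nn.Tanh)  # saturates for large |x|, slowing learning down
```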

Training on multiple GPUs: The architecture was split across two GPUs with 3 GB of memory each, because the network was too large to fit on one. Using two GPUs reduces their top-1 and top-5 error rates by 1.7% and 1.2% respectively, as compared with a net with half as many kernels in each convolutional layer trained on a single GPU.
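
The splitting is a form of model parallelism: half of the kernels live on each GPU, and activations are exchanged only at certain layers. Below is a heavily simplified, hypothetical PyTorch sketch of the idea (it assumes two CUDA devices and does not reproduce AlexNet's exact connectivity).

```python
import torch
import torch.nn as nn

class TwoGPUConvBlock(nn.Module):
    """Toy model-parallel convolution: half of the kernels live on each GPU and
    the outputs are concatenated. In AlexNet this cross-GPU exchange is only
    allowed at certain layers, to limit GPU-to-GPU communication."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.branch0 = nn.Conv2d(in_ch, out_ch // 2, 3, padding=1).to("cuda:0")
        self.branch1 = nn.Conv2d(in_ch, out_ch // 2, 3, padding=1).to("cuda:1")

    def forward(self, x):
        y0 = self.branch0(x.to("cuda:0"))
        y1 = self.branch1(x.to("cuda:1"))
        return torch.cat([y0, y1.to("cuda:0")], dim=1)  # gather on one device
```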

Local Response Normalisation: The authors employed ReLU activation functions, which have the desirable property that their inputs do not need to be normalised to prevent saturation. However, they claim that the local normalisation scheme they applied still aids generalisation (verified on the CIFAR-10 dataset as well).

The scheme takes the activation output by a kernel at a specific spatial position and normalises it by the sum of squared activations at the same position in n adjacent kernel maps. It involves a few hyperparameters (k, n, α, β) whose values were chosen using the validation set. This technique reduced their top-1 and top-5 error rates by 1.4% and 1.2%, respectively.
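
For reference, the paper uses k = 2, n = 5, α = 1e-4 and β = 0.75, and PyTorch ships an implementation of this layer, so a small sketch is enough to see it in action (the input shape is chosen to match AlexNet's first convolutional output):

```python
import torch
import torch.nn as nn

# Local response normalisation with the paper's hyperparameters: each activation
# is normalised by the sum of squares over n=5 adjacent kernel maps.
lrn = nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0)

x = torch.randn(1, 96, 55, 55)   # e.g. the output of AlexNet's first conv layer
y = lrn(x)                       # same shape, normalised across neighbouring channels
```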

Overlapping Pooling: Pooling layers in CNNs summarise the outputs of neighbouring groups of neurons in the same kernel map. In prior work, the stride was set equal to the pooling window size, so that neighbouring pooling windows did not overlap. The authors instead use overlapping pooling, with a 3x3 window and a stride of 2, so that each unit is covered by multiple windows. They also observed during training that models with overlapping pooling find it slightly more difficult to overfit. This scheme reduced their top-1 and top-5 error rates by 0.4% and 0.3%, respectively.
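
In PyTorch terms, the difference is just the relationship between the pooling window and the stride; a quick sketch of my own for illustration:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 96, 55, 55)
non_overlapping = nn.MaxPool2d(kernel_size=2, stride=2)  # traditional: stride equals window size
overlapping     = nn.MaxPool2d(kernel_size=3, stride=2)  # AlexNet: 3x3 windows with stride 2

print(non_overlapping(x).shape)  # torch.Size([1, 96, 27, 27])
print(overlapping(x).shape)      # torch.Size([1, 96, 27, 27]), same size but windows overlap
```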

Overall Architecture: The network contains 8 layers (the first 5 convolutional and the remaining 3 fully connected). The output of the last fully-connected layer is fed to a 1000-way softmax which produces a distribution over the 1000 class labels. Response normalisation layers follow the first and second convolutional layers, while pooling layers follow both response normalisation layers as well as the fifth convolutional layer. Further details of the layers' parameters are given in the table below:

Something which is not apparent from the table is the number of parameters involved. The first convolutional layer has about 35 thousand parameters and the second around 600 thousand, whereas the fully connected layers together comprise more than 55 million parameters (close to 94% of the total number of parameters in the model).

Source: Neurohive
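
To see where those parameters come from, here is a single-stream PyTorch sketch of the architecture (the two-GPU grouping is omitted, as in most modern reimplementations, so the convolutional layers that were split across GPUs end up with roughly double the parameters of the original), together with a per-layer parameter count:

```python
import torch.nn as nn

# Single-stream AlexNet-style network for 227x227 inputs; the two-GPU grouping is omitted.
alexnet = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4),    nn.ReLU(inplace=True),
    nn.LocalResponseNorm(5, alpha=1e-4, beta=0.75, k=2.0),
    nn.MaxPool2d(3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2),  nn.ReLU(inplace=True),
    nn.LocalResponseNorm(5, alpha=1e-4, beta=0.75, k=2.0),
    nn.MaxPool2d(3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(3, stride=2),
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
    nn.Linear(4096, 4096),        nn.ReLU(inplace=True), nn.Dropout(0.5),
    nn.Linear(4096, 1000),
)

# Print per-layer parameter counts to see where the ~60M parameters live.
for name, module in alexnet.named_children():
    n = sum(p.numel() for p in module.parameters())
    if n:
        print(f"{name:>2} {module.__class__.__name__:<18} {n:>12,}")
```

Running this shows that the two 4096-unit fully connected layers dominate the parameter budget, exactly as noted above.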

Reducing Overfitting

Pre-processing: ImageNet consists of variable-resolution images, while their network required a constant input dimensionality. Therefore, they downsampled the images to a fixed resolution of 256x256. Given a rectangular image, the shorter side is rescaled to 256 and then the central 256x256 patch is cropped out. No other pre-processing was done except subtracting the mean activity over the training set from each pixel.
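
A hedged torchvision-style sketch of this preprocessing (the mean image and the file path are placeholders; in practice the mean is computed once over the training set):

```python
import torch
from PIL import Image
from torchvision import transforms

# Placeholder: in practice, precompute the per-pixel mean image over the training set.
mean_image = torch.zeros(3, 256, 256)

preprocess = transforms.Compose([
    transforms.Resize(256),                       # rescale the shorter side to 256
    transforms.CenterCrop(256),                   # take the central 256x256 patch
    transforms.ToTensor(),                        # HWC uint8 -> CHW float in [0, 1]
    transforms.Lambda(lambda x: x - mean_image),  # subtract the mean image
])

img = preprocess(Image.open("some_image.jpg").convert("RGB"))  # path is illustrative
```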

Data Augmentation: The easiest and most common method to reduce overfitting on image data is to artificially enlarge the dataset using label-preserving transformations. The authors use random cropping of 227x227 patches from the 256x256 images (the paper quotes 224x224, but 227 is what makes the layer arithmetic work out) as well as horizontal flipping, increasing the dataset size by a factor of 2048. This step is critical to achieving high-accuracy models: without it the model suffers from substantial overfitting.
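
At training time this boils down to something like the following torchvision sketch (crop size as discussed above):

```python
from torchvision import transforms

# Random 227x227 crops of the 256x256 images plus random horizontal flips;
# with 224x224 crops this yields the factor-of-2048 increase quoted in the paper.
train_augment = transforms.Compose([
    transforms.RandomCrop(227),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
])
```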

During test time, the network makes the prediction by extracting five 227x227 patches (the four corner patches and the centre patch) as well as their horizontal reflections (hence 10 in total) and averages the predictions made by the network’s softmax layer on the 10 patches.
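
A rough sketch of this ten-crop evaluation, assuming a trained model that accepts a batch of 227x227 tensors:

```python
import torch
from torchvision import transforms

# Five 227x227 crops (four corners + centre) and their horizontal reflections.
ten_crop = transforms.Compose([
    transforms.TenCrop(227),
    transforms.Lambda(lambda crops: torch.stack([transforms.ToTensor()(c) for c in crops])),
])

def predict(model, pil_image):
    crops = ten_crop(pil_image)                     # shape (10, 3, 227, 227)
    with torch.no_grad():
        probs = torch.softmax(model(crops), dim=1)  # one distribution per crop
    return probs.mean(dim=0)                        # average over the 10 crops
```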

The second form of data augmentation alters the intensities of the RGB channels, exploiting the fact that object identity in natural images is invariant to changes in the intensity and colour of the illumination. This scheme reduces the top-1 error rate by over 1%.
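
Concretely, the scheme (often called "fancy PCA") eigen-decomposes the 3x3 covariance of RGB pixel values over the training set and adds a random multiple of the principal components to every pixel. A NumPy sketch under my own naming, with placeholder data standing in for the real training pixels:

```python
import numpy as np

def fancy_pca(image, eigvals, eigvecs, sigma=0.1):
    """image: HxWx3 float array; eigvals/eigvecs: from the 3x3 covariance
    of RGB pixel values over the training set (precomputed once)."""
    alphas = np.random.normal(0.0, sigma, size=3)  # drawn once per image
    shift = eigvecs @ (alphas * eigvals)           # 3-vector added to every pixel
    return image + shift

# Placeholder statistics; in practice use (a large sample of) real training pixels.
pixels = np.random.rand(100000, 3)
eigvals, eigvecs = np.linalg.eigh(np.cov(pixels, rowvar=False))
augmented = fancy_pca(np.random.rand(256, 256, 3), eigvals, eigvecs)
```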

Dropout: They apply dropout with a probability of 0.5 in the first two fully connected layers of the network, which is in my opinion very heavy regularisation. The paper beautifully captures the aim of dropout:

This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons.

Without dropout, their network exhibited substantial overfitting; with it, the number of iterations required to converge roughly doubled.
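
A tiny PyTorch demonstration of what dropout at p = 0.5 actually does to a layer's outputs (modern "inverted" dropout rescales the survivors at training time, whereas the paper instead multiplies outputs by 0.5 at test time; the effect is equivalent):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
activations = torch.ones(1, 8)  # pretend these are the outputs of a fully connected layer

# Training mode: each unit is zeroed with probability 0.5,
# and the survivors are scaled by 1/(1-p) = 2.
print(F.dropout(activations, p=0.5, training=True))

# Test mode: dropout is a no-op and the full network is used.
print(F.dropout(activations, p=0.5, training=False))
```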

Training Process

There is nothing particularly monumental in this part. The authors employed mini-batch stochastic gradient descent with momentum, weight decay, and learning-rate decay. The weights in each layer were initialised from a zero-mean Gaussian distribution. Overall training took five to six days on two NVIDIA GTX 580 3 GB GPUs. The results, however, were ground-breaking: they completely shattered the previous best in the ImageNet challenge.
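
For the record, the paper's settings are momentum 0.9, weight decay 0.0005, a batch size of 128, an initial learning rate of 0.01 divided by 10 whenever the validation error stops improving, and weights drawn from a zero-mean Gaussian with standard deviation 0.01. A hedged PyTorch sketch, reusing the alexnet module defined earlier:

```python
import torch
import torch.nn as nn

def init_weights(module):
    # Zero-mean Gaussian with std 0.01, as in the paper; all biases zeroed here
    # for brevity (the paper initialises the biases of some layers to 1).
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.normal_(module.weight, mean=0.0, std=0.01)
        nn.init.zeros_(module.bias)

model = alexnet          # the network sketched in the Architecture section
model.apply(init_weights)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)
# Divide the learning rate by 10 when the monitored validation error plateaus.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.1)
```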

Final Remarks

They convincingly demonstrate that a large, deep convolutional neural network is capable of achieving record-breaking results on a highly challenging dataset using purely supervised learning, without any pre-training. They showcase that techniques like ReLU and dropout are vital to training deep models. They note that the network's performance degrades if even a single convolutional layer is removed, indicating that depth is highly important. They also chose not to use any unsupervised pre-training, even though they expected it to help, in order to simplify their experiments.
