Paper Summary: Dropout: A Simple Way to Prevent Neural Networks from Overfitting

Karan Uppal
7 min read · May 30, 2023


Srivastava, Nitish, et al. The Journal of Machine Learning Research (2014)

Link to original paper.

My first introduction to dropout was in Andrew Ng’s Deep Learning Specialisation on Coursera, which gave a pretty neat intuition of how and why dropout works using a simple cat-classifier example that has stuck with me ever since. Although the technique had already been used in the AlexNet paper, this manuscript provides an in-depth understanding and analysis of dropout, backed by thorough experimentation.

1. Introduction

The paper starts off by stating that model combination nearly always improves performance. However, for a combination of neural networks to help, the individual models should either be trained on different subsets of the data or have different architectures; both of these scenarios pose challenges for large networks.

The Bayesian gold standard is to average the predictions of several different models, weighting each by its posterior probability given the training data. Dropout proposes an approximation of this wherein an equally weighted geometric mean is taken over the predictions of an exponential number of learned models that share parameters. Applying the technique is equivalent to sampling a sub-network of the full architecture, with all sub-networks sharing their parameters.
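
As a rough sketch of this averaging scheme (the notation here is mine, not the paper’s): a network with n droppable units defines up to 2^n thinned sub-networks, and the ideal ensemble prediction would be their normalised, equally weighted geometric mean:

```latex
P_{\text{ens}}(y \mid x) \;\propto\; \Bigg( \prod_{m=1}^{M} P_m(y \mid x) \Bigg)^{1/M},
\qquad M = 2^{n}
```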

2. Motivation

The authors claim that the motivation for this technique comes from one of the drivers of mankind: sex. In sexual reproduction, genes are mixed with those of a random partner, so a gene cannot rely on a fixed set of co-adapted partners and must be useful in combination with many different gene sets. Dropout applies the same pressure to hidden units: each unit must learn something useful without depending on specific other units being present.

3. Model Description

With dropout, the feed-forward operation becomes as shown below, where r is a vector of independent Bernoulli random variables, each of which has probability p of being 1. This amounts to sampling a sub-network from the larger network.
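
Reconstructed from the paper (f is the activation function, ∗ denotes element-wise multiplication, and l indexes layers):

```latex
r_j^{(l)} \sim \mathrm{Bernoulli}(p), \qquad
\tilde{\mathbf{y}}^{(l)} = \mathbf{r}^{(l)} \ast \mathbf{y}^{(l)}, \qquad
z_i^{(l+1)} = \mathbf{w}_i^{(l+1)} \tilde{\mathbf{y}}^{(l)} + b_i^{(l+1)}, \qquad
y_i^{(l+1)} = f\left(z_i^{(l+1)}\right)
```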

The change in architecture is that, for each training case, some units are temporarily removed from the network along with all their incoming and outgoing connections (Figure 1 of the paper).

At test time, the weights W are scaled as pW and the resulting neural network is used without dropout.
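
A minimal NumPy sketch of this train/test asymmetry (the layer sizes, ReLU activation and p = 0.5 are illustrative choices, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5  # probability of retaining a unit (illustrative value)

def dropout_layer(y, W, b, train=True):
    """Forward pass through one layer with dropout applied to its inputs y.

    At training time the inputs are masked by a Bernoulli(p) vector, which
    samples a thinned sub-network; at test time the weights are scaled to pW
    and no units are dropped.
    """
    if train:
        r = rng.binomial(1, p, size=y.shape)      # r_j ~ Bernoulli(p)
        return np.maximum(0.0, (r * y) @ W + b)   # forward pass on the thinned net
    return np.maximum(0.0, y @ (p * W) + b)       # test time: use pW, no dropout

# Illustrative shapes: batch of 4 examples, 8 incoming units, 16 outgoing units.
y = rng.standard_normal((4, 8))
W = rng.standard_normal((8, 16))
b = np.zeros(16)
print(dropout_layer(y, W, b, train=True).shape)   # (4, 16)
print(dropout_layer(y, W, b, train=False).shape)  # (4, 16)
```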

4. Learning Dropout Nets

Dropout neural networks can be trained using stochastic gradient descent in a manner similar to standard neural nets, but forward and backpropagation are performed only on the thinned network sampled for each training case. The authors state that one particular regularisation technique was found to be especially useful with dropout: max-norm regularisation. It constrains the norm of the incoming weight vector of each hidden unit to be bounded above by a fixed constant c, i.e. ||w||₂ ≤ c; whenever an update pushes w outside this ball, w is projected back onto its surface.
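
Below is a minimal NumPy sketch of this projection step; the weight layout (one column per hidden unit) and c = 3.0 are my own illustrative choices, not values from the paper:

```python
import numpy as np

def max_norm_project(W, c=3.0):
    """Clip the L2 norm of each hidden unit's incoming weight vector to at most c.

    W has shape (inputs, hidden units), so each column holds one unit's
    incoming weights; columns whose norm exceeds c are rescaled onto the ball.
    """
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return W * scale

# Typical use inside a training loop (lr and grad_W come from the optimiser):
# W = max_norm_project(W - lr * grad_W, c=3.0)
```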

They further state that dropout, combined with max-norm regularisation, a large decaying learning rate and high momentum, provides a significant boost over using dropout alone. They justify this with the following intuition: constraining the weight vectors to lie inside a ball of fixed radius makes it possible to use a huge learning rate without the weights blowing up. The noise provided by dropout then allows the optimisation process to explore regions of the weight space that would otherwise have been difficult to reach. As the learning rate decays, the optimisation takes shorter steps, doing less exploration and eventually settling into a minimum.

The authors also note that dropout can be used to finetune pretrained networks, provided the pretrained weights are first scaled up by a factor of 1/p.

5. Experimental Results

The authors trained dropout neural networks for classification problems on a variety of datasets, demonstrating that dropout is a general technique for improving neural networks and is not specific to any particular application domain.

They achieve state-of-the-art performance on almost all of these datasets, and dropout networks even won the ILSVRC-2012 competition. One thing they note is that the improvement on the text dataset was much smaller than on the vision and speech datasets.

They further compare the technique with Bayesian neural networks (BNNs). BNNs offer a proper way of performing model averaging over the space of neural network architectures and parameters, with each model weighted by its posterior probability given the training data. Dropout, on the other hand, performs an equally weighted average of exponentially many models with shared parameters. The authors compare BNNs with dropout networks on a dataset where BNNs are known to obtain state-of-the-art results. They find that BNNs perform better than dropout, but dropout improves significantly upon standard neural networks and outperforms all the other methods compared.

6. Salient Features

The authors now explore the effect dropout has on the quality of the learned features and on the sparsity of hidden unit activations; the effect of varying the dropout rate and the size of the training set; and a comparison of dropout with Monte-Carlo model averaging.

In a standard neural network, each parameter receives a gradient telling it how it should change to reduce the loss, given what all the other units are doing. This may lead to co-adaptations between units that do not generalise to unseen data. Dropout prevents co-adaptation by making the presence of any other hidden unit unreliable: a hidden unit must perform well in a variety of different contexts provided by the other hidden units. The paper's visualisation of features learned on MNIST shows that hidden units trained with dropout seem to detect edges and spots in different parts of the image, whereas the features learned without dropout are much harder to interpret.

The authors also note that a side-effect of using dropout is sparse representations, even when no sparsity-inducing regularizers are present.

Next, the authors experiment with different values of p, keeping the number of hidden units n constant. In this case, a small p means very few units are active during training, leading to underfitting. As p increases, the test error goes down, stays roughly flat over a broad range, and then rises again as p gets close to 1.

In another variation of the same experiment, the authors keep pn constant while varying p. Networks with a small value of p then have a large number of hidden units, and vice versa (for example, if pn = 256, then p = 0.5 corresponds to 512 hidden units and p = 0.8 to 320); however, the test networks will be of different sizes. In this case, values of p close to 0.6 seem to perform best for their choice of pn.

Next, the authors experiment to gauge the effect of dataset size when dropout is used. They observe that for very small datasets (100 and 500 examples), dropout does not give any improvement: the model has enough parameters to overfit the training data anyway. As the size of the dataset increases, the gain from dropout grows and then declines.

At test time, dropout approximates model combination by scaling down the weights of the trained network. A more correct way of averaging is to sample k thinned networks per test case using dropout and average their predictions; as k tends to infinity, this Monte-Carlo estimate approaches the true model average. The paper's experiment shows that by around k = 50 the Monte-Carlo method becomes as good as the approximate method, suggesting that weight scaling is a fairly good approximation of the true model average.
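
A rough sketch of the two prediction schemes being compared, for a single softmax layer (the shapes, p = 0.5 and k = 50 are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5  # illustrative retention probability

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predict_weight_scaling(x, W, b):
    """Approximate averaging: a single forward pass with the weights scaled by p."""
    return softmax(x @ (p * W) + b)

def predict_monte_carlo(x, W, b, k=50):
    """Monte-Carlo averaging: sample k dropout masks and average the k predictions."""
    preds = [softmax((rng.binomial(1, p, size=x.shape) * x) @ W + b) for _ in range(k)]
    return np.mean(preds, axis=0)

x = rng.standard_normal((1, 8))
W = rng.standard_normal((8, 3))
b = np.zeros(3)
print(predict_weight_scaling(x, W, b))   # one-shot approximation
print(predict_monte_carlo(x, W, b))      # converges to the true model average as k grows
```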

7. Multiplicative Gaussian Noise

This section offers an interesting insight wherein the authors generalise dropout to multiplying the activations by random variables drawn from other distributions. They state that multiplying by a random variable drawn from N(1, 1) works just as well as, or perhaps better than, Bernoulli noise. This part is somewhat math-heavy, and it is best read directly from the paper.
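
A minimal sketch of the Gaussian variant next to standard Bernoulli dropout (the shapes and p are illustrative; the paper's analysis of how the noise variance relates to p is more involved):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5  # illustrative retention probability for the Bernoulli case

def bernoulli_dropout(y):
    """Standard dropout: multiply each activation by a Bernoulli(p) variable."""
    return rng.binomial(1, p, size=y.shape) * y

def gaussian_dropout(y):
    """Gaussian variant: multiply each activation by noise drawn from N(1, 1)."""
    return rng.normal(loc=1.0, scale=1.0, size=y.shape) * y

y = rng.standard_normal((4, 8))
print(bernoulli_dropout(y)[0])
print(gaussian_dropout(y)[0])
```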

8. Conclusion

The authors conclude that dropout is a general technique for improving neural networks by reducing overfitting. It breaks up co-adaptations by making the presence of any particular hidden unit unreliable, and thereby reduces overfitting. One drawback of dropout is that it increases training time: a dropout network typically takes 2–3 times longer to train than a standard neural network of the same architecture, largely because the parameter updates are very noisy. However, it is likely that this very stochasticity is what helps prevent overfitting.

9. Final Words

The paper is extremely well written and quite easy to understand, while providing an in-depth analysis of one of the most important training techniques.

10. References

Srivastava, Nitish, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. "Dropout: A Simple Way to Prevent Neural Networks from Overfitting." Journal of Machine Learning Research 15 (2014): 1929–1958.
