Paper Summary: Understanding the difficulty of training deep feedforward neural networks
Glorot, Xavier, and Yoshua Bengio. JMLR Workshop and Conference Proceedings, 2010
This paper is one of the foundational works of deep learning, best known for introducing the Xavier (normalized) initialisation and for using visualisations of activations and gradients to analyze the training process.
Abstract: Whereas before 2006 it appears that deep multi-layer neural networks were not successfully trained, since then several algorithms have been shown to successfully train them, with experimental results showing the superiority of deeper vs less deep architectures. All these experimental results were obtained with new initialization or training mechanisms. Our objective here is to understand better why standard gradient descent from random initialization is doing so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future. We first observe the influence of the non-linear activations functions. We find that the logistic sigmoid activation is unsuited for deep networks with random initialization because of its mean value, which can drive especially the top hidden layer into saturation. Surprisingly, we find that saturated units can move out of saturation by themselves, albeit slowly, and explaining the plateaus sometimes seen when training neural networks. We find that a new non-linearity that saturates less can often be beneficial. Finally, we study how activations and gradients vary across layers and during training, with the idea that training may be more difficult when the singular values of the Jacobian associated with each layer are far from 1. Based on these considerations, we propose a new initialization scheme that brings substantially faster convergence.
The effectiveness of deep learning models relies heavily on the choice of activation function and initialisation. This paper analyzes the effect of various activation functions on the training process and proposes an initialisation scheme that works better than the simple heuristics used at the time, reducing the need for unsupervised pretraining.
1. Deep Neural Networks
The authors note that training deep networks with standard gradient descent from random initialisation falls behind the newer methods in this domain (most notably unsupervised pretraining). They wish to understand why, and in doing so provide a plausible explanation for the effectiveness of unsupervised pretraining.
They further state that the objective of the paper is to analyze the activations (watching for saturation of hidden units) and the gradients, across layers and across training iterations, and to evaluate how the choice of activation function and initialisation affects them.
2. Experimental Setup
They use 4 datasets for their experiments: MNIST digits, CIFAR-10, Small ImageNet, and their self-created synthetic dataset Shapeset-3×2 (images containing one or two simple 2-D shapes such as triangles, parallelograms, and ellipses).
They use feedforward neural networks with one to five hidden layers, each containing 1000 hidden units, and a softmax logistic regression output layer. The learning rate of each model is tuned based on the validation set error.
They ran all experiments with 3 different activation functions: Sigmoid, Hyperbolic Tangent and SoftSign. To compare them properly, they searched for the best hyperparameters (learning rate and depth) separately for each model.
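For reference, here is a quick NumPy sketch of the three activations. The definitions are standard; the softsign, x / (1 + |x|), is the least familiar of the three and approaches its asymptotes -1 and 1 polynomially rather than exponentially:

```python
import numpy as np

def sigmoid(x):
    # range (0, 1); not centred around 0, which turns out to matter
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # range (-1, 1); symmetric around 0
    return np.tanh(x)

def softsign(x):
    # range (-1, 1); derivative 1/(1+|x|)^2, so it saturates polynomially rather than exponentially
    return x / (1.0 + np.abs(x))
```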
Finally, every model is initialised with biases set to 0 and weights drawn from the commonly used heuristic W_ij ~ U[−1/√n, 1/√n], where U is the uniform distribution and n is the size of the previous layer (the fan-in).
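A minimal sketch of this standard initialisation (the function name and the use of NumPy are my own, not from the paper):

```python
import numpy as np

def standard_init(n_in, n_out, rng=None):
    """Commonly used heuristic: W_ij ~ U[-1/sqrt(n_in), 1/sqrt(n_in)], biases at 0."""
    rng = rng or np.random.default_rng()
    bound = 1.0 / np.sqrt(n_in)
    W = rng.uniform(-bound, bound, size=(n_in, n_out))
    b = np.zeros(n_out)
    return W, b
```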
3. Effect of Activation Functions and Saturation During Training
The thing that caught my eye was the visualisations used here: this was probably one of the first times someone used them to analyze the training of a deep neural network. So please don't mind if I spend the majority of the blog discussing them :P
They first experimented with the sigmoid activation function and found that the activation values of the last hidden layer are quickly pushed towards 0 (saturation), while the other layers have a mean activation above 0.5 which decreases as we go from the input layer towards the output layer. They also found that the last hidden layer may eventually move out of this saturation zone, although it can take many training iterations (the corresponding figure in the paper shows a depth-4 model; the depth-5 model never recovered).
With the tanh activation function, they did not observe the kind of top-hidden-layer saturation seen with sigmoid networks, thanks to its symmetry around 0. One strange thing, however, was a sequentially occurring saturation phenomenon starting at layer 1 and propagating up the network, which can be seen in the top graph of Figure 3. The reason for this is unknown to the authors.
With the softsign activation function, the saturation does not occur one layer after the other as it does for tanh: it is faster at the beginning and then slows down, and all layers move together towards larger weights.
Another interesting visualisation shows the normalized histograms of activation values at the end of learning, averaged across units of the same layer (Figure 4 in the paper). The top graph uses the tanh activation function while the bottom uses the softsign function.
The tanh graph has modes of the distribution of the activation mostly at the extremes (asymptotes -1 and 1) or around 0, while the softsign network has modes of activation around its knees (between the linear regime around 0 and the flat regime around -1 and 1). These are the areas where there is substantial non-linearity but where the gradients would flow well.
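As a rough sketch of how one might reproduce this kind of diagnostic (the helper names are mine, not the paper's):

```python
import numpy as np

def activation_histogram(acts, bins=50, value_range=(-1.0, 1.0)):
    """Normalized histogram of one layer's activation values, collected over the
    data at the end of training (the quantity plotted per layer in Figure 4)."""
    hist, edges = np.histogram(acts.ravel(), bins=bins, range=value_range, density=True)
    return hist, edges

def saturation_fraction(acts, threshold=0.99):
    """Fraction of tanh/softsign units with |activation| above `threshold`;
    a crude proxy for how saturated a layer is."""
    return float((np.abs(acts) > threshold).mean())
```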
4. Studying Gradients and their Propagation
They did a comparative study between the negative log-likelihood (cross-entropy) loss and the quadratic cost (squared error applied to the softmax outputs for classification). Unsurprisingly, they found that the negative log-likelihood loss worked better, which the paper attributes to a nicer training criterion surface with fewer plateaus.
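A minimal sketch of the two costs being compared, assuming one-hot targets and a softmax output layer (the helper names are mine):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)                  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll_cost(probs, targets):
    # negative log-likelihood (cross-entropy) with one-hot targets
    return -np.mean(np.log((probs * targets).sum(axis=1) + 1e-12))

def quadratic_cost(probs, targets):
    # squared error between the softmax outputs and the one-hot targets
    return 0.5 * np.mean(((probs - targets) ** 2).sum(axis=1))

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))
targets = np.eye(10)[[3, 1, 4, 1]]                        # one-hot labels
probs = softmax(logits)
print(nll_cost(probs, targets), quadratic_cost(probs, targets))
```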
Next, they studied how information should flow through the network, from both the forward-propagation and back-propagation points of view, and derived a new initialisation scheme from the requirement that the variances of activations and of back-propagated gradients stay roughly constant across layers. This part is a bit mathematical and a rough derivation can be found here (please do feel free to correct me if I made any errors). The conclusion is their normalized initialisation, W ~ U[−√6/√(n_j + n_{j+1}), √6/√(n_j + n_{j+1})], where n_j and n_{j+1} are the fan-in and fan-out of the layer. One important thing highlighted here is that, with the standard initialisation, the variance of the back-propagated gradients gets smaller as it is propagated downwards (towards the input).
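A minimal sketch of the normalized initialisation (this is what most frameworks now call Glorot or Xavier uniform initialisation; the function name is mine):

```python
import numpy as np

def normalized_init(n_in, n_out, rng=None):
    """Normalized ('Xavier') init: W ~ U[-sqrt(6)/sqrt(n_in + n_out), +sqrt(6)/sqrt(n_in + n_out)]."""
    rng = rng or np.random.default_rng()
    bound = np.sqrt(6.0) / np.sqrt(n_in + n_out)
    return rng.uniform(-bound, bound, size=(n_in, n_out))
```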
Finally, to validate the above theoretical ideas, they ran several experiments and observed the activations and gradients. Again, pretty great visualisations are shown which give the reader an intuitive understanding of all this.
Firstly, with the normalized initialisation the activation values are spread out much more evenly across layers than without it.
As stated above, the variance of the back-propagated gradients becomes smaller as we move down towards the input layer, but the variance of the weight gradients is roughly the same across all layers (with the standard initialisation). This is explained by their theoretical analysis and is tackled by their normalized initialisation.
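To make this concrete, here is a toy sketch (my own illustration, not the paper's experiment) that does one forward/backward pass through a tanh network with the standard initialisation and prints the per-layer variances; a random gradient at the top stands in for the gradient of the cost:

```python
import numpy as np

rng = np.random.default_rng(0)

# 5 tanh layers of 1000 units, standard U[-1/sqrt(n), 1/sqrt(n)] initialisation
sizes = [1000] * 6
Ws = [rng.uniform(-1 / np.sqrt(n_in), 1 / np.sqrt(n_in), size=(n_in, n_out))
      for n_in, n_out in zip(sizes[:-1], sizes[1:])]

x = rng.normal(size=(256, sizes[0]))            # a random mini-batch
acts = [x]
for W in Ws:                                    # forward pass
    acts.append(np.tanh(acts[-1] @ W))

delta = rng.normal(size=acts[-1].shape)         # stand-in for dCost/d(top activations)
for i in reversed(range(len(Ws))):
    delta = delta * (1.0 - acts[i + 1] ** 2)    # back through the tanh non-linearity
    dW = acts[i].T @ delta / x.shape[0]         # weight gradient for layer i
    print(f"layer {i}: var(backprop grad) = {delta.var():.3e}, var(dW) = {dW.var():.3e}")
    delta = delta @ Ws[i].T                     # propagate one layer down
```

Running this, the variance of the back-propagated gradient shrinks as it moves down the stack, while the variance of the weight gradients stays roughly constant across layers, which is the pattern the paper reports for the standard initialisation.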
5. Conclusion
- Monitoring activations and gradients across layers and training iterations is a powerful investigative tool for understanding training difficulties in deep nets.
- Sigmoid activations (not symmetric around 0) should be avoided when initializing from small random weights, because they yield poor learning dynamics, with initial saturation of the top hidden layer.
- The proposed normalized initialization can be quite helpful, presumably because the layer-to-layer transformations maintain the magnitudes of activations (flowing upward) and gradients (flowing backward), and it eliminates a good part of the discrepancy between purely supervised deep networks and ones pre-trained with unsupervised learning.
6. Final Words
The paper is extremely well written and is quite easy to understand. The visualisations are pretty great and provide a more in-depth view of the training process.
7. Reference
Glorot, Xavier, and Yoshua Bengio. “Understanding the difficulty of training deep feedforward neural networks.” Proceedings of the thirteenth international conference on artificial intelligence and statistics. JMLR Workshop and Conference Proceedings, 2010.