Paper Summary: On the importance of initialization and momentum in deep learning

Karan Uppal
Jul 28, 2022 · 7 min read


Sutskever, Ilya, et al. International conference on machine learning. PMLR, 2013

Link to original paper

This paper showcases how momentum, combined with a well-designed random initialisation of neural networks, can improve the training process.

Abstract: Deep and recurrent neural networks (DNNs and RNNs respectively) are powerful models that were considered to be almost impossible to train using stochastic gradient descent with momentum. In this paper, we show that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs (on datasets with long-term dependencies) to levels of performance that were previously achievable only with Hessian-Free optimization. We find that both the initialization and the momentum are crucial since poorly initialized networks cannot be trained with momentum and well-initialized networks perform markedly worse when the momentum is absent or poorly tuned. Our success training these models suggests that previous attempts to train deep and recurrent neural networks from random initializations have likely failed due to poor initialization schemes. Furthermore, carefully tuned momentum methods suffice for dealing with the curvature issues in deep and recurrent network training objectives without the need for sophisticated second-order methods.

The effectiveness of stochastic gradient descent can be increased by a well-designed random initialisation and by the use of momentum methods. The paper explores the use of momentum for training DNNs and shows how a first-order method can effectively train a neural network without the need for complex second-order methods.

1. Introduction

The authors highlight a state-of-the-art optimization method called Hessian-Free optimization (HF), which they later use as a baseline for their proposed approach.

The authors further highlight that momentum performs better than even HF at some tasks, and they analyse momentum methods under different types of initialisation. They study these methods on two tasks: training a deep autoencoder and training an RNN.

Previous work had used momentum to train neural network models, but its importance was never fully recognised. The authors offer an explanation for this in the paper.

2. Momentum and Nesterov’s Accelerated Gradient

The authors first describe classical momentum (CM); its update equations are sketched below. The basic idea behind CM is that it accumulates a velocity vector in directions of persistent reduction in the objective across iterations. Directions of low curvature, along which the reduction changes only slowly but persists, tend to accumulate across iterations and are hence amplified by CM.
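As a reference, here is a minimal sketch of one CM step (the function grad_f, the parameter names eps and mu, and the default values are mine, not the paper's code):

```python
import numpy as np

def cm_step(theta, v, grad_f, eps=0.01, mu=0.9):
    """One step of classical momentum (CM):
        v_{t+1}     = mu * v_t - eps * grad_f(theta_t)
        theta_{t+1} = theta_t + v_{t+1}
    """
    v_next = mu * v - eps * grad_f(theta)
    return theta + v_next, v_next

# Toy usage: minimise f(x) = 0.5 * ||x||^2, whose gradient is x.
theta, v = np.ones(3), np.zeros(3)
for _ in range(100):
    theta, v = cm_step(theta, v, grad_f=lambda x: x)
```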

Nesterov’s Accelerated Gradient (NAG) is described next by the authors; its update equations are sketched below.

While CM computes the gradient update from the current position θt, NAG first performs a partial update to θt, computing θt + μvt, which is similar to θt+1, but missing the as yet unknown correction. This benign-looking difference seems to allow NAG to change v in a quicker and more responsive way, letting it behave more stably than CM in many situations, especially for higher values of μ.
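The same sketch for NAG differs only in where the gradient is evaluated (again, names and defaults are mine, not the paper's code):

```python
def nag_step(theta, v, grad_f, eps=0.01, mu=0.9):
    """One step of Nesterov's accelerated gradient (NAG):
        v_{t+1}     = mu * v_t - eps * grad_f(theta_t + mu * v_t)  # gradient at the partial update
        theta_{t+1} = theta_t + v_{t+1}
    """
    v_next = mu * v - eps * grad_f(theta + mu * v)
    return theta + v_next, v_next
```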

The reason behind the effectiveness of NAG over CM is explained via an example in the paper. While each iteration of NAG may only be slightly more effective than CM at correcting a large and inappropriate velocity, this difference in effectiveness can compound as the algorithms iterate.

Next, some mathematics is given, which is elucidated in the appendix. To understand it better, I'd suggest the reading linked at the end of this summary: a Distill article with some cool visualisations. The conclusions from this analysis are as follows:

  • CM and NAG become equivalent when ε is small (when ελ << 1 for every eigenvalue λ of the quadratic's curvature matrix A), so NAG and CM are distinct only when ε is reasonably large.
  • When ε is relatively large, NAG uses a smaller effective momentum for the high-curvature eigen-directions, which prevents oscillations (or divergence) and thus allows the use of a larger µ than is possible with CM for a given ε (a sketch of this relation follows below).
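To make the second point concrete, here is my paraphrase of the appendix's argument for a single eigen-direction with eigenvalue λ (x denotes the coordinate along that direction):

```latex
\begin{aligned}
\text{CM:}\quad  v_{t+1} &= \mu v_t - \varepsilon \lambda x_t, \\
\text{NAG:}\quad v_{t+1} &= \mu v_t - \varepsilon \lambda (x_t + \mu v_t)
                          = \mu (1 - \varepsilon \lambda)\, v_t - \varepsilon \lambda x_t.
\end{aligned}
```

So along that direction NAG behaves like CM with an effective momentum of µ(1 − ελ): essentially the same when ελ << 1, but noticeably smaller (and hence more damped) on high-curvature directions when ε is large.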

3. Deep Autoencoders

Next, the authors test their claims through experiments on training deep autoencoders. Their aim is to investigate the performance of momentum from well-designed random initialisations, to explore the scheduling of µ, and to compare NAG and CM. The networks used the standard sigmoid nonlinearity and were initialized using the “sparse initialization” technique (SI).

In this scheme, each unit is connected to 15 randomly chosen units in the previous layer, whose weights are drawn from a unit Gaussian, and the biases are set to zero. The weights are further scaled by a factor that depends on the type of activation function used.
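A minimal sketch of how SI could be implemented for a dense weight matrix; the function name, the (n_in, n_out) layout and the scale argument are my own, with scale standing in for the activation-dependent factor mentioned above:

```python
import numpy as np

def sparse_init(n_in, n_out, num_connections=15, scale=1.0, rng=None):
    """Sparse initialization (SI): each unit gets `num_connections` incoming
    weights drawn from a unit Gaussian (times `scale`); all other weights and
    all biases are zero."""
    rng = np.random.default_rng() if rng is None else rng
    W = np.zeros((n_in, n_out))
    for j in range(n_out):
        idx = rng.choice(n_in, size=min(num_connections, n_in), replace=False)
        W[idx, j] = scale * rng.standard_normal(len(idx))
    b = np.zeros(n_out)
    return W, b
```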

The momentum coefficient µ follows a schedule rather than staying fixed (sketched below), building on momentum schedules suggested in Nesterov's earlier work; the authors also find that reducing µ towards the end of training helps with fine-tuning, for reasons explained ahead.
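For concreteness, here is the schedule as I read it from the paper, with µmax capping the value:

```python
def momentum_schedule(t, mu_max=0.995):
    """Momentum schedule (my reading of the paper):
        mu_t = min(1 - 2**(-1 - log2(t // 250 + 1)), mu_max)
             = min(1 - 1 / (2 * (t // 250 + 1)), mu_max),
    which steps mu up from 0.5 towards 1 every 250 iterations, capped at mu_max."""
    return min(1.0 - 1.0 / (2.0 * (t // 250 + 1)), mu_max)
```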

The results show that NAG achieves lower error than the best previously published results on this set of problems, including those obtained with HF. They also show that larger values of µmax tend to achieve better performance, and that NAG usually outperforms CM, especially when µmax is 0.995 or 0.999.

They further highlight that reducing the momentum coefficient after the ‘transient stage’ of training allows finer convergence to take place, which would not be possible with high values of the momentum coefficient, since they make CM/NAG too aggressive. While a large value of µ allows the momentum methods to make useful progress along slowly-changing directions of low curvature, this may not immediately result in a significant reduction in error, due to the failure of these methods to converge in the more turbulent high-curvature directions (which is especially hard when µ is large). Nevertheless, this progress in low-curvature directions takes the optimizers to new regions of the parameter space that are characterized by closer proximity to the optimum.

4. Recurrent Neural Networks

The authors state that they found momentum-accelerated SGD can successfully train such RNNs on various artificial datasets exhibiting considerable long-range temporal dependencies. This is unexpected, because RNNs were believed to be almost impossible to train successfully on such datasets with first-order methods, due to difficulties such as vanishing/exploding gradients. They also discuss the type of initialisation scheme to use for RNNs.

Their results show that despite the considerable long-range dependencies present in training data for these problems, RNNs can be successfully and robustly trained to solve them, through the use of the initialization discussed, momentum of the NAG type, a large µ, and a particularly small learning rate.
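As a rough illustration of that recipe (not the paper's code), here is how it might be configured in PyTorch, whose nesterov=True flag gives a NAG-style update; the model and the numeric values are placeholders:

```python
import torch

# Placeholder RNN; the paper's tasks and architectures differ.
model = torch.nn.RNN(input_size=4, hidden_size=100, batch_first=True)

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=1e-4,        # "particularly small" learning rate (illustrative value)
    momentum=0.99,  # large mu (illustrative; the paper tunes/schedules this)
    nesterov=True,  # NAG-style momentum update
)
```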

5. Momentum and HF

In this section, the authors compare HF with momentum methods and draw an analogy between the two: HF's practice of initialising each conjugate-gradient run from the previous update direction gives it momentum-like behaviour. They use this connection to develop a more momentum-like version of HF which combines some of the advantages of both methods (see Table 1 of the paper).

6. Discussion

  • A large part of the remaining performance gap that is not addressed by using a well-designed random initialization is in fact addressed by careful use of momentum-based acceleration, although careful attention must be paid to the momentum constant µ.
  • Momentum-accelerated SGD, despite being a first-order approach, is capable of accelerating directions of low curvature.

7. Final Words

The paper turns a bit mathematical at the end of Section 2, which can be difficult to follow. I found this blog on Distill which explains that part with amazing visualisations. Do read it if you want a more in-depth analysis of momentum.

8. References

  • Sutskever, Ilya, et al. “On the importance of initialization and momentum in deep learning.” International conference on machine learning. PMLR, 2013.
  • Goh, “Why Momentum Really Works”, Distill, 2017. http://doi.org/10.23915/distill.00006
