Paper Summary: Adam: A Method for Stochastic Optimization
Kingma, Diederik P., and Jimmy Ba. arXiv preprint arXiv:1412.6980 (2014)
When you pick up a paper and come across an abstract like this on the first page, you know it is going to be an interesting read.
Abstract: We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.
Stochastic gradient-based optimization is of core practical importance in many fields of science and engineering. The authors propose Adam, a method for efficient stochastic optimization that only requires first-order gradients and has little memory requirement. They also discuss its theoretical convergence properties and show that the convergence rate is comparable to the current best-known results. Finally, they introduce a variant of Adam based on the infinity norm.
1. Introduction
The authors introduce their optimizer, Adam, whose name is derived from adaptive moment estimation. It is designed to combine the advantages of AdaGrad and RMSProp, so that it works well with sparse gradients as well as in online/non-stationary settings. They highlight further advantages: invariance to diagonal rescaling of the gradients, a stepsize whose magnitude is approximately bounded by the learning rate, and a natural form of step size annealing.
2. Algorithm
The pseudo-code for Adam is sketched below. The algorithm updates exponential moving averages of the gradient (mt) and the squared gradient (vt), where the hyper-parameters β1, β2 ∈ [0, 1) control the exponential decay rates of these moving averages. The moving averages themselves are estimates of the 1st moment (the mean) and the 2nd raw moment (the uncentered variance) of the gradient. However, these averages are initialised at zero, leading to moment estimates that are biased towards zero during the initial steps. This is corrected by applying bias correction to both mt and vt.
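A minimal Python sketch of a single Adam update, following the paper's Algorithm 1 (the function name, variable names, and the toy objective below are my own, chosen for illustration):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with the paper's default hyper-parameters."""
    m = beta1 * m + (1 - beta1) * grad        # biased estimate of the 1st moment (mean)
    v = beta2 * v + (1 - beta2) * grad ** 2   # biased estimate of the 2nd raw moment
    m_hat = m / (1 - beta1 ** t)              # bias-corrected 1st moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected 2nd moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimise f(theta) = theta^2, whose gradient is 2*theta.
theta = np.array([5.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, alpha=0.01)
print(theta)  # approaches 0
```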
Next, the authors discuss the bound on the effective stepsize, showing that in common scenarios its magnitude is upper-bounded by the learning rate α. They also define a signal-to-noise ratio (SNR), mt/√vt (using the bias-corrected estimates). If the SNR is low, the effective stepsize will be close to zero. This is desirable, as a smaller SNR means there is greater uncertainty about whether the direction of mt corresponds to the direction of the true gradient. The SNR also tends towards zero near an optimum, leading to smaller effective stepsizes in parameter space (a form of automatic annealing).
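Concretely, the effective step taken in parameter space at timestep t (ignoring ε) and the bound given in Section 2.1 of the paper can be written as

\[
\Delta_t = \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t}},
\qquad
|\Delta_t| \le
\begin{cases}
\alpha \cdot \dfrac{1-\beta_1}{\sqrt{1-\beta_2}} & \text{if } (1-\beta_1) > \sqrt{1-\beta_2},\\[4pt]
\alpha & \text{otherwise,}
\end{cases}
\]

where the first (looser) case only arises in severely sparse situations, so in common scenarios each step is roughly bounded in magnitude by the learning rate α.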
3. Initialisation Bias Correction
In this section, the authors derive the bias correction for the 2nd-moment estimate; the derivation for the 1st-moment estimate is completely analogous. Taking the expectation of vt shows that it differs from the true second moment E[gt²] by a factor of (1 − β2^t), plus a small error term ζ, so dividing vt by (1 − β2^t) removes the initialisation bias. We can take ζ = 0 if the second moment is stationary; otherwise it can be kept small, since β2 assigns exponentially small weights to gradients far in the past.
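Sketching the derivation from Section 3 of the paper:

\[
v_t = (1-\beta_2)\sum_{i=1}^{t} \beta_2^{\,t-i}\, g_i^2,
\qquad
\mathbb{E}[v_t] = \mathbb{E}[g_t^2]\cdot(1-\beta_2^t) + \zeta
\quad\Longrightarrow\quad
\hat{v}_t = \frac{v_t}{1-\beta_2^t}.
\]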
4. Convergence Analysis
In this section, the authors analyse the convergence of Adam in the online convex optimization framework and prove a regret bound that is comparable to the best known results in this setting. The full proofs are given in the appendix of the paper if you want to know more about them.
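Concretely, with the decaying stepsize αt = α/√t (and a decaying β1,t, as discussed below), the bound states that the regret grows as

\[
R(T) = \sum_{t=1}^{T}\big(f_t(\theta_t) - f_t(\theta^\ast)\big) = O(\sqrt{T}),
\qquad\text{so}\qquad
\frac{R(T)}{T} = O\!\left(\tfrac{1}{\sqrt{T}}\right) \to 0,
\]

i.e. the average regret vanishes as T grows.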
One interesting thing that they highlight is that decaying β1,t towards zero is important in their theoretical analysis and also matches previous empirical findings as it can improve convergence.
5. Related Work
The authors start with the relationship between RMSProp with momentum and Adam. RMSProp with momentum generates its parameter updates by applying momentum to the rescaled gradient, whereas Adam's updates are directly estimated from running averages of the first and second moments of the gradient. RMSProp also lacks a bias-correction term.
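A minimal sketch of the distinction, assuming the common formulation of RMSProp with momentum (the function and variable names are my own):

```python
import numpy as np

def rmsprop_momentum_step(theta, grad, cache, mom, lr=0.001, decay=0.9, mu=0.9, eps=1e-8):
    """RMSProp with momentum: momentum is applied to the already-rescaled gradient."""
    cache = decay * cache + (1 - decay) * grad ** 2       # running average of squared gradient
    mom = mu * mom + lr * grad / (np.sqrt(cache) + eps)   # momentum on the rescaled gradient
    theta = theta - mom
    return theta, cache, mom
```

Adam, by contrast, averages the raw gradient and its square separately, rescales the averaged gradient (see the sketch in Section 2), and corrects both averages for their initialisation bias.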
AdaGrad is an algorithm that works well for sparse gradients. It can be recovered as a limiting case of Adam with β1 = 0, an infinitesimal (1 − β2), and the stepsize α replaced by an annealed version αt = α·t^(−1/2); notably, this correspondence relies on the bias correction, since without it a β2 close to 1 would lead to infinitely large bias and parameter updates.
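Sketching why, using the bias-corrected second-moment estimate:

\[
\lim_{\beta_2 \to 1} \hat{v}_t = \frac{1}{t}\sum_{i=1}^{t} g_i^2,
\qquad\text{so with } \beta_1 = 0 \text{ and } \alpha_t = \alpha\, t^{-1/2}:\quad
\theta_t \leftarrow \theta_{t-1} - \alpha\, t^{-1/2}\,\frac{g_t}{\sqrt{t^{-1}\sum_{i=1}^{t} g_i^2}}
= \theta_{t-1} - \alpha\,\frac{g_t}{\sqrt{\sum_{i=1}^{t} g_i^2}},
\]

which is exactly the AdaGrad update.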
6. Experiments
To empirically evaluate the proposed method, the authors conduct a series of experiments on machine learning tasks. They also investigate the use of bias correction.
The first experiment was on a logistic regression model trained on the MNIST dataset. They compare Adam to SGD with Nesterov momentum and to AdaGrad. In the reported results, Adam converges similarly to SGD with Nesterov momentum, and both converge faster than AdaGrad.
An advantage of AdaGrad is that it also works well with sparse features. To gauge the performance of Adam on such a task, they again train a logistic regression model, this time on the IMDB reviews dataset. They observe that AdaGrad outperforms SGD with Nesterov momentum by a large margin both with and without dropout noise, while Adam converges as fast as AdaGrad.
The next experiment dealt with the training of a multilayer neural network on the MNIST dataset. They compared the effectiveness of Adam to other stochastic first-order methods on multi-layer neural networks trained with dropout noise, showcasing that Adam achieves better convergence. They also compare Adam with a newly proposed quasi-Newton method called sum-of-functions (SFO). They find that Adam makes faster progress in terms of both the number of iterations and wall-clock time.
The third experiment dealt with training a deep CNN on the CIFAR-10 dataset. Although both Adam and AdaGrad make rapid progress in lowering the cost during the initial stage of training, Adam and SGD eventually converge considerably faster than AdaGrad for CNNs. They also notice that the second-moment estimate vt vanishes to zero after a few epochs and becomes dominated by ε. Hence the second-moment estimate is a poorer approximation to the geometry of the cost function in CNNs than in a fully connected network.
The final experiment evaluates the effect of the bias-correction terms. They train a variational autoencoder with varying values of β1 and β2. Values of β2 close to 1 led to instabilities in training when no bias-correction term was present, especially during the first few epochs. The best results were achieved with small values of (1 − β2) and with bias correction; this was more apparent towards the end of optimization, when gradients tend to become sparser as hidden units specialize to specific patterns.
7. Extensions
The authors propose AdaMax, a variant of the Adam optimizer in which the L2 norm over current and past gradients (implicit in the vt update) is replaced by the infinity norm. The derivation can be followed in the paper quite easily; a sketch of the resulting update is given below.
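A minimal Python sketch of the AdaMax update, following the paper's Algorithm 2 (function and variable names are my own; α = 0.002 is the paper's suggested default):

```python
import numpy as np

def adamax_step(theta, grad, m, u, t, alpha=0.002, beta1=0.9, beta2=0.999):
    """One AdaMax update: the 2nd-moment estimate is replaced by an exponentially weighted infinity norm."""
    m = beta1 * m + (1 - beta1) * grad        # biased 1st-moment estimate, as in Adam
    u = np.maximum(beta2 * u, np.abs(grad))   # exponentially weighted infinity norm
    theta = theta - (alpha / (1 - beta1 ** t)) * m / u
    return theta, m, u
```

Note that u is not biased towards zero in the way vt is, so only the first-moment term needs the 1/(1 − β1^t) correction, which here is folded into the stepsize.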
They also propose temporal averaging over the model parameters. Since the last iterate is noisy due to the stochastic approximation, averaging the parameters over iterations yields a better estimate and can improve generalization performance.
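The averaging itself is the same exponential-moving-average-with-bias-correction trick used for the moment estimates, applied to the parameters, with some decay β close to 1:

\[
\bar{\theta}_t = \beta\,\bar{\theta}_{t-1} + (1-\beta)\,\theta_t,
\qquad
\hat{\theta}_t = \frac{\bar{\theta}_t}{1-\beta^t}.
\]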
8. Conclusion
The authors have proposed a simple and computationally efficient algorithm for gradient-based optimization of stochastic objective functions, combining the advantages of both RMSProp and AdaGrad. The experimental results confirm Adam to be a versatile optimizer for both convex and non-convex optimization tasks in machine learning.
9. Final Words
The paper introduces Adam in a coherent manner which is easy to follow. The math in section 4 (Convergence Analysis) is a bit hard to follow as I am not too familiar with such analysis. Other than that, the rest of the paper is quite intuitive and backed by lots of experimental results.
10. References
Kingma, Diederik P., and Jimmy Ba. “Adam: A method for stochastic optimization.” arXiv preprint arXiv:1412.6980 (2014).