Why Momentum Really Works

Goh, Gabriel

doi:10.23915/distill.00006

Acknowledgments

I am deeply indebted to the editorial contributions of Shan Carter and Chris Olah, without which this article would be greatly impoverished. Shan Carter provided complete redesigns of many of my original interactive widgets, a visual coherence for all the figures, and valuable optimizations to the page’s performance. Chris Olah provided impeccable editorial feedback at all levels of detail and abstraction - from the structure of the content, to the alignment of equations.

I am also grateful to Michael Nielsen for providing the title of this article, which really tied the article together. Marcos Ginestra provided editorial input for the earliest drafts of this article, and spiritual encouragement when I needed it the most. And my gratitude extends to my reviewers, Matt Hoffman and Anonymous Reviewer B, for their astute observations and criticism. I would like to thank Reviewer B, in particular, for pointing out two non-trivial errors in the original manuscript (discussion here). The contour plotting library for the hero visualization is the joint work of Ben Frederickson, Jeff Heer and Mike Bostock.

Many thanks for the numerous pull requests and issues filed on GitHub. Thanks, in particular, to Osemwaro Pedro for spotting an off-by-one error in one of the equations, and to Dan Schmidt, who did an editing pass over the whole project, correcting numerous typographical and grammatical errors.

Discussion and Review

Reviewer A - Matt Hoffman
Reviewer B - Anonymous
Discussion with User derifatives

Footnotes

It is possible, however, to construct very specific counterexamples where momentum does not converge, even on convex functions. See [4] for a counterexample.
In Tikhonov Regression we add a quadratic penalty to the regression, minimizing $\tfrac{1}{2}\|Zw-d\|^{2}+\frac{\eta}{2}\|w\|^{2}=\tfrac{1}{2}w^{T}(Z^{T}Z+\eta I)w-(Z^{T}d)^{T}w+\tfrac{1}{2}\|d\|^{2}.$ Recall that $Z^{T}Z=Q\ \text{diag}(\lambda_{1},\ldots,\lambda_{n})\ Q^T$ . The solution to Tikhonov Regression is therefore $(Z^{T}Z+\eta I)^{-1}(Z^{T}d)=Q\ \text{diag}\left(\frac{1}{\lambda_{1}+\eta},\cdots,\frac{1}{\lambda_{n}+\eta}\right)Q^T(Z^{T}d).$ We can think of regularization as a function which decays the largest eigenvalues of the inverse, $1/\lambda_i$ , as follows: $\text{Tikhonov Regularized } \lambda_i = \frac{1}{\lambda_{i}+\eta}=\frac{1}{\lambda_{i}}\left(1-\left(1+\lambda_{i}/\eta\right)^{-1}\right).$ Gradient descent can be seen as employing a similar decay, but with the decay rate $\text{ Gradient Descent Regularized } \lambda_i = \frac{1}{\lambda_i} \left( 1-\left(1-\alpha\lambda_{i}\right)^{k} \right)$ instead. Note that this decay depends on both the step-size $\alpha$ and the iteration count $k$ .
This is true as we can write the updates in matrix form as $\left(\!\!\begin{array}{cc} 1 & 0\\ \alpha & 1 \end{array}\!\!\right)\Bigg(\!\!\begin{array}{c} y_{i}^{k+1}\\ x_{i}^{k+1} \end{array}\!\!\Bigg)=\left(\!\!\begin{array}{cc} \beta & \lambda_{i}\\ 0 & 1 \end{array}\!\!\right)\left(\!\!\begin{array}{c} y_{i}^{k}\\ x_{i}^{k} \end{array}\!\!\right)$ which implies, by inverting the matrix on the left, $\Bigg(\!\!\begin{array}{c} y_{i}^{k+1}\\ x_{i}^{k+1} \end{array}\!\!\Bigg)=\left(\!\!\begin{array}{cc} \beta & \lambda_{i}\\ -\alpha\beta & 1-\alpha\lambda_{i} \end{array}\!\!\right)\left(\!\!\begin{array}{c} y_{i}^{k}\\ x_{i}^{k} \end{array}\!\!\right)=R^{k+1}\left(\!\!\begin{array}{c} y_{i}^{0}\\ x_{i}^{0} \end{array}\!\!\right),$ where $R$ denotes the $2 \times 2$ matrix multiplying $(y_{i}^{k}, x_{i}^{k})$ .
We can write out the convergence rates explicitly. The eigenvalues are $\begin{aligned} \sigma_{1} & =\frac{1}{2}\left(1-\alpha\lambda+\beta+\sqrt{(-\alpha\lambda+\beta+1)^{2}-4\beta}\right)\\[0.6em] \sigma_{2} & =\frac{1}{2}\left(1-\alpha\lambda+\beta-\sqrt{(-\alpha\lambda+\beta+1)^{2}-4\beta}\right) \end{aligned}$ When the discriminant $(-\alpha\lambda+\beta+1)^{2}-4\beta$ is less than zero, the roots are complex and the convergence rate is $\begin{aligned} |\sigma_{1}|=|\sigma_{2}| & =\tfrac{1}{2}\sqrt{(1-\alpha\lambda+\beta)^{2}+|(-\alpha\lambda+\beta+1)^{2}-4\beta|}=\sqrt{\beta} \end{aligned}$ which is, surprisingly, independent of both the step-size $\alpha$ and the eigenvalue $\lambda$ . When the roots are real, the convergence rate is $\max\{|\sigma_{1}|,|\sigma_{2}|\}=\tfrac{1}{2}\max\left\{ |1-\alpha\lambda+\beta\pm\sqrt{(1-\alpha\lambda+\beta)^{2}-4\beta}|\right\}.$
This can be derived by reducing the inequalities for all 4 + 1 cases in the explicit form of the convergence rate above.
We must optimize over $\min_{\alpha,\beta}\max\left\{ \bigg\| \! \left(\begin{array}{cc} \beta & \lambda_{1}\\ -\alpha\beta & 1-\alpha\lambda_{1} \end{array}\right) \! \bigg\|,\ldots,\bigg\| \! \left(\begin{array}{cc} \beta & \lambda_{n}\\ -\alpha\beta & 1-\alpha\lambda_{n} \end{array}\right)\! \bigg\|\right\},$ where $\|\cdot \|$ here denotes the magnitude of the maximum eigenvalue. The optimum occurs when the roots of the characteristic polynomial are repeated for the matrices corresponding to the extremal eigenvalues, $\lambda_{1}$ and $\lambda_{n}$ .
The above optimization problem is bounded from below by $0$ , and the vector of all $1$ ’s achieves this bound.
This can be written explicitly as $[L_{G}]_{ij}=\begin{cases} \text{degree of vertex }i & i=j\\ -1 & i\neq j,(i,j)\text{ or }(j,i)\in E\\ 0 & \text{otherwise} \end{cases}$
We use the infinity norm to measure our error; similar results can be derived for the 1 and 2 norms.
The momentum iterations are $\begin{aligned} z^{k+1}&=\beta z^{k}+ A w^{k} + \text{error}(w^k) \\[0.4em] w^{k+1}&=w^{k}-\alpha z^{k+1} \end{aligned}$ which, after a change of variables, become $\left(\!\!\begin{array}{cc} 1 & 0\\ \alpha & 1 \end{array}\!\!\right)\Bigg(\!\!\begin{array}{c} y_{i}^{k+1}\\ x_{i}^{k+1} \end{array}\!\!\Bigg)=\left(\!\!\begin{array}{cc} \beta & \lambda_{i}\\ 0 & 1 \end{array}\!\!\right)\left(\!\!\begin{array}{c} y_{i}^{k}\\ x_{i}^{k} \end{array}\!\!\right)+\left(\!\!\begin{array}{c} \epsilon_{i}^{k}\\ 0 \end{array}\!\!\right).$ Inverting the $2 \times 2$ matrix on the left and applying the formula recursively yields the final solution.
On the 1D function $f(x)=\frac{\lambda}{2}x^{2}$ , the objective value is $\begin{aligned} \mathbf{E}f(x^{k})&=\frac{\lambda}{2}\mathbf{E}[(x^{k})^{2}]\\&=\frac{\lambda}{2}\mathbf{E}\left(e_{2}^{T}R^{k}\left(\begin{array}{c} y^{0}\\ x^{0} \end{array}\right)+e_{2}^{T}\sum_{i=1}^{k}\epsilon^{i-1}R^{k-i}\left(\begin{array}{c} 1\\ -\alpha \end{array}\right)\right)^{2}\\&=\frac{\lambda}{2}\left(e_{2}^{T}R^{k}\left(\begin{array}{c} y^{0}\\ x^{0} \end{array}\right)\right)^{2}+\frac{\lambda}{2}\mathbf{E}\left(e_{2}^{T}\sum_{i=1}^{k}\epsilon^{i-1}R^{k-i}\left(\begin{array}{c} 1\\ -\alpha \end{array}\right)\right)^{2}\\&=\frac{\lambda}{2}\left(e_{2}^{T}R^{k}\left(\begin{array}{c} y^{0}\\ x^{0} \end{array}\right)\right)^{2}+\frac{\lambda}{2}\mathbf{E}[(\epsilon^{k})^{2}]\,\cdot\,\sum_{i=1}^{k}\left(e_{2}^{T}R^{k-i}\left(\begin{array}{c} 1\\ -\alpha \end{array}\right)\right)^{2}\\&=\frac{\lambda}{2}\left(e_{2}^{T}R^{k}\left(\begin{array}{c} y^{0}\\ x^{0} \end{array}\right)\right)^{2}+\frac{\lambda\mathbf{E}[(\epsilon^{k})^{2}]}{2}\cdot\sum_{i=1}^{k}\gamma_{i}^{2}, \qquad \gamma_i = e_{2}^{T}R^{k-i}\left(\begin{array}{c} 1\\ -\alpha \end{array}\right) \end{aligned}$ The third equality uses the fact that $\mathbf{E} \epsilon^k = 0$ , so the cross term vanishes, and the fourth uses the fact that the errors are uncorrelated and share a common second moment $\mathbf{E}[(\epsilon^{k})^{2}]$ .
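The matrix and eigenvalue footnotes above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function names and the values of `alpha`, `beta` and `lam` are illustrative choices, not taken from the article. It builds $R$ by inverting the left-hand matrix of the update, and compares the closed-form eigenvalues with a numerical eigendecomposition. Since $\det R=\beta$ , a complex-conjugate pair of eigenvalues must have modulus $\sqrt{\beta}$ , which the sketch confirms for one parameter setting.

```python
import cmath

import numpy as np


def momentum_matrix(alpha, beta, lam):
    """Return R = left^{-1} @ right for the 2x2 momentum update on one eigen-component."""
    left = np.array([[1.0, 0.0], [alpha, 1.0]])
    right = np.array([[beta, lam], [0.0, 1.0]])
    # Solving left @ R = right is the same as inverting the matrix on the left.
    return np.linalg.solve(left, right)


def closed_form_eigenvalues(alpha, beta, lam):
    """Eigenvalues of R from the quadratic formula (trace = 1 - alpha*lam + beta, det = beta)."""
    trace = 1.0 - alpha * lam + beta
    root = cmath.sqrt(trace * trace - 4.0 * beta)  # complex when the discriminant is negative
    return 0.5 * (trace + root), 0.5 * (trace - root)


def convergence_rate(alpha, beta, lam):
    """Magnitude of the largest eigenvalue of R."""
    return max(abs(s) for s in closed_form_eigenvalues(alpha, beta, lam))


# Illustrative (assumed) parameters: this choice gives a negative discriminant.
alpha, beta, lam = 0.1, 0.81, 1.0
R = momentum_matrix(alpha, beta, lam)
numeric_rate = max(abs(np.linalg.eigvals(R)))

print("R =", R)
print("closed-form rate:", convergence_rate(alpha, beta, lam))
print("numeric rate:    ", numeric_rate)
print("sqrt(beta):      ", np.sqrt(beta))
```

With a much smaller `beta` (for example `beta = 0.01` at the same `alpha` and `lam`), the discriminant becomes positive, the roots are real, and the rate is no longer $\sqrt{\beta}$ ; `convergence_rate` handles both regimes because `cmath.sqrt` accepts negative arguments.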

References

[1] On the importance of initialization and momentum in deep learning.
Sutskever, I., Martens, J., Dahl, G.E. and Hinton, G.E., 2013. ICML (3), Vol 28, pp. 1139–1147.
[2] Some methods of speeding up the convergence of iteration methods.
Polyak, B.T., 1964. USSR Computational Mathematics and Mathematical Physics, Vol 4(5), pp. 1–17. Elsevier. DOI: 10.1016/0041-5553(64)90137-5
[3] Theory of gradient methods.
Rutishauser, H., 1959. Refined iterative methods for computation of the solution and the eigenvalues of self-adjoint boundary value problems, pp. 24–49. Springer. DOI: 10.1007/978-3-0348-7224-9_2
[4] Analysis and design of optimization algorithms via integral quadratic constraints.
Lessard, L., Recht, B. and Packard, A., 2016. SIAM Journal on Optimization, Vol 26(1), pp. 57–95. SIAM.
[5] Introductory lectures on convex optimization: A basic course.
Nesterov, Y., 2013. Vol 87. Springer Science & Business Media. DOI: 10.1007/978-1-4419-8853-9
[6] Natural gradient works efficiently in learning.
Amari, S., 1998. Neural Computation, Vol 10(2), pp. 251–276. MIT Press. DOI: 10.1162/089976698300017746
[7] Deep Learning, NIPS 2015 Tutorial.
Hinton, G., Bengio, Y. and LeCun, Y., 2015.
[8] Adaptive restart for accelerated gradient schemes.
O’Donoghue, B. and Candes, E., 2015. Foundations of Computational Mathematics, Vol 15(3), pp. 715–732. Springer. DOI: 10.1007/s10208-013-9150-3
[9] The Nth Power of a 2x2 Matrix.
Williams, K., 1992. Mathematics Magazine, Vol 65(5), pp. 336. MAA. DOI: 10.2307/2691246
[10] From Averaging to Acceleration, There is Only a Step-size.
Flammarion, N. and Bach, F.R., 2015. COLT, pp. 658–695.
[11] On the momentum term in gradient descent learning algorithms.
Qian, N., 1999. Neural Networks, Vol 12(1), pp. 145–151. Elsevier. DOI: 10.1016/s0893-6080(98)00116-6
[12] Understanding deep learning requires rethinking generalization.
Zhang, C., Bengio, S., Hardt, M., Recht, B. and Vinyals, O., 2016. arXiv preprint arXiv:1611.03530.
[13] A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights.
Su, W., Boyd, S. and Candes, E., 2014. Advances in Neural Information Processing Systems, pp. 2510–2518.
[14] The Zen of Gradient Descent.
Hardt, M., 2013.
[15] A geometric alternative to Nesterov’s accelerated gradient descent.
Bubeck, S., Lee, Y.T. and Singh, M., 2015. arXiv preprint arXiv:1506.08187.
[16] An optimal first order method based on optimal quadratic averaging.
Drusvyatskiy, D., Fazel, M. and Roy, S., 2016. arXiv preprint arXiv:1604.06543.
[17] Linear coupling: An ultimate unification of gradient and mirror descent.
Allen-Zhu, Z. and Orecchia, L., 2014. arXiv preprint arXiv:1407.1537.
[18] Accelerating the cubic regularization of Newton’s method on convex problems.
Nesterov, Y., 2008. Mathematical Programming, Vol 112(1), pp. 159–181. Springer. DOI: 10.1007/s10107-006-0089-x

Updates and Corrections

View all changes to this article since it was first published. If you see a mistake or want to suggest a change, please create an issue on GitHub.

Citations and Reuse

Diagrams and text are licensed under Creative Commons Attribution CC-BY 2.0, unless noted otherwise, with the source available on GitHub. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: “Figure from …”.

For attribution in academic contexts, please cite this work as

Goh, "Why Momentum Really Works", Distill, 2017. http://doi.org/10.23915/distill.00006

BibTeX citation

@article{goh2017why,
  author = {Goh, Gabriel},
  title = {Why Momentum Really Works},
  journal = {Distill},
  year = {2017},
  url = {http://distill.pub/2017/momentum},
  doi = {10.23915/distill.00006}
}
