On the Hyperparameters in Stochastic Gradient Descent with Momentum

Bin Shi; 25(236):1−40, 2024.

Abstract

Following the same approach as Shi et al. (2023), we continue the theoretical analysis of stochastic gradient descent with momentum (SGD with momentum). In contrast to that work, we demonstrate that for SGD with momentum the two hyperparameters together, the learning rate and the momentum coefficient, play a significant role in the linear convergence rate for non-convex optimization. Our analysis is based on a hyperparameter-dependent stochastic differential equation (hp-dependent SDE) that serves as a continuous surrogate for SGD with momentum. As in the SGD case, we establish linear convergence for the continuous-time formulation of SGD with momentum and obtain an explicit expression for the optimal linear rate by analyzing the spectrum of the Kramers-Fokker-Planck operator. By comparison with SGD, whose optimal linear rate and final gap depend only on the learning rate, we demonstrate how these quantities vary as the momentum coefficient increases from zero to one once momentum is introduced. We then propose a mathematical interpretation of why, in practice, SGD with momentum converges faster and is more robust with respect to the learning rate than standard stochastic gradient descent (SGD). Finally, we show that in the presence of noise, Nesterov momentum has no essential difference from traditional momentum.
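As an illustrative reference for the discrete algorithm discussed in the abstract, the sketch below implements heavy-ball SGD with momentum and its Nesterov variant under artificially noisy gradients on a toy quadratic. The step size, momentum coefficient, noise scale, and objective are assumptions chosen for illustration and are not taken from the paper; the paper's analysis works with a continuous-time hp-dependent SDE surrogate rather than this discrete simulation.

# Minimal sketch (not the paper's formulation): SGD with heavy-ball momentum
# versus Nesterov momentum under noisy gradients, on a toy quadratic objective.
# The learning rate `lr`, momentum coefficient `beta`, noise scale `sigma`, and
# the quadratic f(x) = 0.5 * x^T A x are illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(0)
A = np.diag([1.0, 10.0])          # toy ill-conditioned quadratic
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

def noisy_grad(x, sigma=0.5):
    """Stochastic gradient: true gradient plus Gaussian noise."""
    return grad(x) + sigma * rng.standard_normal(x.shape)

def sgd_momentum(x0, lr=0.05, beta=0.9, steps=500, nesterov=False):
    """Heavy-ball SGD with momentum; set nesterov=True for the look-ahead variant."""
    x, v = x0.copy(), np.zeros_like(x0)
    values = []
    for _ in range(steps):
        lookahead = x + beta * v if nesterov else x   # Nesterov evaluates the gradient ahead
        v = beta * v - lr * noisy_grad(lookahead)
        x = x + v
        values.append(f(x))
    return x, values

x0 = np.array([5.0, 5.0])
_, hb = sgd_momentum(x0)                    # classical (heavy-ball) momentum
_, nag = sgd_momentum(x0, nesterov=True)    # Nesterov momentum
print(f"final f, heavy-ball: {np.mean(hb[-50:]):.4f}")
print(f"final f, Nesterov:   {np.mean(nag[-50:]):.4f}")

Varying lr and beta in this sketch is one informal way to observe the learning-rate robustness of momentum methods that the abstract refers to.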

© JMLR 2024.