Short for *Adaptive Gradient*, the AdaGrad Optimizer speeds up the learning of parameters that do not change often and slows down the learning of parameters that change frequently. Due to AdaGrad's infinitely decaying step size, training may become slow or fail to converge, especially when starting from a low learning rate.
## Mathematical formulation
Per step (element-wise), AdaGrad accumulates the sum of squared gradients and scales the update by the root of this sum:
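One standard way to write this, with gradient $g_t$, accumulator $r_t$, learning rate $\eta$, and a small stability constant $\epsilon$ (notation assumed here, not defined on this page):

$$
r_t = r_{t-1} + g_t^2, \qquad
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{r_t} + \epsilon}\, g_t
$$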
Short for *Adaptive Moment Estimation*, the Adam Optimizer combines both Momentum and RMS properties. In addition to storing an exponentially decaying average of past squared gradients like [RMSprop](rms-prop.md), Adam also keeps an exponentially decaying average of past gradients, similar to [Momentum](momentum.md). Whereas Momentum can be seen as a ball running down a slope, Adam behaves like a heavy ball with friction.
## Mathematical formulation
Per step (element-wise), Adam maintains exponentially decaying moving averages of the gradient and its element-wise square and uses them to scale the update:
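In the usual notation (assumed here), $m_t$ and $v_t$ are the first- and second-moment averages with decay rates $\beta_1, \beta_2$, $\eta$ is the learning rate, and $\epsilon$ a small stability constant:

$$
m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2
$$

$$
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad
\hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \qquad
\theta_{t+1} = \theta_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
$$

As a compact illustration of the ordering of the moment updates and bias correction (a sketch only, not this repository's implementation; the common default hyper-parameter values are assumed):

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters theta given gradient g (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * g        # decaying average of gradients
    v = beta2 * v + (1 - beta2) * g ** 2   # decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)           # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)           # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```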
A version of the [Adam](adam.md) optimizer that replaces the RMS property with the infinity norm of the past gradients. As such, AdaMax is generally more suitable for sparse parameter updates and noisy gradients.
## Mathematical formulation
Per step (element-wise), AdaMax maintains an exponentially decaying moving average of the gradient (velocity) and an infinity-norm accumulator of past gradients, and uses them to scale the update:
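In the usual notation (assumed here), with velocity $m_t$, infinity-norm accumulator $u_t$, decay rates $\beta_1, \beta_2$, and learning rate $\eta$:

$$
m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad
u_t = \max\left(\beta_2 u_{t-1},\, |g_t|\right), \qquad
\theta_{t+1} = \theta_t - \frac{\eta}{1 - \beta_1^t} \cdot \frac{m_t}{u_t}
$$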
The Cyclical optimizer uses a global learning rate that cycles between a lower and an upper bound over a designated period while also decaying the upper bound by a factor at each step. Cyclical learning rates have been shown to help escape bad local minima and saddle points of the loss surface.
## Mathematical formulation
Per step (element-wise), the cyclical learning rate and update are computed as:
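One common triangular form (a sketch; the exact schedule and symbols are assumptions, not taken from this page), with half-period $T$ steps, lower bound $\eta_{\min}$, upper bound $\eta_{\max}$, and per-step decay factor $\gamma$:

$$
x_t = \left| \frac{t \bmod 2T}{T} - 1 \right|, \qquad
\eta_t = \eta_{\min} + \left( \gamma^{t}\, \eta_{\max} - \eta_{\min} \right)\,(1 - x_t), \qquad
\theta_{t+1} = \theta_t - \eta_t\, g_t
$$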
Momentum accelerates each update step by accumulating velocity from past updates and adding a factor of the previous velocity to the current step. Momentum can help speed up training and escape bad local minima when compared with [Stochastic](stochastic.md) Gradient Descent.
## Mathematical formulation
Per step (element-wise), Momentum updates the velocity and applies it as the parameter step:
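With velocity $v_t$, momentum factor $\mu$, and learning rate $\eta$ (notation assumed here), one standard form is:

$$
v_t = \mu\, v_{t-1} + \eta\, g_t, \qquad
\theta_{t+1} = \theta_t - v_t
$$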
An adaptive gradient technique that divides the current gradient by a rolling average of the magnitudes of recent gradients. Unlike [AdaGrad](adagrad.md), RMS Prop does not suffer from an infinitely decaying step size.
## Mathematical formulation
Per step (element-wise), RMSProp maintains a running average of squared gradients and scales the step by the root-mean-square:
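With running average $r_t$, decay rate $\rho$, learning rate $\eta$, and small stability constant $\epsilon$ (notation assumed here):

$$
r_t = \rho\, r_{t-1} + (1 - \rho)\, g_t^2, \qquad
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{r_t} + \epsilon}\, g_t
$$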
A learning rate decay optimizer that reduces the global learning rate by a factor each time training reaches a new *floor*. The number of steps needed to reach a new floor is defined by the *steps* hyper-parameter.
## Mathematical formulation
Per step (element-wise), the Step Decay learning rate and update are:
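With initial learning rate $\eta_0$, decay factor $d$, and floor length $s$ (the *steps* hyper-parameter; notation assumed here):

$$
\eta_t = \eta_0 \cdot d^{\left\lfloor t / s \right\rfloor}, \qquad
\theta_{t+1} = \theta_t - \eta_t\, g_t
$$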