- Title: Adam: A Method for Stochastic Optimization
- Authors: Diederik Kingma, Jimmy Ba
- Link: https://arxiv.org/abs/1412.6980
- Tags: Neural Network, optimizer
- Year: 2015
-
What
- They suggest a new stochastic optimization method, similar to the existing SGD, Adagrad or RMSProp.
- Stochastic optimization methods have to find parameters that minimize/maximize a stochastic function.
- A function is stochastic (non-deterministic), if the same set of parameters can generate different results. E.g. the loss of different mini-batches can differ, even when the parameters remain unchanged. Even for the same mini-batch the results can change due to e.g. dropout.
- Their method tends to converge faster to optimal parameters than the existing competitors.
- Their method can deal with non-stationary distributions (similar to e.g. SGD, Adadelta, RMSProp).
- Their method can deal with very sparse or noisy gradients (similar to e.g. Adagrad).
-
How
- Basic principle
  - Standard SGD just updates the parameters based on `parameters = parameters - learningRate * gradient`.
  - Adam operates similar to that, but adds more "cleverness" to the rule.
  - It assumes that the gradient values have means and variances and tries to estimate these values.
    - Recall here that the function to optimize is stochastic, so there is some randomness in the gradients.
    - The mean is also called "the first moment".
    - The variance is also called "the second (raw) moment".
  - An update rule very similar to SGD would then be `parameters = parameters - learningRate * means`.
  - They instead use the update rule `parameters = parameters - learningRate * means/sqrt(variances)`.
  - They call `means/sqrt(variances)` a "Signal to Noise Ratio".
  - Basically, if the variance of a specific parameter's gradient is high, it is pretty unclear how it should be changed. So we choose a small step size in the update rule via `learningRate * mean/sqrt(highValue)`.
  - If the variance is low, it is easier to predict how far to "move", so we choose a larger step size via `learningRate * mean/sqrt(lowValue)`.
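The "signal to noise ratio" idea above can be sketched in a few lines. This is only the core update rule, not yet the full Adam algorithm (the moment estimation and bias correction follow below); all names and the example numbers are illustrative, not from the paper.

```python
import numpy as np

def snr_update(params, mean, variance, learning_rate=0.001, eps=1e-8):
    """One update step: the effective step size shrinks when the
    estimated gradient variance is high (noisy gradient direction)."""
    # eps avoids division by zero for parameters with near-zero variance
    return params - learning_rate * mean / (np.sqrt(variance) + eps)

params = np.array([1.0, 1.0])
mean = np.array([0.5, 0.5])          # same estimated gradient mean for both
variance = np.array([0.01, 100.0])   # low vs. high gradient variance
updated = snr_update(params, mean, variance)
# The parameter with the noisier gradient is moved much less.
```

Note how both parameters share the same gradient mean, yet the high-variance one barely moves: the step size is scaled per parameter.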
- Exponential moving averages
  - In order to approximate the mean and variance values you could simply save the last `T` gradients and then average them.
  - That however is a pretty bad idea, because it can lead to high memory demands (e.g. for millions of parameters in CNNs).
  - A simple average also has the disadvantage that it would completely ignore all gradients before `T` and weight all of the last `T` gradients identically. In reality, you might want to give more weight to the most recent gradients.
  - Instead, they use an exponential moving average, which fixes both problems and simply updates the average at every timestep via the formula `avg = alpha * avg + (1 - alpha) * newValue`.
  - Let the gradient at timestep (batch) `t` be `g`, then we can approximate the mean and variance values using:
    - `mean = beta1 * mean + (1 - beta1) * g`
    - `variance = beta2 * variance + (1 - beta2) * g^2`
  - `beta1` and `beta2` are hyperparameters of the algorithm. Good values for them seem to be `beta1=0.9` and `beta2=0.999`.
  - At the start of the algorithm, `mean` and `variance` are initialized to zero-vectors.
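The two moving-average updates above can be sketched as follows, using the paper's default `beta1=0.9` and `beta2=0.999`; the gradient values are made up for illustration.

```python
import numpy as np

beta1, beta2 = 0.9, 0.999
mean = np.zeros(2)       # first-moment estimate, initialized to zeros
variance = np.zeros(2)   # second-moment estimate, initialized to zeros

# Two illustrative gradient vectors, as if from two mini-batches.
for g in [np.array([1.0, -2.0]), np.array([0.8, -1.5])]:
    mean = beta1 * mean + (1 - beta1) * g
    variance = beta2 * variance + (1 - beta2) * g**2
```

Only two vectors are kept in memory, regardless of how many gradients have been seen, and older gradients decay geometrically instead of being weighted equally.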
- Bias correction
  - Initializing the `mean` and `variance` vectors to zero is an easy and logical step, but has the disadvantage that bias is introduced.
  - E.g. at the first timestep, the mean of the gradient would be `mean = beta1 * 0 + (1 - beta1) * g`; with `beta1=0.9` that is `mean = 0.1 * g`. So `0.1*g`, not `g`. Both the mean and the variance are biased (towards 0).
  - This seems pretty harmless, but it can be shown that it lowers the convergence speed of the algorithm by quite a bit.
  - So to fix this they perform bias-corrections of the mean and the variance:
    - `correctedMean = mean / (1 - beta1^t)` (where `t` is the timestep)
    - `correctedVariance = variance / (1 - beta2^t)`
  - Both formulas are applied at every timestep after the exponential moving averages (they do not influence the next timestep).
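Putting the pieces together, a complete Adam step looks like the sketch below. It uses the paper's default hyperparameters; the quadratic test function and the function/variable names are illustrative choices, not from the paper.

```python
import numpy as np

def adam_minimize(grad_fn, params, steps=500, lr=0.1,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize a function given its gradient, via Adam's update rule."""
    mean = np.zeros_like(params)
    variance = np.zeros_like(params)
    for t in range(1, steps + 1):
        g = grad_fn(params)
        # exponential moving averages of the gradient and its square
        mean = beta1 * mean + (1 - beta1) * g
        variance = beta2 * variance + (1 - beta2) * g**2
        # bias corrections (counteract the zero initialization)
        corrected_mean = mean / (1 - beta1**t)
        corrected_variance = variance / (1 - beta2**t)
        # signal-to-noise-ratio scaled update
        params = params - lr * corrected_mean / (np.sqrt(corrected_variance) + eps)
    return params

# Minimize f(x) = sum(x^2); its gradient is 2x; the optimum is the zero vector.
result = adam_minimize(lambda x: 2 * x, np.array([3.0, -2.0]))
```

At t=1 the corrections divide by `1 - beta1` and `1 - beta2`, exactly undoing the down-scaling from the zero initialization; as `t` grows, the divisors approach 1 and the corrections fade out.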
