|
1823 | 1823 | "#### Adaptive Gradient Algorithm (AdaGrad) ####\n", |
1824 | 1824 | "The first extension to gradient descent motivated by adaptive learning rates is <b>AdaGrad</b>. AdaGrad implements parameter-specific learning rates that are updated at each step based on the local gradient with respect to each parameter. We still have a base learning rate $\\alpha$, but parameters with large gradients scale this rate to be smaller, and parameters with small gradients scale it to be larger. The form of the update is:\n", |
1825 | 1825 | "\n", |
1826 | | - "$$ \\mathbf{A}_\\mathrm{i+1} = \\mathbf{A}_\\mathrm{i} + \\mathrm{diag}\\left[\\boldsymbol{\\nabla}_{\\beta}f\\left(\\beta_\\mathrm{i}\\right)\\right]^2 $$ \n", |
| 1826 | + "$$ \\mathbf{A}_{i+1} = \\mathbf{A}_i + \\left(\\nabla_{\\beta} f\\left(\\boldsymbol{\\beta}_i\\right)\\right)\\odot\\left(\\nabla_{\\beta} f\\left(\\boldsymbol{\\beta}_i\\right)\\right) $$\n", |
1827 | 1827 | "\n", |
1828 | | - "$$ \\boldsymbol{\\beta}_\\mathrm{i+1} = \\boldsymbol{\\beta}_\\mathrm{i} - \\frac{\\mathbf{\\alpha}}{\\sqrt{\\mathbf{A}_\\mathrm{i+1}}+\\mathrm{diag}\\left(\\varepsilon\\right)} \\boldsymbol{\\nabla}_{\\beta}f\\left(\\boldsymbol{\\beta}_\\mathrm{i}\\right)$$\n", |
| 1828 | + "$$ \\boldsymbol{\\beta}_{i+1} = \\boldsymbol{\\beta}_i - \\alpha \\frac{\\nabla_{\\beta} f\\left(\\boldsymbol{\\beta}_i\\right)}{\\sqrt{\\mathbf{A}_{i+1}} + \\varepsilon} $$\n", |
1829 | 1829 | "\n", |
1830 | | - "Here, $\\mathbf{A}$ is the quantity that governs the update to the effective learning rate and $\\alpha$ is a base learning rate. $\\mathrm{diag}()$ refers to a diagonal matrix with zeroes for the off-diagonal elements. Thus $\\mathrm{diag}\\left[\\mathbf{\\nabla}_{\\beta}f\\left(\\boldsymbol{\\beta}_\\mathrm{i}\\right)\\right]^2$ has the square of each partial derivative along the diagonal of the matrix. We use the square because we care about the magnitude of the gradient, not its sign. In the expression for $\\boldsymbol{\\beta}$, $\\mathbf{A}$ appears in the denominator, and we take the square root. Dividing by a matrix isn't valid linear algebra, but here the notation is used to indicate the dot product between the inverse of the diagonal matrix in the denominator with the $\\mathbf{\\nabla}_{\\beta}f\\left(\\boldsymbol{\\beta}_\\mathrm{i}\\right)$ vector. Finally, $\\varepsilon$ is typically a small fixed value (e.g., 1E-4) and the $\\mathrm{diag}\\left(\\varepsilon\\right)$ matrix is used with to avoid division by zero if elements in $\\mathbf{A}$ are too small. \n", |
| 1830 | + "Here, $\\mathbf{A}$ is a running sum of squared gradients (element-wise) and $\\alpha$ is a base learning rate. We square the gradient because we care about its magnitude, not its sign. The $\\varepsilon$ term (e.g., $10^{-4}$) prevents division by zero when elements of $\\mathbf{A}$ are very small. All operations are element-wise, so you can think of $\\mathbf{A}$ as a vector.\n", |
1831 | 1831 | "\n", |
1832 | | - "A lot of text went into explaining the AdaGrad equations, but the modification is actually very intuitive. At each step, the effective learning rate for each parameter is the base learning rate divided by the (cumulative) magnitude of its partial derivative. For parameters with small gradients, this increases the effective learning rate, for parameters with large gradients, this decreases the effective learning rate. That's really all there is to it, the $\\varepsilon$ thing is just to deal with pathological cases. Likewise, we use diagonal matrices, because the square root can be easily calculated. However, \"full-matrix\" extensions to adagrad are actively researched, but won't be covered here. Since we are only using the diagonal of the $\\mathbf{A}$ here, you could also just think of $\\mathbf{A}$ as being a vector and the division as being element-wise, but I have used the $\\mathrm{diag}(\\mathbf{A})$ form to imply the generalization. \n", |
| 1832 | + "The modification is actually very intuitive. At each step, the effective learning rate for each parameter is $\\alpha /(\\sqrt{\\mathbf{A}}+\\varepsilon)$. For parameters with small gradients, $\\mathbf{A}$ grows slowly and the effective learning rate stays large; for parameters with large gradients, $\\mathbf{A}$ grows quickly and the effective learning rate shrinks. The $\\varepsilon$ term just avoids division by zero.\n", |
1833 | 1833 | "\n", |
1834 | 1834 | "If you are still paying attention, you can also see a potential problem with AdaGrad: the matrix $\\mathbf{A}$ just keeps getting bigger and bigger (and correspondingly, the effective learning rate monotonically decreases). For this reason AdaGrad is prone to stopping too early, and we'll address that with our next adaptive learning rate algorithm (RMSProp). \n", |
1835 | 1835 | "\n", |
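The AdaGrad update above can be sketched in a few lines of NumPy. This is a minimal illustration, not from the text: the quadratic objective $f(\beta) = \beta_0^2 + 10\beta_1^2$ and its gradient are hypothetical stand-ins chosen to make the per-parameter scaling visible.

```python
import numpy as np

# Hypothetical quadratic objective f(beta) = beta[0]**2 + 10*beta[1]**2,
# used only as a stand-in; its analytic gradient is:
def grad(beta):
    return np.array([2.0 * beta[0], 20.0 * beta[1]])

alpha, eps = 0.5, 1e-4       # base learning rate and small stabilizer
beta = np.array([5.0, 5.0])  # starting parameters
A = np.zeros_like(beta)      # running sum of squared gradients

for _ in range(500):
    g = grad(beta)
    A = A + g * g                                 # element-wise accumulation
    beta = beta - alpha * g / (np.sqrt(A) + eps)  # per-parameter effective rate
```

Note that `A` only ever grows, so the effective learning rate $\alpha/(\sqrt{\mathbf{A}}+\varepsilon)$ shrinks monotonically over the run, which is exactly the early-stopping issue noted above.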
|
1971 | 1971 | "\n", |
1972 | 1972 | "(sec-gbo-rmsprop)=\n", |
1973 | 1973 | "#### Root Mean Square Propagation (RMSProp) ####\n", |
1974 | | - "While experimenting with AdaGrad, you may have noticed that despite the more direct path towards the minimum the optimization took many epochs to converge for typical base learning rates. The expression for `A` used by AdaGrad is the cumulative sum of the gradients, so `A` monotonically increases resulting in very small effective learning rates towards the end of the optimization. The idea behind RMSProp is that instead of calculating `A` based on all of the previous updates, we instead use only a fraction of the previous `A` in calculating the update (similar to momentum). The equations governing RMSProp are:\n", |
 | 1974 | + "While experimenting with AdaGrad, you may have noticed that, despite the more direct path towards the minimum, the optimization took many epochs to converge for typical base learning rates. The expression for `A` used by AdaGrad is the cumulative sum of the squared gradients, so `A` monotonically increases, resulting in very small effective learning rates towards the end of the optimization. The idea behind RMSProp is that instead of calculating `A` based on all of the previous updates, we instead use only a fraction of the previous `A` in calculating the update (similar to momentum). The equations governing RMSProp are (element-wise):\n", |
1975 | 1975 | "\n", |
1976 | | - "$$ \\mathbf{A}_\\mathrm{i+1} = \\eta\\mathbf{A}_\\mathrm{i} + \\left(1-\\eta\\right) \\mathrm{diag}\\left[\\boldsymbol{\\nabla}_{\\beta}f\\left(\\boldsymbol{\\beta}_\\mathrm{i}\\right)\\right]^2 $$ \n", |
| 1976 | + "$$ \\mathbf{A}_{i+1} = \\eta \\mathbf{A}_i + \\left(1-\\eta\\right) \\left(\\nabla_{\\beta} f\\left(\\boldsymbol{\\beta}_i\\right)\\right)\\odot\\left(\\nabla_{\\beta} f\\left(\\boldsymbol{\\beta}_i\\right)\\right) $$\n", |
1977 | 1977 | "\n", |
1978 | | - "$$ \\boldsymbol{\\beta}_\\mathrm{i+1} = \\boldsymbol{\\beta}_\\mathrm{i} - \\frac{\\mathbf{\\alpha}}{\\sqrt{\\mathbf{A}_\\mathrm{i+1}}+\\mathrm{diag}\\left(\\varepsilon\\right)} \\boldsymbol{\\nabla}_{\\beta}f\\left(\\boldsymbol{\\beta}_\\mathrm{i}\\right)$$\n", |
| 1978 | + "$$ \\boldsymbol{\\beta}_{i+1} = \\boldsymbol{\\beta}_i - \\alpha \\frac{\\nabla_{\\beta} f\\left(\\boldsymbol{\\beta}_i\\right)}{\\sqrt{\\mathbf{A}_{i+1}} + \\varepsilon} $$\n", |
1979 | 1979 | "\n", |
1980 | 1980 | "Here, $\\eta$ is a new hyperparameter that determines what fraction of the old `A` to mix in with the current (squared) gradient (typically, $\\eta=0.9,0.99,0.999$). The expression for the $\\boldsymbol{\\beta}$ update is unchanged with respect to AdaGrad. \n", |
1981 | 1981 | "\n", |
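The RMSProp loop differs from AdaGrad by a single line: the squared gradient is mixed into a decaying average instead of an ever-growing sum. Here is a minimal NumPy sketch; the quadratic objective and its gradient are hypothetical stand-ins for illustration only.

```python
import numpy as np

# Hypothetical quadratic objective f(beta) = beta[0]**2 + 10*beta[1]**2
# (a stand-in for illustration); its analytic gradient:
def grad(beta):
    return np.array([2.0 * beta[0], 20.0 * beta[1]])

alpha, eta, eps = 0.1, 0.9, 1e-4  # base rate, decay fraction, stabilizer
beta = np.array([5.0, 5.0])
A = np.zeros_like(beta)           # decaying average of squared gradients

for _ in range(500):
    g = grad(beta)
    A = eta * A + (1.0 - eta) * g * g             # keep only a fraction of old A
    beta = beta - alpha * g / (np.sqrt(A) + eps)  # same update as AdaGrad
```

Because old contributions to `A` decay geometrically, the effective learning rate can recover late in the optimization rather than shrinking monotonically.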
|