diff --git a/chapter_optimization/sgd.md b/chapter_optimization/sgd.md
index f076516b7..62960df96 100644
--- a/chapter_optimization/sgd.md
+++ b/chapter_optimization/sgd.md
@@ -60,6 +60,20 @@ $$\mathbb{E}_i \nabla f_i(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n \nabla f_i(\m
 
 This means that, on average, the stochastic gradient is a good estimate of the gradient.
 
+Modern analyses of SGD trace back to the **Robbins–Monro** stochastic approximation framework :cite:`robbins1951stochastic`. In that view, SGD seeks a root of the gradient mapping using a decreasing step size sequence that satisfies
+
+$$
+\sum_{t=0}^\infty \eta_t = \infty \quad \text{and} \quad \sum_{t=0}^\infty \eta_t^2 < \infty,
+$$
+
+which ensures convergence under mild regularity conditions. A standard assumption in contemporary proofs is **bounded variance** of the stochastic gradient noise:
+
+$$
+\mathbb{E}\!\left[\|\nabla f_{i_t}(\mathbf{x}_t) - \nabla f(\mathbf{x}_t)\|^2 \mid \mathbf{x}_t\right] \le \sigma^2,
+$$
+
+together with a lower-bounded objective and smoothness (Lipschitz-continuous gradients). Under these assumptions, one obtains the classical $\mathcal{O}(1/\sqrt{T})$ rate for convex problems with an appropriately decayed $\{\eta_t\}$, while constant step size variants converge to a noise-dominated neighborhood of the optimum. See :cite:`bottou2018optimization` for a tutorial-style survey.
+
 Now, we will compare it with gradient descent by adding random noise with a mean of 0 and a variance of 1 to the gradient to simulate a stochastic gradient descent.
 
 ```{.python .input}
diff --git a/d2l.bib b/d2l.bib
index 16ad6ce96..dd5bccf81 100644
--- a/d2l.bib
+++ b/d2l.bib
@@ -4510,3 +4510,25 @@ @article{penedo2023refinedweb
   journal={Ar{X}iv:2306.01116},
   year={2023}
 }
+
+@article{robbins1951stochastic,
+  author  = {Robbins, Herbert and Monro, Sutton},
+  title   = {A Stochastic Approximation Method},
+  journal = {The Annals of Mathematical Statistics},
+  year    = {1951},
+  volume  = {22},
+  number  = {3},
+  pages   = {400--407},
+  doi     = {10.1214/aoms/1177729586}
+}
+
+@article{bottou2018optimization,
+  author  = {Bottou, L{\'e}on and Curtis, Frank E. and Nocedal, Jorge},
+  title   = {Optimization Methods for Large-Scale Machine Learning},
+  journal = {SIAM Review},
+  year    = {2018},
+  volume  = {60},
+  number  = {2},
+  pages   = {223--311},
+  doi     = {10.1137/16M1080173}
+}
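
To make the Robbins–Monro conditions in the added paragraph concrete, here is a minimal sketch of SGD with the schedule $\eta_t = \eta_0/(t+1)$, which satisfies $\sum_t \eta_t = \infty$ and $\sum_t \eta_t^2 < \infty$. The toy least-squares setup and the helper `toy_sgd` are illustrative assumptions, not part of the patched `sgd.md`.

```python
# Sketch (assumed setup): SGD with a Robbins–Monro step size schedule
# eta_t = eta0 / (t + 1) on a toy least-squares objective
# f(x) = (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
A = rng.normal(size=(n, d))
x_star = rng.normal(size=d)
b = A @ x_star + 0.1 * rng.normal(size=n)   # noisy targets

def toy_sgd(T=5000, eta0=0.5):
    x = np.zeros(d)
    for t in range(T):
        i = rng.integers(n)                  # sample one index uniformly
        grad_i = (A[i] @ x - b[i]) * A[i]    # stochastic gradient of f_i at x
        eta_t = eta0 / (t + 1)               # sum eta_t diverges, sum eta_t^2 converges
        x -= eta_t * grad_i
    return x

x_hat = toy_sgd()
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]  # least-squares reference solution
print('distance to least-squares solution:', np.linalg.norm(x_hat - x_ls))
```

Replacing the decaying schedule with a constant step size (`eta_t = eta0`) would leave the iterates fluctuating in a noise-dominated neighborhood of the solution, matching the constant step size remark in the paragraph above.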