
Commit afe14f8

SGD: add Robbins–Monro context and bounded-variance assumption
1 parent 23d7a5a commit afe14f8

File tree

chapter_optimization/sgd.md
config.ini
d2l.bib

3 files changed: +37 −1 lines changed

chapter_optimization/sgd.md

Lines changed: 14 additions & 0 deletions
@@ -60,6 +60,20 @@ $$\mathbb{E}_i \nabla f_i(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n \nabla f_i(\m
 
 This means that, on average, the stochastic gradient is a good estimate of the gradient.
 
+Modern analyses of SGD trace back to the **Robbins–Monro** stochastic approximation framework :cite:`robbins1951stochastic`. In that view, SGD seeks a root of the gradient mapping using a decreasing stepsize sequence that satisfies
+
+$$
+\sum_{t=0}^\infty \eta_t = \infty \quad \text{and} \quad \sum_{t=0}^\infty \eta_t^2 < \infty,
+$$
+
+which ensures convergence under mild regularity conditions. A standard assumption in contemporary proofs is **bounded variance** of the stochastic gradient noise:
+
+$$
+\mathbb{E}\!\left[\|\nabla f_{i_t}(\mathbf{x}_t)-\nabla f(\mathbf{x}_t)\|^2 \mid \mathbf{x}_t\right] \le \sigma^2,
+$$
+
+together with lower-bounded objectives and smoothness (Lipschitz gradients). Under these assumptions, one obtains the classical $\mathcal{O}(1/\sqrt{T})$ rate for convex problems with appropriately decayed $\{\eta_t\}$, and constant-stepsize variants converge to a noise-dominated neighborhood. See :cite:`bottou2018optimization` for a tutorial-style survey.
+
 Now, we will compare it with gradient descent by adding random noise with a mean of 0 and a variance of 1 to the gradient to simulate stochastic gradient descent.
 
 ```{.python .input}
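
To make the added passage concrete, here is a minimal NumPy sketch (not part of this commit) of SGD with the Robbins–Monro schedule $\eta_t = \eta_0/(t+1)$ on a synthetic least-squares problem; the data, objective, and all names below are illustrative assumptions rather than code from sgd.md.

```python
# Minimal sketch: SGD with a Robbins-Monro stepsize schedule.
# The synthetic least-squares setup below is an illustrative assumption,
# not code taken from chapter_optimization/sgd.md.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 2
A = rng.normal(size=(n, d))                    # per-example features a_i
b = A @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=n)

def full_grad(x):
    # Gradient of f(x) = (1/2n) * sum_i (a_i^T x - b_i)^2
    return A.T @ (A @ x - b) / n

def stochastic_grad(x, i):
    # Gradient of a single f_i(x) = (1/2) * (a_i^T x - b_i)^2
    return A[i] * (A[i] @ x - b[i])

# Unbiasedness check: averaging grad f_i over all i recovers grad f.
x0 = rng.normal(size=d)
avg = np.mean([stochastic_grad(x0, i) for i in range(n)], axis=0)
assert np.allclose(avg, full_grad(x0))

x = np.zeros(d)
eta0, T = 0.5, 10_000
for t in range(T):
    i = rng.integers(n)                        # sample one example uniformly
    eta_t = eta0 / (t + 1)                     # sum eta_t diverges, sum eta_t^2 converges
    x -= eta_t * stochastic_grad(x, i)
    if (t + 1) % 2000 == 0:
        print(f"t={t+1:5d}  eta_t={eta_t:.5f}  "
              f"||grad f(x)||={np.linalg.norm(full_grad(x)):.4f}")
```

Replacing the decaying $\eta_t$ with a constant stepsize would instead leave the iterates wandering in a noise-dominated neighborhood of the optimum, as the added paragraph notes.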

config.ini

Lines changed: 1 addition & 1 deletion
@@ -29,7 +29,7 @@ resources = img/ d2l/ d2l.bib setup.py static/latex_style/
 exclusions = README.md STYLE_GUIDE.md INFO.md CODE_OF_CONDUCT.md CONTRIBUTING.md contrib/*md
 
 # If True (default), then will evaluate the notebook to obtain outputs.
-eval_notebook = True
+eval_notebook = False
 
 tabs = pytorch, mxnet, jax, tensorflow

d2l.bib

Lines changed: 22 additions & 0 deletions
@@ -4510,3 +4510,25 @@ @article{penedo2023refinedweb
 journal={Ar{X}iv:2306.01116},
 year={2023}
 }
+
+@article{robbins1951stochastic,
+  author  = {Robbins, Herbert and Monro, Sutton},
+  title   = {A Stochastic Approximation Method},
+  journal = {The Annals of Mathematical Statistics},
+  year    = {1951},
+  volume  = {22},
+  number  = {3},
+  pages   = {400--407},
+  doi     = {10.1214/aoms/1177729586}
+}
+
+@article{bottou2018optimization,
+  author  = {Bottou, L{\'e}on and Curtis, Frank E. and Nocedal, Jorge},
+  title   = {Optimization Methods for Large-Scale Machine Learning},
+  journal = {SIAM Review},
+  year    = {2018},
+  volume  = {60},
+  number  = {2},
+  pages   = {223--311},
+  doi     = {10.1137/16M1080173}
+}
