14 changes: 14 additions & 0 deletions chapter_optimization/sgd.md
@@ -60,6 +60,20 @@ $$\mathbb{E}_i \nabla f_i(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n \nabla f_i(\mathbf{x}) = \nabla f(\mathbf{x}).$$

This means that, on average, the stochastic gradient is a good estimate of the gradient.

Modern analyses of SGD trace back to the **Robbins–Monro** stochastic approximation framework :cite:`robbins1951stochastic`. In that view, SGD seeks a root of the gradient mapping using a decreasing stepsize sequence that satisfies

$$
\sum_{t=0}^\infty \eta_t = \infty \quad \text{and} \quad \sum_{t=0}^\infty \eta_t^2 < \infty,
$$

which ensures convergence under mild regularity conditions. A standard assumption in contemporary proofs is **bounded variance** of the stochastic gradient noise:

$$
\mathbb{E}\!\left[\|\nabla f_{i_t}(\mathbf{x}_t)-\nabla f(\mathbf{x}_t)\|^2 \mid \mathbf{x}_t\right] \le \sigma^2,
$$

together with a lower-bounded objective and smoothness (Lipschitz-continuous gradients). Under these assumptions one obtains the classical $\mathcal{O}(1/\sqrt{T})$ rate for convex problems with appropriately decayed $\{\eta_t\}$, while constant-stepsize variants converge only to a noise-dominated neighborhood of the optimum. See :cite:`bottou2018optimization` for a tutorial-style survey.
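
To make these conditions concrete, here is a minimal sketch (not part of the chapter's code; it assumes only NumPy and a toy objective $f(\mathbf{x}) = \tfrac{1}{2}\|\mathbf{x}\|_2^2$). The polynomial schedule $\eta_t = \eta_0 (t+1)^{-\alpha}$ satisfies both Robbins–Monro conditions whenever $\tfrac{1}{2} < \alpha \le 1$, and the sketch contrasts it with a constant learning rate.

```{.python .input}
import numpy as np

def sgd_toy(alpha, eta0=0.5, T=10000, sigma=1.0, seed=0):
    """SGD on f(x) = 0.5 * ||x||^2, whose exact gradient is x itself.
    The stochastic gradient adds zero-mean Gaussian noise, so the
    bounded-variance assumption holds with E[||noise||^2] = sigma^2 * dim."""
    rng = np.random.default_rng(seed)
    x = np.array([5.0, -3.0])
    for t in range(T):
        eta = eta0 / (t + 1) ** alpha  # alpha = 0 recovers a constant step size
        grad = x + sigma * rng.standard_normal(x.shape)
        x = x - eta * grad
    return np.linalg.norm(x)  # distance to the minimizer x* = 0

# Robbins-Monro schedule (1/2 < alpha <= 1) vs. a constant learning rate.
print('decaying eta_t = 0.5/(t+1)^0.75:', sgd_toy(alpha=0.75))
print('constant eta_t = 0.5           :', sgd_toy(alpha=0.0))
```

With the decaying schedule the iterate ends up close to the minimizer, whereas the constant learning rate leaves it fluctuating in a neighborhood whose radius grows with $\sigma$ and the step size, in line with the discussion above.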

Now we will compare it with gradient descent by adding random noise with mean 0 and variance 1 to the gradient, simulating stochastic gradient descent.

```{.python .input}
22 changes: 22 additions & 0 deletions d2l.bib
@@ -4510,3 +4510,25 @@ @article{penedo2023refinedweb
journal={Ar{X}iv:2306.01116},
year={2023}
}

@article{robbins1951stochastic,
author = {Robbins, Herbert and Monro, Sutton},
title = {A Stochastic Approximation Method},
journal = {The Annals of Mathematical Statistics},
year = {1951},
volume = {22},
number = {3},
pages = {400--407},
doi = {10.1214/aoms/1177729586}
}

@article{bottou2018optimization,
author = {Bottou, L{\'e}on and Curtis, Frank E. and Nocedal, Jorge},
title = {Optimization Methods for Large-Scale Machine Learning},
journal = {SIAM Review},
year = {2018},
volume = {60},
number = {2},
pages = {223--311},
doi = {10.1137/16M1080173}
}