`site/resource_pages/optimizer_summary.md`
This document summarizes the four optimizers compared in the Lesson 30 demo.

## Optimizers overview
### 1. SGD (Stochastic Gradient Descent)
Vanilla gradient descent that updates parameters based on the gradient of the loss function. When `batch_size=1`, it's true stochastic gradient descent; with larger batches, it becomes mini-batch gradient descent. Simple but can be slow to converge and sensitive to learning rate choice. Citation: [Robbins and Monro, 1951](https://projecteuclid.org/euclid.aoms/1177729586).
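A minimal NumPy sketch of this update rule, for illustration only (it is not taken from the Lesson 30 demo); the names `params`, `grad`, and `lr` are hypothetical, and `grad` is assumed to be the loss gradient averaged over the current batch.

```python
import numpy as np

def sgd_step(params: np.ndarray, grad: np.ndarray, lr: float = 0.01) -> np.ndarray:
    """One vanilla SGD update: step against the (mini-)batch gradient of the loss."""
    return params - lr * grad
```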
### 2. SGD + Momentum
Extends vanilla SGD by accumulating a velocity vector along directions in which the gradient consistently points. This accelerates convergence along those directions and dampens oscillations. A momentum value of 0.9 is commonly used. Citation: [Polyak, 1964](https://doi.org/10.1016/0041-5553(64)90137-5).
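A sketch of the same step with a momentum term, again with hypothetical names; `velocity` is a running buffer assumed to be initialized to zeros.

```python
import numpy as np

def momentum_step(params: np.ndarray, grad: np.ndarray, velocity: np.ndarray,
                  lr: float = 0.01, momentum: float = 0.9):
    """One SGD+momentum update: decay the old velocity, add the new gradient step, move along it."""
    velocity = momentum * velocity - lr * grad
    return params + velocity, velocity
```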
### 3. RMSprop (Root Mean Square Propagation)
An adaptive learning rate optimizer that divides the learning rate by an exponentially decaying average of squared gradients. This allows the optimizer to use larger steps for infrequent features and smaller steps for frequent ones, making it well-suited for non-stationary objectives. Citation: [Hinton, 2012](https://www.cs.toronto.edu/~hinton/coursera/lecture6/lec6.pdf).
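A minimal sketch of the update, assuming a running buffer `sq_avg` of squared gradients initialized to zeros; the names and default hyperparameters are illustrative, not taken from the demo.

```python
import numpy as np

def rmsprop_step(params: np.ndarray, grad: np.ndarray, sq_avg: np.ndarray,
                 lr: float = 0.001, decay: float = 0.9, eps: float = 1e-8):
    """One RMSprop update: scale each parameter's step by a running RMS of its gradients."""
    sq_avg = decay * sq_avg + (1 - decay) * grad ** 2       # exponentially decaying average of squared gradients
    params = params - lr * grad / (np.sqrt(sq_avg) + eps)   # larger steps where gradients have been small, smaller where large
    return params, sq_avg
```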
### 4. Adam (Adaptive Moment Estimation)
Combines the best of momentum and RMSprop. It computes adaptive learning rates for each parameter using estimates of both first-order moments (mean) and second-order moments (uncentered variance) of the gradients. Adam is often the default choice due to its robustness across different problems. Citation: [Kingma and Ba, 2014](https://arxiv.org/abs/1412.6980).
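A sketch of the update following the cited paper, with illustrative names; `m` and `v` are the moment buffers (initialized to zeros) and `t` is the 1-based step count used for bias correction.

```python
import numpy as np

def adam_step(params: np.ndarray, grad: np.ndarray, m: np.ndarray, v: np.ndarray, t: int,
              lr: float = 0.001, beta1: float = 0.9, beta2: float = 0.999, eps: float = 1e-8):
    """One Adam update: bias-corrected first- and second-moment estimates drive the step."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero-initialized buffers
    v_hat = v / (1 - beta2 ** t)
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v
```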