
Commit 05ea6b4

Merge pull request #391 from apphp/390-convert-optimizers-class-to-NumPower
390 convert optimizers class to num power
2 parents c058f14 + 5255d64 commit 05ea6b4

File tree

27 files changed: +2046 −38 lines
Lines changed: 19 additions & 3 deletions
@@ -1,19 +1,35 @@
-<span style="float:right;"><a href="https://github.com/RubixML/ML/blob/master/src/NeuralNet/Optimizers/AdaGrad.php">[source]</a></span>
+<span style="float:right;"><a href="https://github.com/RubixML/ML/blob/master/src/NeuralNet/Optimizers/AdaGrad/AdaGrad.php">[source]</a></span>

# AdaGrad

Short for *Adaptive Gradient*, the AdaGrad Optimizer speeds up the learning of parameters that do not change often and slows down the learning of parameters that change frequently. Due to AdaGrad's infinitely decaying step size, training may be slow or fail to converge using a low learning rate.

## Mathematical formulation

Per step (element-wise), AdaGrad accumulates the sum of squared gradients and scales the update by the square root of this sum:

$$
\begin{aligned}
\mathbf{n}_t &= \mathbf{n}_{t-1} + \mathbf{g}_t^{2} \\
\Delta{\theta}_t &= \alpha\, \frac{\mathbf{g}_t}{\sqrt{\mathbf{n}_t} + \varepsilon}
\end{aligned}
$$

where:
- $t$ is the current step,
- $\alpha$ is the learning rate (`rate`),
- $\mathbf{g}_t$ is the current gradient, and $\mathbf{g}_t^{2}$ denotes element-wise square,
- $\varepsilon$ is a small constant for numerical stability (in the implementation, the denominator is clipped from below by `EPSILON`).
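As an illustrative sketch of the update rule above (plain Python on scalars, not the library's NumPower implementation; the `EPSILON` value here is assumed for illustration):

```python
import math

EPSILON = 1e-8  # assumed stability constant; the library defines its own

def adagrad_step(norm, grad, rate=0.01):
    """Return (new_norm, step) for one AdaGrad update on a scalar parameter."""
    norm = norm + grad ** 2  # accumulate the sum of squared gradients
    # scale the gradient by the root of the sum, clipping the denominator from below
    step = rate * grad / max(math.sqrt(norm), EPSILON)
    return norm, step
```

Parameters that receive large gradients grow a large accumulator and thus take ever smaller steps, which is the source of the infinitely decaying step size noted above.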

## Parameters
| # | Name | Default | Type | Description |
|---|---|---|---|---|
| 1 | rate | 0.01 | float | The learning rate that controls the global step size. |

## Example
```php
-use Rubix\ML\NeuralNet\Optimizers\AdaGrad;
+use Rubix\ML\NeuralNet\Optimizers\AdaGrad\AdaGrad;

$optimizer = new AdaGrad(0.125);
```

## References
[^1]: J. Duchi et al. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.
Lines changed: 22 additions & 3 deletions
@@ -1,8 +1,27 @@
-<span style="float:right;"><a href="https://github.com/RubixML/ML/blob/master/src/NeuralNet/Optimizers/Adam.php">[source]</a></span>
+<span style="float:right;"><a href="https://github.com/RubixML/ML/blob/master/src/NeuralNet/Optimizers/Adam/Adam.php">[source]</a></span>

# Adam

Short for *Adaptive Moment Estimation*, the Adam Optimizer combines both Momentum and RMS properties. In addition to storing an exponentially decaying average of past squared gradients like [RMSprop](rms-prop.md), Adam also keeps an exponentially decaying average of past gradients, similar to [Momentum](momentum.md). Whereas Momentum can be seen as a ball running down a slope, Adam behaves like a heavy ball with friction.

## Mathematical formulation

Per step (element-wise), Adam maintains exponentially decaying moving averages of the gradient and its element-wise square and uses them to scale the update:

$$
\begin{aligned}
\mathbf{v}_t &= (1 - \beta_1)\,\mathbf{v}_{t-1} + \beta_1\,\mathbf{g}_t \\
\mathbf{n}_t &= (1 - \beta_2)\,\mathbf{n}_{t-1} + \beta_2\,\mathbf{g}_t^{2} \\
\Delta{\theta}_t &= \alpha\, \frac{\mathbf{v}_t}{\sqrt{\mathbf{n}_t} + \varepsilon}
\end{aligned}
$$

where:
- $t$ is the current step,
- $\alpha$ is the learning rate (`rate`),
- $\beta_1$ is the momentum decay (`momentumDecay`),
- $\beta_2$ is the norm decay (`normDecay`),
- $\mathbf{g}_t$ is the current gradient, and $\mathbf{g}_t^{2}$ denotes element-wise square,
- $\varepsilon$ is a small constant for numerical stability (in the implementation, the denominator is clipped from below by `EPSILON`).
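The equations above (note that this formulation places the decay factor on the incoming gradient, without bias correction) can be sketched on scalars as follows; this is illustrative Python, not the library's NumPower implementation, and `EPSILON` is an assumed constant:

```python
import math

EPSILON = 1e-8  # assumed stability constant; the library defines its own

def adam_step(velocity, norm, grad, rate=0.001,
              momentum_decay=0.1, norm_decay=0.001):
    """Return (velocity, norm, step) for one Adam update on a scalar parameter."""
    velocity = (1.0 - momentum_decay) * velocity + momentum_decay * grad
    norm = (1.0 - norm_decay) * norm + norm_decay * grad ** 2
    step = rate * velocity / max(math.sqrt(norm), EPSILON)
    return velocity, norm, step
```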

## Parameters
| # | Name | Default | Type | Description |
|---|---|---|---|---|
@@ -12,10 +31,10 @@ Short for *Adaptive Moment Estimation*, the Adam Optimizer combines both Momentu

## Example
```php
-use Rubix\ML\NeuralNet\Optimizers\Adam;
+use Rubix\ML\NeuralNet\Optimizers\Adam\Adam;

$optimizer = new Adam(0.0001, 0.1, 0.001);
```

## References
[^1]: D. P. Kingma et al. (2014). Adam: A Method for Stochastic Optimization.
Lines changed: 22 additions & 3 deletions
@@ -1,8 +1,27 @@
-<span style="float:right;"><a href="https://github.com/RubixML/ML/blob/master/src/NeuralNet/Optimizers/AdaMax.php">[source]</a></span>
+<span style="float:right;"><a href="https://github.com/RubixML/ML/blob/master/src/NeuralNet/Optimizers/AdaMax/AdaMax.php">[source]</a></span>

# AdaMax

A version of the [Adam](adam.md) optimizer that replaces the RMS property with the infinity norm of the past gradients. As such, AdaMax is generally more suitable for sparse parameter updates and noisy gradients.

## Mathematical formulation

Per step (element-wise), AdaMax maintains an exponentially decaying moving average of the gradient (velocity) and an infinity-norm accumulator of past gradients, and uses them to scale the update:

$$
\begin{aligned}
\mathbf{v}_t &= (1 - \beta_1)\,\mathbf{v}_{t-1} + \beta_1\,\mathbf{g}_t \\
\mathbf{u}_t &= \max\big(\beta_2\,\mathbf{u}_{t-1},\ |\mathbf{g}_t|\big) \\
\Delta{\theta}_t &= \alpha\, \frac{\mathbf{v}_t}{\max(\mathbf{u}_t, \varepsilon)}
\end{aligned}
$$

where:
- $t$ is the current step,
- $\alpha$ is the learning rate (`rate`),
- $\beta_1$ is the momentum decay (`momentumDecay`),
- $\beta_2$ is the norm decay (`normDecay`),
- $\mathbf{g}_t$ is the current gradient and $|\mathbf{g}_t|$ denotes element-wise absolute value,
- $\varepsilon$ is a small constant for numerical stability (in the implementation, the denominator is clipped from below by `EPSILON`).
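A scalar sketch of the update above (illustrative Python, not the library's NumPower implementation; `EPSILON` is an assumed constant):

```python
EPSILON = 1e-8  # assumed stability constant; the library defines its own

def adamax_step(velocity, inf_norm, grad, rate=0.001,
                momentum_decay=0.1, norm_decay=0.001):
    """Return (velocity, inf_norm, step) for one AdaMax update on a scalar."""
    velocity = (1.0 - momentum_decay) * velocity + momentum_decay * grad
    # infinity-norm accumulator: decayed previous value vs. current magnitude
    inf_norm = max(norm_decay * inf_norm, abs(grad))
    step = rate * velocity / max(inf_norm, EPSILON)
    return velocity, inf_norm, step
```

Because the accumulator takes a max rather than a sum, a single large gradient dominates the denominator until it decays away, which is what makes AdaMax robust to noisy gradients.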

## Parameters
| # | Name | Default | Type | Description |
|---|---|---|---|---|
@@ -12,10 +31,10 @@ A version of the [Adam](adam.md) optimizer that replaces the RMS property with t

## Example
```php
-use Rubix\ML\NeuralNet\Optimizers\AdaMax;
+use Rubix\ML\NeuralNet\Optimizers\AdaMax\AdaMax;

$optimizer = new AdaMax(0.0001, 0.1, 0.001);
```

## References
[^1]: D. P. Kingma et al. (2014). Adam: A Method for Stochastic Optimization.
Lines changed: 23 additions & 3 deletions
@@ -1,8 +1,28 @@
-<span style="float:right;"><a href="https://github.com/RubixML/ML/blob/master/src/NeuralNet/Optimizers/Cyclical.php">[source]</a></span>
+<span style="float:right;"><a href="https://github.com/RubixML/ML/blob/master/src/NeuralNet/Optimizers/Cyclical/Cyclical.php">[source]</a></span>

# Cyclical

The Cyclical optimizer uses a global learning rate that cycles between the lower and upper bound over a designated period while also decaying the upper bound by a factor at each step. Cyclical learning rates have been shown to help escape bad local minima and saddle points of the gradient.

## Mathematical formulation

Per step (element-wise), the cyclical learning rate and update are computed as:

$$
\begin{aligned}
\text{cycle} &= \left\lfloor 1 + \frac{t}{2\,\text{steps}} \right\rfloor \\
x &= \left| \frac{t}{\text{steps}} - 2\,\text{cycle} + 1 \right| \\
\text{scale} &= \text{decay}^{\,t} \\
\eta_t &= \text{lower} + (\text{upper} - \text{lower})\,\max\bigl(0,1 - x\bigr)\,\text{scale} \\
\Delta\theta_t &= \eta_t\,g_t
\end{aligned}
$$

where:
- $t$ is the current step counter,
- $\text{steps}$ is the number of steps in every half cycle,
- $\text{lower}$ and $\text{upper}$ are the learning rate bounds,
- $\text{decay}$ is the multiplicative decay applied each step,
- $g_t$ is the current gradient.
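The triangular schedule above can be sketched in Python (illustrative only, not the library's implementation; the argument values in the test below mirror the example further down, and the `decay` default is an assumption):

```python
import math

def cyclical_rate(t, lower=0.001, upper=0.005, steps=1000, decay=1.0):
    """Learning rate at step t for a triangular cyclical schedule.

    decay=1.0 disables upper-bound decay; a value slightly below 1 is typical.
    """
    cycle = math.floor(1 + t / (2 * steps))     # which triangle we are in
    x = abs(t / steps - 2 * cycle + 1)          # position within the triangle
    scale = decay ** t                          # decaying envelope on the peak
    return lower + (upper - lower) * max(0.0, 1.0 - x) * scale
```

The rate starts at `lower`, rises linearly to `upper` after `steps` updates, and falls back to `lower` after `2 * steps`, repeating thereafter.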

## Parameters
| # | Name | Default | Type | Description |
|---|---|---|---|---|
@@ -13,10 +33,10 @@ The Cyclical optimizer uses a global learning rate that cycles between the lower

## Example
```php
-use Rubix\ML\NeuralNet\Optimizers\Cyclical;
+use Rubix\ML\NeuralNet\Optimizers\Cyclical\Cyclical;

$optimizer = new Cyclical(0.001, 0.005, 1000);
```

## References
[^1]: L. N. Smith. (2017). Cyclical Learning Rates for Training Neural Networks.

docs/neural-network/optimizers/momentum.md

Lines changed: 27 additions & 2 deletions
@@ -1,8 +1,33 @@
-<span style="float:right;"><a href="https://github.com/RubixML/ML/blob/master/src/NeuralNet/Optimizers/Momentum.php">[source]</a></span>
+<span style="float:right;"><a href="https://github.com/RubixML/ML/blob/master/src/NeuralNet/Optimizers/Momentum/Momentum.php">[source]</a></span>

# Momentum

Momentum accelerates each update step by accumulating velocity from past updates and adding a factor of the previous velocity to the current step. Momentum can help speed up training and escape bad local minima when compared with [Stochastic](stochastic.md) Gradient Descent.

## Mathematical formulation

Per step (element-wise), Momentum updates the velocity and applies it as the parameter step:

$$
\begin{aligned}
\beta &= 1 - \text{decay}, \quad \eta = \text{rate} \\
\text{Velocity update:}\quad v_t &= \beta\,v_{t-1} + \eta\,g_t \\
\text{Returned step:}\quad \Delta\theta_t &= v_t
\end{aligned}
$$

Nesterov lookahead (when `lookahead = true`) is approximated by applying the velocity update a second time:

$$
\begin{aligned}
v_t &\leftarrow \beta\,v_t + \eta\,g_t
\end{aligned}
$$

where:
- $g_t$ is the current gradient,
- $v_t$ is the velocity (accumulated update),
- $\beta$ is the momentum coefficient ($1 - \text{decay}$),
- $\eta$ is the learning rate (`rate`).
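Both branches of the rule above can be sketched on a scalar parameter (illustrative Python, not the library's NumPower implementation):

```python
def momentum_step(velocity, grad, rate=0.01, decay=0.1, lookahead=False):
    """Return the new velocity, which is also the step applied to the parameter."""
    beta = 1.0 - decay
    velocity = beta * velocity + rate * grad      # standard velocity update
    if lookahead:
        # Nesterov approximation: apply the velocity update a second time
        velocity = beta * velocity + rate * grad
    return velocity
```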

## Parameters
| # | Name | Default | Type | Description |
|---|---|---|---|---|
@@ -12,7 +37,7 @@ Momentum accelerates each update step by accumulating velocity from past updates

## Example
```php
-use Rubix\ML\NeuralNet\Optimizers\Momentum;
+use Rubix\ML\NeuralNet\Optimizers\Momentum\Momentum;

$optimizer = new Momentum(0.01, 0.1, true);
```
Lines changed: 22 additions & 4 deletions
@@ -1,7 +1,25 @@
-<span style="float:right;"><a href="https://github.com/RubixML/ML/blob/master/src/NeuralNet/Optimizers/RMSProp.php">[source]</a></span>
+<span style="float:right;"><a href="https://github.com/RubixML/ML/blob/master/src/NeuralNet/Optimizers/RMSProp/RMSProp.php">[source]</a></span>

# RMS Prop

-An adaptive gradient technique that divides the current gradient over a rolling window of the magnitudes of recent gradients. Unlike [AdaGrad](adagrad.md), RMS Prop does not suffer from an infinitely decaying step size.
+An adaptive gradient technique that divides the current gradient over a rolling window of magnitudes of recent gradients. Unlike [AdaGrad](adagrad.md), RMS Prop does not suffer from an infinitely decaying step size.

## Mathematical formulation

Per step (element-wise), RMSProp maintains a running average of squared gradients and scales the step by the root-mean-square:

$$
\begin{aligned}
\rho &= 1 - \text{decay}, \quad \eta = \text{rate} \\
\text{Running average:}\quad v_t &= \rho\,v_{t-1} + (1 - \rho)\,g_t^{\,2} \\
\text{Returned step:}\quad \Delta\theta_t &= \frac{\eta\,g_t}{\max\bigl(\sqrt{v_t},\,\varepsilon\bigr)}
\end{aligned}
$$

where:
- $g_t$ is the current gradient,
- $v_t$ is the running average of squared gradients,
- $\rho$ is the averaging coefficient ($1 - \text{decay}$),
- $\eta$ is the learning rate (`rate`),
- $\varepsilon$ is a small constant to avoid division by zero (implemented by clipping $\sqrt{v_t}$ to $[\varepsilon, +\infty)$).
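A scalar sketch of the rule above (illustrative Python, not the library's NumPower implementation; `EPSILON` is an assumed constant):

```python
import math

EPSILON = 1e-8  # assumed stability constant; the library defines its own

def rmsprop_step(avg, grad, rate=0.01, decay=0.1):
    """Return (avg, step) for one RMSProp update on a scalar parameter."""
    rho = 1.0 - decay
    avg = rho * avg + (1.0 - rho) * grad ** 2   # running average of g^2
    # divide by the RMS, clipping the denominator from below
    step = rate * grad / max(math.sqrt(avg), EPSILON)
    return avg, step
```

Because the average is exponentially decaying rather than a running sum, old gradients fall out of the window and the step size does not decay to zero as it does in AdaGrad.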

## Parameters
| # | Name | Default | Type | Description |
@@ -11,10 +29,10 @@ An adaptive gradient technique that divides the current gradient over a rolling

## Example
```php
-use Rubix\ML\NeuralNet\Optimizers\RMSProp;
+use Rubix\ML\NeuralNet\Optimizers\RMSProp\RMSProp;

$optimizer = new RMSProp(0.01, 0.1);
```

## References
[^1]: T. Tieleman et al. (2012). Lecture 6e rmsprop: Divide the gradient by a running average of its recent magnitude.
Lines changed: 21 additions & 3 deletions
@@ -1,8 +1,26 @@
-<span style="float:right;"><a href="https://github.com/RubixML/ML/blob/master/src/NeuralNet/Optimizers/StepDecay.php">[source]</a></span>
+<span style="float:right;"><a href="https://github.com/RubixML/ML/blob/master/src/NeuralNet/Optimizers/StepDecay/StepDecay.php">[source]</a></span>

# Step Decay

A learning rate decay optimizer that reduces the global learning rate by a factor whenever it reaches a new *floor*. The number of steps needed to reach a new floor is defined by the *steps* hyper-parameter.

## Mathematical formulation

Per step (element-wise), the Step Decay learning rate and update are:

$$
\begin{aligned}
\text{floor} &= \left\lfloor \frac{t}{k} \right\rfloor \\
\eta_t &= \frac{\eta_0}{1 + \text{floor}\cdot \lambda} \\
\Delta\theta_t &= \eta_t\,g_t
\end{aligned}
$$

where:
- $t$ is the current step number,
- $k$ is the number of steps per floor,
- $\eta_0$ is the initial learning rate (`rate`),
- $\lambda$ is the decay factor (`decay`),
- $g_t$ is the current gradient.
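The schedule above can be sketched in Python (illustrative only, not the library's implementation; the test values mirror the example below):

```python
def step_decay_rate(t, rate=0.01, k=100, decay=1e-3):
    """Learning rate at step t: the rate is re-scaled each time a new floor is reached."""
    floor = t // k                      # how many floors have been passed
    return rate / (1.0 + floor * decay)
```

Within each window of `k` steps the rate is constant; crossing into the next window increments `floor` and shrinks the rate.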

## Parameters
| # | Name | Default | Type | Description |
|---|---|---|---|---|
@@ -12,7 +30,7 @@ A learning rate decay optimizer that reduces the global learning rate by a facto

## Example
```php
-use Rubix\ML\NeuralNet\Optimizers\StepDecay;
+use Rubix\ML\NeuralNet\Optimizers\StepDecay\StepDecay;

$optimizer = new StepDecay(0.1, 50, 1e-3);
```

docs/neural-network/optimizers/stochastic.md

Lines changed: 14 additions & 0 deletions
@@ -3,6 +3,20 @@
# Stochastic

A constant learning rate optimizer based on vanilla Stochastic Gradient Descent (SGD).

## Mathematical formulation

Per step (element-wise), the SGD update scales the gradient by a constant learning rate:

$$
\begin{aligned}
\eta &= \text{rate} \\
\Delta\theta_t &= \eta\,g_t
\end{aligned}
$$

where:
- $g_t$ is the current gradient,
- $\eta$ is the learning rate (`rate`).
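The update is a one-liner; as an illustrative scalar sketch (not the library's implementation):

```python
def sgd_step(grad, rate=0.01):
    """Vanilla SGD: the step is simply the gradient scaled by a constant rate."""
    return rate * grad
```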

## Parameters
| # | Name | Default | Type | Description |
|---|---|---|---|---|