Short for *Adaptive Gradient*, the AdaGrad Optimizer speeds up the learning of parameters that do not change often and slows down the learning of parameters that change frequently. Due to AdaGrad's infinitely decaying step size, training may become slow or fail to converge, especially when starting from a low learning rate.
## Mathematical formulation
Per step (element-wise), AdaGrad accumulates the sum of squared gradients and scales the update by the root of this sum:
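One standard way to write this, with gradient $g_t$, accumulator $r_t$, learning rate $\eta$, and a small stability constant $\epsilon$ (notation assumed here, not defined on this page):

$$
r_t = r_{t-1} + g_t^2, \qquad
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{r_t} + \epsilon}\, g_t
$$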
Short for *Adaptive Moment Estimation*, the Adam Optimizer combines both Momentum and RMS properties. In addition to storing an exponentially decaying average of past squared gradients like [RMSprop](rms-prop.md), Adam also keeps an exponentially decaying average of past gradients, similar to [Momentum](momentum.md). Whereas Momentum can be seen as a ball running down a slope, Adam behaves like a heavy ball with friction.
## Mathematical formulation
Per step (element-wise), Adam maintains exponentially decaying moving averages of the gradient and its element-wise square and uses them to scale the update:
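In the usual notation (assumed here), $m_t$ and $v_t$ are the first- and second-moment averages with decay rates $\beta_1, \beta_2$, $\eta$ is the learning rate, and $\epsilon$ a small stability constant:

$$
m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2
$$

$$
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad
\hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \qquad
\theta_{t+1} = \theta_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
$$

As a compact illustration of the ordering of the moment updates and bias correction (a sketch only, not this repository's implementation; the common default hyper-parameter values are assumed):

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters theta given gradient g (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * g        # decaying average of gradients
    v = beta2 * v + (1 - beta2) * g ** 2   # decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)           # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)           # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```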
A version of the [Adam](adam.md) optimizer that replaces the RMS property with the infinity norm of the past gradients. As such, AdaMax is generally more suitable for sparse parameter updates and noisy gradients.
## Mathematical formulation
Per step (element-wise), AdaMax maintains an exponentially decaying moving average of the gradient (velocity) and an infinity-norm accumulator of past gradients, and uses them to scale the update:
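In the usual notation (assumed here), with velocity $m_t$, infinity-norm accumulator $u_t$, decay rates $\beta_1, \beta_2$, and learning rate $\eta$:

$$
m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad
u_t = \max\left(\beta_2 u_{t-1},\, |g_t|\right), \qquad
\theta_{t+1} = \theta_t - \frac{\eta}{1 - \beta_1^t} \cdot \frac{m_t}{u_t}
$$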
The Cyclical optimizer uses a global learning rate that cycles between a lower and an upper bound over a designated period while also decaying the upper bound by a factor at each step. Cyclical learning rates have been shown to help escape bad local minima and saddle points of the loss surface.
## Mathematical formulation
Per step (element-wise), the cyclical learning rate and update are computed as:
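One common triangular form (a sketch; the exact schedule and symbols are assumptions, not taken from this page), with half-period $T$ steps, lower bound $\eta_{\min}$, upper bound $\eta_{\max}$, and per-step decay factor $\gamma$:

$$
x_t = \left| \frac{t \bmod 2T}{T} - 1 \right|, \qquad
\eta_t = \eta_{\min} + \left( \gamma^{t}\, \eta_{\max} - \eta_{\min} \right)\,(1 - x_t), \qquad
\theta_{t+1} = \theta_t - \eta_t\, g_t
$$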
Momentum accelerates each update step by accumulating velocity from past updates and adding a factor of the previous velocity to the current step. Momentum can help speed up training and escape bad local minima when compared with [Stochastic](stochastic.md) Gradient Descent.
## Mathematical formulation
Per step (element-wise), Momentum updates the velocity and applies it as the parameter step:
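With velocity $v_t$, momentum factor $\mu$, and learning rate $\eta$ (notation assumed here), one standard form is:

$$
v_t = \mu\, v_{t-1} + \eta\, g_t, \qquad
\theta_{t+1} = \theta_t - v_t
$$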
An adaptive gradient technique that divides the current gradient by a rolling average of the magnitudes of recent gradients. Unlike [AdaGrad](adagrad.md), RMS Prop does not suffer from an infinitely decaying step size.
## Mathematical formulation
Per step (element-wise), RMSProp maintains a running average of squared gradients and scales the step by the root-mean-square:
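With running average $r_t$, decay rate $\rho$, learning rate $\eta$, and small stability constant $\epsilon$ (notation assumed here):

$$
r_t = \rho\, r_{t-1} + (1 - \rho)\, g_t^2, \qquad
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{r_t} + \epsilon}\, g_t
$$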
A learning rate decay optimizer that reduces the global learning rate by a factor each time training reaches a new *floor*. The number of steps needed to reach a new floor is defined by the *steps* hyper-parameter.
## Mathematical formulation
Per step (element-wise), the Step Decay learning rate and update are:
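With initial learning rate $\eta_0$, decay factor $d$, and floor length $s$ (the *steps* hyper-parameter; notation assumed here):

$$
\eta_t = \eta_0 \cdot d^{\left\lfloor t / s \right\rfloor}, \qquad
\theta_{t+1} = \theta_t - \eta_t\, g_t
$$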