As I study the basics of Deep Learning from scratch, I created this repo to annotate the foundational math in a well-structured fashion, and I keep coming back to it as both the content and my understanding of the topic progress. Getting the matrix calculus right on paper is extremely useful when coding the math later in from-scratch implementations, especially when building applied computer vision systems.
Since I am primarily working with computer vision, data in the form of images typically enters the system as a 4D tensor:
- $N$: Batch Size (number of images processed at once).
- $H \times W$: Spatial dimensions (Height and Width) of the image.
- $C$: Channels (e.g., 1 for grayscale, 3 for RGB).
To use a standard fully connected layer, the spatial dimensions must be flattened into a single feature vector of length $D = C \cdot H \cdot W$, giving an input batch $X \in \mathbb{R}^{N \times D}$.
The weight matrix $W \in \mathbb{R}^{D \times M}$ maps the flattened input features to a specified number of output neurons ($M$).
The bias is a row vector providing a learnable offset for each output neuron: $b \in \mathbb{R}^{1 \times M}$.
In software, adding a $(1 \times M)$ row vector to an $(N \times M)$ matrix is handled automatically via broadcasting. To make the bias compatible for matrix addition with the weighted sum ($XW$) in the algebra, we left-multiply it by a column vector of ones, $\mathbf{1}_N \in \mathbb{R}^{N \times 1}$, so that $\mathbf{1}_N b$ becomes an $(N \times M)$ matrix.
The complete algebraic representation of a single forward pass in a linear layer is:

$$Y = XW + \mathbf{1}_N b$$

- Weighted Sum ($XW$): $(N \times D) \cdot (D \times M) = (N \times M)$
- Broadcasted Bias ($\mathbf{1}_N b$): $(N \times 1) \cdot (1 \times M) = (N \times M)$
- Output ($Y$): the resulting matrix is $(N \times M)$ (see the NumPy sketch below)
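As a quick sanity check, here is a minimal NumPy sketch of this forward pass (the sizes and variable names are placeholders of my choosing, mirroring the math above):

```python
import numpy as np

N, D, M = 32, 3 * 32 * 32, 10      # batch size, flattened features, output neurons

rng = np.random.default_rng(0)
X = rng.standard_normal((N, D))          # input batch, one flattened image per row
W = rng.standard_normal((D, M)) * 0.01   # weight matrix
b = np.zeros((1, M))                     # bias row vector

# NumPy broadcasting plays the role of 1_N b: the (1, M) bias row
# is stretched across all N rows of XW automatically.
Y = X @ W + b
assert Y.shape == (N, M)
```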
A two-layer network (often called a Multi-Layer Perceptron with one hidden layer) introduces non-linearity and a second transformation. This allows the model to learn complex, non-linear patterns in data, such as distinguishing anomalies in medical imaging or tracking metrics.
For this architecture, we define two sets of weights and biases:
- Layer 1 (Hidden Layer): $W_1 \in \mathbb{R}^{D \times H}$ and $b_1 \in \mathbb{R}^{1 \times H}$
- Layer 2 (Output Layer): $W_2 \in \mathbb{R}^{H \times M}$ and $b_2 \in \mathbb{R}^{1 \times M}$
- Activation Function ($\sigma$): Usually ReLU for the hidden layer.

Note that $H$ here denotes the number of hidden units, not the image height from earlier.
- Hidden Layer Transformation ($Z_1$): We calculate the weighted sum of the input and apply the broadcasted bias: $$Z_1 = XW_1 + \mathbf{1}_N b_1$$
- Activation ($A_1$): We pass the hidden layer's output through a non-linear activation function (like ReLU): $$A_1 = \max(0, Z_1)$$
- Output Layer Transformation ($Y$): The activated output $A_1$ now serves as the input for the final layer (see the sketch after this list): $$Y = A_1 W_2 + \mathbf{1}_N b_2$$
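A minimal NumPy sketch of these three steps, with placeholder sizes:

```python
import numpy as np

N, D, H, M = 32, 3072, 100, 10   # batch, input features, hidden units, outputs

rng = np.random.default_rng(0)
X  = rng.standard_normal((N, D))
W1 = rng.standard_normal((D, H)) * 0.01
b1 = np.zeros((1, H))
W2 = rng.standard_normal((H, M)) * 0.01
b2 = np.zeros((1, M))

Z1 = X @ W1 + b1          # hidden layer transformation (broadcasted bias)
A1 = np.maximum(0, Z1)    # ReLU activation
Y  = A1 @ W2 + b2         # output layer transformation
assert Y.shape == (N, M)
```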
Keeping track of dimensions is critical when coding these networks from scratch.
| Tensor | Description | Dimension |
|---|---|---|
| $X$ | Input Batch | $N \times D$ |
| $W_1$ | Hidden Weights | $D \times H$ |
| $b_1$ | Hidden Bias | $1 \times H$ |
| $A_1$ | Hidden Activation | $N \times H$ |
| $W_2$ | Output Weights | $H \times M$ |
| $b_2$ | Output Bias | $1 \times M$ |
| $Y$ | Final Output | $N \times M$ |
To train the network, we must compute the gradients of the loss ($L$) with respect to each parameter. For a single linear layer $Y = XW + \mathbf{1}_N b$, assume we receive the upstream gradient $\frac{\partial L}{\partial Y} \in \mathbb{R}^{N \times M}$.

The gradient of the weights is the dot product of the transposed input and the upstream gradient. This maps the error back to the dimensions of the weights:

$$\frac{\partial L}{\partial W} = X^T \frac{\partial L}{\partial Y}$$

The gradient of the bias is the sum of the upstream gradients across the batch dimension ($N$):

$$\frac{\partial L}{\partial b} = \mathbf{1}_N^T \frac{\partial L}{\partial Y}$$

To continue backpropagation to earlier layers, we calculate the gradient with respect to the input by taking the dot product of the upstream gradient and the transposed weight matrix:

$$\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} W^T$$
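In NumPy, each of these gradients is a single line. The sketch below uses a random placeholder `dY` standing in for the upstream gradient $\frac{\partial L}{\partial Y}$:

```python
import numpy as np

# Shapes: X is (N, D), W is (D, M), upstream gradient dY is (N, M).
N, D, M = 32, 3072, 10
rng = np.random.default_rng(0)
X, W = rng.standard_normal((N, D)), rng.standard_normal((D, M))
dY = rng.standard_normal((N, M))        # placeholder for dL/dY

dW = X.T @ dY                           # (D, N) @ (N, M) -> (D, M), matches W
db = dY.sum(axis=0, keepdims=True)      # sum over the batch -> (1, M), matches b
dX = dY @ W.T                           # (N, M) @ (M, D) -> (N, D), matches X
```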
Once the gradients are computed, we use them to update the weights. As an economist, I view these optimizers as systems balancing "historical trends" (momentum) with "market volatility" (adaptive learning rates). Let $\theta$ be a parameter, $g_t$ its gradient at time step $t$, and $\alpha$ the learning rate.
Momentum accumulates a velocity vector ($v_t$) that smooths the update direction over time:

$$v_t = \mu v_{t-1} - \alpha g_t, \qquad \theta_{t+1} = \theta_t + v_t$$

where $\mu$ (typically 0.9) controls how much of the past velocity is retained.
RMSProp uses an exponentially decaying average of squared gradients ($s_t$) to adapt the step size per parameter:

$$s_t = \rho s_{t-1} + (1 - \rho) g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{s_t} + \epsilon} g_t$$
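A sketch of both update rules as stateless helper functions (the hyperparameter defaults are conventional choices of mine, not prescribed above):

```python
import numpy as np

def momentum_step(theta, grad, v, lr=0.01, mu=0.9):
    """Momentum: accumulate a velocity vector, then step along it."""
    v = mu * v - lr * grad                 # decay old velocity, add current gradient
    return theta + v, v

def rmsprop_step(theta, grad, s, lr=0.001, rho=0.99, eps=1e-8):
    """RMSProp: scale the step by a decaying average of squared gradients."""
    s = rho * s + (1 - rho) * grad**2
    return theta - lr * grad / (np.sqrt(s) + eps), s
```

Each function returns the updated parameter together with its state (`v` or `s`), so the caller threads that state through the training loop.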
Adam is the hybrid workhorse for deep networks, combining the directional velocity of Momentum with the adaptive scaling of RMSProp. It is highly robust and often the default choice for training complex architectures because it handles sparse gradients and noisy data exceptionally well.
To understand Adam, we track two distinct "moments" of the gradient over time step $t$.
First, we calculate the moving averages of the gradient and its square:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

- The First Moment ($m_t$): This acts like Momentum. It estimates the mean (first moment) of the gradients, tracking the general "trend" or direction.
- The Second Moment ($v_t$): This acts like RMSProp. It estimates the uncentered variance (second moment) of the gradients, tracking the "volatility" or scale of the updates.
Because the moving averages $m_t$ and $v_t$ are initialized at zero, they are biased toward zero during the first several steps.

To fix this, Adam applies a Bias Correction based on the current iteration/time step $t$:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

Mathematical Intuition: Notice that in the first few iterations, $1 - \beta^t$ is close to zero (e.g., $1 - 0.9^1 = 0.1$), so dividing by it scales the small initial averages up to a realistic magnitude; as $t$ grows, $\beta^t \to 0$ and the correction fades away.

Finally, we use the bias-corrected moments to update the parameters. We step in the direction of the trend ($\hat{m}_t$), scaled down wherever the volatility ($\hat{v}_t$) is high:

$$\theta_{t+1} = \theta_t - \frac{\alpha \, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
When implementing Adam from scratch, the industry-standard default values established in the original Kingma & Ba paper are:
- $\alpha = 0.001$ (Learning rate)
- $\beta_1 = 0.9$ (Decay rate for the first moment)
- $\beta_2 = 0.999$ (Decay rate for the second moment)
- $\epsilon = 10^{-8}$ (Numerical stability constant)
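Putting the moments, the bias correction, and the update rule together, a from-scratch Adam step might look like this sketch:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. t is the 1-based iteration count."""
    m = beta1 * m + (1 - beta1) * grad        # first moment: the "trend"
    v = beta2 * v + (1 - beta2) * grad**2     # second moment: the "volatility"
    m_hat = m / (1 - beta1**t)                # bias correction (large effect early)
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage: initialize m and v as np.zeros_like(theta) and start t at 1.
```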
When building out custom architectures for my applied data science and computer vision projects, understanding the entire lifecycle of a neural network is non-negotiable. The process is a continuous loop of three phases: the Forward Pass, Backpropagation, and the Parameter Update.
Below is the complete algebraic breakdown for a standard Two-Layer Neural Network.
The forward pass is where the model makes its prediction. We push the input matrix $X$ through each layer in turn until we obtain the raw scores.
- First Linear Transformation: $$Z_1 = X W_1 + \mathbf{1}_N b_1$$
- Non-linear Activation (ReLU): $$A_1 = \max(0, Z_1)$$
- Second Linear Transformation (Scores): $$Z_2 = A_1 W_2 + \mathbf{1}_N b_2$$
- The Loss Function ($L$): We evaluate how far off our scores $Z_2$ are from the ground truth $Y_{true}$. The loss collapses our high-dimensional predictions into a single, measurable scalar value (one concrete choice is sketched after this list): $$L = \text{Loss}(Z_2, Y_{true})$$
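The loss itself is left generic above. As one common concrete choice (an assumption, not the only option), a softmax cross-entropy over the scores looks like this; it conveniently also returns the upstream gradient needed to start backpropagation:

```python
import numpy as np

def softmax_cross_entropy(Z2, y_true):
    """Z2: (N, M) scores; y_true: (N,) integer class labels.
    Returns the scalar loss L and the gradient dL/dZ2."""
    shifted = Z2 - Z2.max(axis=1, keepdims=True)    # subtract max for stability
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=1, keepdims=True)    # softmax probabilities
    N = Z2.shape[0]
    loss = -np.log(probs[np.arange(N), y_true]).mean()
    dZ2 = probs.copy()
    dZ2[np.arange(N), y_true] -= 1                  # probs - one_hot(y_true)
    return loss, dZ2 / N                            # gradient used in step 1 below
```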
To improve the model, I need to know how every single weight in $W_1$ and $W_2$ (and every bias) contributed to the final loss $L$.
In vector calculus, if we have a function that maps an $n$-dimensional input to an $m$-dimensional output, its derivative is the $m \times n$ Jacobian matrix containing the partial derivative of every output with respect to every input.
For example, the local derivative of our hidden layer $Z_1 = XW_1 + \mathbf{1}_N b_1$ with respect to a single input row $x$ is the Jacobian $W_1^T \in \mathbb{R}^{H \times D}$.
When writing this in code, constructing a full Jacobian matrix for millions of parameters would crash the system's memory. Instead, backpropagation relies on the Vector-Jacobian Product. We take the "upstream gradient" (how much the loss cares about the output of a layer) and multiply it by the local gradient (the Jacobian of that specific layer) to get the "downstream gradient" to pass backward.
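To make the memory argument concrete, here is a small sketch using the hidden layer above (the sizes are arbitrary):

```python
import numpy as np

N, D, H = 32, 3072, 100
rng = np.random.default_rng(0)
X, W1 = rng.standard_normal((N, D)), rng.standard_normal((D, H))
dZ1 = rng.standard_normal((N, H))    # upstream gradient dL/dZ1

# The full Jacobian dZ1/dX would have (N*H) x (N*D) ~ 3e8 entries
# (roughly 2.5 GB in float64) for even this small batch. The
# vector-Jacobian product gets the same answer with one matmul:
dX = dZ1 @ W1.T                      # downstream gradient, (N, D)
```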
Using the chain rule, we compute the gradients starting from the end of the network and moving backward.
1. Gradient at the Output:
We start with the derivative of the loss with respect to our raw scores, $\frac{\partial L}{\partial Z_2}$, whose exact form depends on the chosen loss function.
2. Gradients of Layer 2 Parameters:
We use $A_1$, the input that layer 2 actually saw, to compute:

$$\frac{\partial L}{\partial W_2} = A_1^T \frac{\partial L}{\partial Z_2}, \qquad \frac{\partial L}{\partial b_2} = \mathbf{1}_N^T \frac{\partial L}{\partial Z_2}$$
3. Passing the Gradient Backward (Input to Layer 2):
To keep moving backward, we need the gradient with respect to the activation $A_1$:

$$\frac{\partial L}{\partial A_1} = \frac{\partial L}{\partial Z_2} W_2^T$$
4. The ReLU Local Jacobian:
The derivative of the ReLU function is 1 where its input was positive and 0 elsewhere, so the upstream gradient is simply masked:

$$\frac{\partial L}{\partial Z_1} = \frac{\partial L}{\partial A_1} \odot \mathbb{1}[Z_1 > 0]$$
5. Gradients of Layer 1 Parameters:
Finally, we compute the gradients for our first layer's weights and biases:

$$\frac{\partial L}{\partial W_1} = X^T \frac{\partial L}{\partial Z_1}, \qquad \frac{\partial L}{\partial b_1} = \mathbf{1}_N^T \frac{\partial L}{\partial Z_1}$$
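Chaining steps 1 through 5 into a single NumPy routine (a sketch; the names mirror the math, and `dZ2` is assumed to come from the loss in step 1):

```python
import numpy as np

def two_layer_backward(X, Z1, A1, W2, dZ2):
    """Backprop through the two-layer net. dZ2 is dL/dZ2 from the loss."""
    # Step 2: layer 2 parameter gradients
    dW2 = A1.T @ dZ2
    db2 = dZ2.sum(axis=0, keepdims=True)
    # Step 3: pass the gradient back to the activation
    dA1 = dZ2 @ W2.T
    # Step 4: ReLU mask -- gradient flows only where Z1 was positive
    dZ1 = dA1 * (Z1 > 0)
    # Step 5: layer 1 parameter gradients
    dW1 = X.T @ dZ1
    db1 = dZ1.sum(axis=0, keepdims=True)
    return dW1, db1, dW2, db2
```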