1 change: 1 addition & 0 deletions .gitattributes
@@ -3,3 +3,4 @@
*.pkl filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pdf filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
8 changes: 7 additions & 1 deletion data/datasets.yml
@@ -151,4 +151,10 @@ units:
title: "Incremental capstone 9"
datasets:
- name: "Banking customer data"
file: "Churn_Modeling.csv"
file: "Churn_Modeling.csv"

- number: "INC10"
title: "Incremental capstone 10"
datasets:
- name: "Face mask detection dataset"
file: "Face_mask_detection.zip"
3 changes: 3 additions & 0 deletions data/unit4/Face_mask_detection.zip
Git LFS file not shown
20 changes: 10 additions & 10 deletions site/resource_pages/optimizer_summary.md
@@ -20,25 +20,25 @@ This document summarizes the four optimizers compared in the Lesson 30 demo.
## Optimizers overview

### 1. SGD (Stochastic Gradient Descent)
Vanilla gradient descent that updates parameters based on the gradient of the loss function. When `batch_size=1`, it's true stochastic gradient descent; with larger batches, it becomes mini-batch gradient descent. Simple but can be slow to converge and sensitive to learning rate choice.
Vanilla gradient descent that updates parameters based on the gradient of the loss function. When `batch_size=1`, it's true stochastic gradient descent; with larger batches, it becomes mini-batch gradient descent. Simple but can be slow to converge and sensitive to learning rate choice. Citation: [Robbins and Monro, 1951](https://projecteuclid.org/euclid.aoms/1177729586).
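As a point of reference, here is a minimal NumPy sketch of the vanilla update rule; the `params`, `grads`, and `lr` names are illustrative and not taken from the demo code:

```python
import numpy as np

def sgd_step(params: np.ndarray, grads: np.ndarray, lr: float = 0.01) -> np.ndarray:
    # Vanilla (mini-batch) SGD: step each parameter against its gradient.
    return params - lr * grads
```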

### 2. SGD + Momentum
Extends vanilla SGD by accumulating a velocity vector in directions of persistent gradient descent. This helps accelerate convergence in relevant directions and dampens oscillations. A momentum value of 0.9 is commonly used.
Extends vanilla SGD by accumulating a velocity vector in directions of persistent gradient descent. This helps accelerate convergence in relevant directions and dampens oscillations. A momentum value of 0.9 is commonly used. Citation: [Polyak, 1964](https://doi.org/10.1016/0041-5553(64)90137-5).
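A minimal sketch of the momentum update, assuming the classical single-velocity-buffer formulation; all names and defaults are illustrative:

```python
import numpy as np

def momentum_step(params, grads, velocity, lr=0.01, momentum=0.9):
    # Accumulate an exponentially weighted velocity, then step along it.
    velocity = momentum * velocity - lr * grads
    params = params + velocity
    return params, velocity
```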

### 3. RMSprop (Root Mean Square Propagation)
An adaptive learning rate optimizer that divides the learning rate by an exponentially decaying average of squared gradients. This allows the optimizer to use larger steps for infrequent features and smaller steps for frequent ones, making it well-suited for non-stationary objectives.
An adaptive learning rate optimizer that divides the learning rate by an exponentially decaying average of squared gradients. This allows the optimizer to use larger steps for infrequent features and smaller steps for frequent ones, making it well-suited for non-stationary objectives. Citation: [Hinton, 2012](https://www.cs.toronto.edu/~hinton/coursera/lecture6/lec6.pdf).
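A minimal sketch of the RMSprop update with a decaying average of squared gradients; the `rho` and `eps` defaults below are common choices, not necessarily the demo's settings:

```python
import numpy as np

def rmsprop_step(params, grads, avg_sq, lr=0.001, rho=0.9, eps=1e-7):
    # Decaying average of squared gradients gives a per-parameter scale.
    avg_sq = rho * avg_sq + (1 - rho) * grads ** 2
    params = params - lr * grads / (np.sqrt(avg_sq) + eps)
    return params, avg_sq
```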

### 4. Adam (Adaptive Moment Estimation)
Combines the best of momentum and RMSprop. It computes adaptive learning rates for each parameter using estimates of both first-order moments (mean) and second-order moments (variance) of the gradients. Adam is often the default choice due to its robustness across different problems.
Combines the best of momentum and RMSprop. It computes adaptive learning rates for each parameter using estimates of both first-order moments (mean) and second-order moments (variance) of the gradients. Adam is often the default choice due to its robustness across different problems. Citation: [Kingma and Ba, 2014](https://arxiv.org/abs/1412.6980).
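A minimal sketch of the Adam update following Kingma and Ba, including bias correction of the two moment estimates; hyperparameter defaults are the paper's, and the names are illustrative:

```python
import numpy as np

def adam_step(params, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # t is the 1-based step count; m and v start as zero arrays.
    # First-moment (mean) and second-moment (uncentered variance) estimates.
    m = beta1 * m + (1 - beta1) * grads
    v = beta2 * v + (1 - beta2) * grads ** 2
    # Bias correction compensates for the zero initialization of m and v.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v
```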

## Optimization techniques comparison

| Optimizer | Momentum | Adaptive learning rate | Notes |
|---------------|:--------:|:----------------------:|--------------------------------------------|
| SGD | ❌ | ❌ | Vanilla gradient descent |
| SGD + Momentum| ✅ | ❌ | Uses velocity accumulation (e.g., 0.9) |
| RMSprop | ❌ | ✅ | Per-parameter learning rate scaling |
| Adam | ✅ | ✅ | Combines momentum + adaptive rates |
| Optimizer | Year introduced | Momentum | Adaptive learning rate | Notes |
|----------------|:---------------:|:--------:|:----------------------:|--------------------------------------------|
| SGD            | 1951            | ❌       | ❌                      | Vanilla gradient descent                    |
| SGD + Momentum | 1964 | ✅ | ❌ | Uses velocity accumulation (e.g., 0.9) |
| RMSprop | 2012 | ❌ | ✅ | Per-parameter learning rate scaling |
| Adam | 2014 | ✅ | ✅ | Combines momentum + adaptive rates |
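If the Lesson 30 demo is built on `tf.keras` (an assumption here; the demo's actual framework and hyperparameters may differ), the four optimizers can be instantiated side by side like this:

```python
from tensorflow import keras

# Hypothetical settings for a side-by-side comparison; the demo's
# actual learning rates and momentum value may differ.
optimizers = {
    "SGD": keras.optimizers.SGD(learning_rate=0.01),
    "SGD + Momentum": keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
    "RMSprop": keras.optimizers.RMSprop(learning_rate=0.001),
    "Adam": keras.optimizers.Adam(learning_rate=0.001),
}
```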

## Key takeaways from the demo
