Commit 51b6d73

Remove redundancies and add new content in lectures 19-26
Lectures 19-26 rewritten to eliminate content that repeated earlier lectures (15-18) and replace with new, substantive material:

- L19: Removed BERT recall slide, added ModernBERT deep dive, Gemma Encoder, production deployment patterns (ONNX/TensorRT/quantization)
- L20: Removed recall slide, added novel domain applications (clinical, legal, financial), Sentence-BERT and modern retrieval (MTEB)
- L21: Removed 4 redundant slides (transformer decoder, causal attention, BERT vs GPT), added BooksCorpus ethics, open-weight decoders (LLaMA revolution), test-time compute, weight tying
- L22: Condensed scaling laws and RLHF recaps, focused on overtraining era and why ChatGPT mattered
- L23: Stripped theory recaps from code slides, added weight tying, gradient accumulation, LR scheduling, mixed precision, nanoGPT comparison
- L24: Condensed CoT and RAG recaps, added OpenAI deep research and Claude Code mentions, focused agent table on agent paradigms
- L25: Removed repeated GPT parameter table, condensed Chinchilla recap
- L26: Replaced RLHF re-explanation with representation engineering and weak-to-strong generalization, added sycophancy bias

All 8 lectures recompiled to HTML and PDF. README updated with new topics.
1 parent 7a57e25 commit 51b6d73

25 files changed (+1006 / -653 lines)

slides/README.md

Lines changed: 14 additions & 4 deletions
@@ -203,23 +203,28 @@ Explore concepts hands-on with our interactive web demos! Each demo runs directl
 - 📊 [Slides PDF](https://contextlab.github.io/llm-course/slides/week6/lecture18.pdf) | 🌐 [Slides HTML](https://contextlab.github.io/llm-course/slides/week6/lecture18.html)
 
 **Wednesday (Lecture 19):** BERT Variants
-- RoBERTa, ALBERT, DistilBERT, ELECTRA, DeBERTa, ModernBERT
+- RoBERTa, ALBERT, DistilBERT, ELECTRA, DeBERTa, ModernBERT, Gemma Encoder
 - Training recipe improvements, parameter sharing, knowledge distillation, replaced token detection
+- ModernBERT deep dive: unpadding, RoPE, Flash Attention, 8192 context
+- Production deployment patterns: ONNX, TensorRT, quantization
 - Are encoder models still relevant in the era of GPT-4?
 - Reading: [Liu et al. (2019)](https://arxiv.org/abs/1907.11692) - RoBERTa
 - Reading: [Sanh et al. (2019)](https://arxiv.org/abs/1910.01108) - DistilBERT
 - Reading: [Clark et al. (2020)](https://arxiv.org/abs/2003.10555) - ELECTRA
 - Reading: [Warner et al. (2024)](https://arxiv.org/abs/2412.13663) - ModernBERT
+- Reading: [Google (2025)](https://arxiv.org/abs/2503.02656) - Gemma Encoder
 - 📓 [Companion Notebook](https://colab.research.google.com/github/ContextLab/llm-course/blob/main/slides/week6/bert_variants_demo.ipynb)
 - 📊 [Slides PDF](https://contextlab.github.io/llm-course/slides/week6/lecture19.pdf) | 🌐 [Slides HTML](https://contextlab.github.io/llm-course/slides/week6/lecture19.html)
 
 **Friday (Lecture 20):** Applications of Encoder Models
-- Classification, NER, question answering, semantic similarity
+- Novel applications: clinical NLP, legal tech, financial NER, scientific literature
+- Sentence-BERT and modern retrieval: from SBERT to E5 to NV-Embed (MTEB)
 - BERT in Google Search, industry adoption ($7.73B NLP market)
 - Brain-LLM alignment: neural encoding with language model representations
 - Systematic bias measurement (SAGED pipeline)
 - Reading: [Caucheteux & King (2022)](https://doi.org/10.1038/s42003-022-03036-1) - Brain-LLM alignment
 - Reading: [Aw et al. (2026)](https://openreview.net/forum?id=PgIlCCNxdB) - The Mind's Transformer (ICLR 2026)
+- Reading: [Reimers & Gurevych (2019)](https://arxiv.org/abs/1908.10084) - Sentence-BERT
 - Reading: [Jiang et al. (2025)](https://aclanthology.org/2025.coling-main.202.pdf) - SAGED bias evaluation
 - 📓 [Companion Notebook](https://colab.research.google.com/github/ContextLab/llm-course/blob/main/slides/week6/encoder_applications_demo.ipynb)
 - 📊 [Slides PDF](https://contextlab.github.io/llm-course/slides/week6/lecture20.pdf) | 🌐 [Slides HTML](https://contextlab.github.io/llm-course/slides/week6/lecture20.html)
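
To make the new "Sentence-BERT and modern retrieval" bullet concrete, here is a minimal retrieval sketch with the `sentence-transformers` library; the `all-MiniLM-L6-v2` checkpoint and the toy documents are illustrative choices, not part of the course materials, and E5 or NV-Embed checkpoints would slot in the same way.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative embedding model (not from the course repo)
model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "BERT is an encoder-only transformer.",
    "GPT models generate text autoregressively.",
    "Paris is the capital of France.",
]
query = "Which model family uses bidirectional attention?"

doc_emb = model.encode(docs, convert_to_tensor=True)      # one vector per document
query_emb = model.encode(query, convert_to_tensor=True)   # one vector for the query

scores = util.cos_sim(query_emb, doc_emb)   # cosine similarity: query vs. each doc
best = docs[int(scores.argmax())]
print(best)
```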
@@ -228,11 +233,15 @@ Explore concepts hands-on with our interactive web demos! Each demo runs directl
 ## Week 7: Decoder Models & GPT
 
 **Monday (Lecture 21):** GPT Architecture
-- Autoregressive generation, causal masking, the decoder stack
+- Generative pre-training paradigm, fine-tuning, weight tying
+- The BooksCorpus controversy: training data ethics
+- Open-weight decoders: the LLaMA revolution
+- Test-time compute and inference scaling
 - Modern decoder innovations: RMSNorm, SwiGLU, RoPE, GQA
 - Multi-token prediction and hybrid architectures (Jamba)
 - Reading: [Radford et al. (2018)](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf) - GPT-1
 - Reading: [Radford et al. (2019)](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) - GPT-2
+- Reading: [Touvron et al. (2023)](https://arxiv.org/abs/2302.13971) - LLaMA
 - Reading: [Gloeckle et al. (2024)](https://arxiv.org/abs/2404.19737) - Multi-token prediction
 - **📝 Assignment 4 Due (Feb 16, 11:59 PM EST)**
 - **Final Project Released:** [Final Project](https://contextlab.github.io/llm-course/assignments/final-project/) (Due: Mar 9, 11:59 PM EST)
@@ -252,8 +261,9 @@ Explore concepts hands-on with our interactive web demos! Each demo runs directl
 **Friday (Lecture 23):** Implementing GPT from Scratch
 - Build a complete mini-GPT (~30M params) in PyTorch
 - Tokenization, embeddings, masked attention, transformer blocks, training loop
+- Weight tying, gradient accumulation, LR scheduling, mixed precision training
 - Text generation: greedy, temperature, top-k, nucleus sampling
-- KV caching and FlashAttention for efficient inference
+- KV caching, FlashAttention, and nanoGPT comparison
 - Tutorial: [Let's build GPT (Karpathy)](https://www.youtube.com/watch?v=kCc8FmEb1nY)
 - Reading: [Dao et al. (2022)](https://arxiv.org/abs/2205.14135) - FlashAttention
 - 📓 [Companion Notebook](https://colab.research.google.com/github/ContextLab/llm-course/blob/main/slides/week7/gpt_from_scratch_demo.ipynb)
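
The new Lecture 23 bullet on weight tying, gradient accumulation, LR scheduling, and mixed precision combines naturally in one training loop. Below is a toy PyTorch sketch under stated assumptions: the stand-in model is an embedding plus a single transformer layer rather than the course's mini-GPT, and the causal mask is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 1000, 64
device = "cuda" if torch.cuda.is_available() else "cpu"

# Stand-in "decoder": embedding + one transformer layer + LM head (no causal mask here)
token_emb = nn.Embedding(vocab_size, d_model).to(device)
body = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True).to(device)
lm_head = nn.Linear(d_model, vocab_size, bias=False).to(device)
lm_head.weight = token_emb.weight   # weight tying: input and output embeddings share one matrix

params = list(token_emb.parameters()) + list(body.parameters())
optimizer = torch.optim.AdamW(params, lr=3e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)  # LR scheduling
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))                # mixed precision
accum_steps = 4   # gradient accumulation: 4 micro-batches per optimizer step

for step in range(8):
    ids = torch.randint(0, vocab_size, (2, 16), device=device)
    amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16
    with torch.autocast(device_type=device, dtype=amp_dtype):
        logits = lm_head(body(token_emb(ids)))
        # Next-token prediction loss, scaled so accumulated micro-batches sum to one batch
        loss = F.cross_entropy(
            logits[:, :-1].reshape(-1, vocab_size), ids[:, 1:].reshape(-1)
        ) / accum_steps
    scaler.scale(loss).backward()
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
        scheduler.step()
```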

slides/week6/lecture19.html

Lines changed: 146 additions & 61 deletions
Large diffs are not rendered by default.

slides/week6/lecture19.md

Lines changed: 81 additions & 43 deletions
@@ -29,17 +29,15 @@ Winter 2026
 
 ---
 
-# Recall: what BERT does
-
-<div class="note-box" data-title="From lectures 12 and 18">
+# What could be improved?
 
-We covered BERT's architecture (MLM + NSP, bidirectional attention) in Lecture 12, and explored what BERT actually *learns* (attention patterns, layer probing, neuroscience connections) in Lecture 18.
+<div class="warning-box" data-title="BERT's key limitations (from Lecture 18)">
 
-</div>
+**Training procedure:** NSP may hurt performance; static masking reuses the same masks every epoch; only 15% of tokens provide training signal.
 
-<div class="tip-box" data-title="Today's focus">
+**Scale:** Trained on only 3.3B words with 100K steps — modern datasets are 100× larger.
 
-Today: how researchers improved on BERT's original recipe, and whether encoder models still matter in 2026.
+**Efficiency:** 110M parameters are all active for every input. Large memory footprint for deployment.
 
 </div>
 
@@ -51,20 +49,6 @@ Today: how researchers improved on BERT's original recipe, and whether encoder m
 
 ---
 
-# What could be improved?
-
-<div class="warning-box" data-title="BERT's key limitations">
-
-**Training procedure:** NSP may hurt performance; static masking reuses the same masks every epoch; only 15% of tokens provide training signal.
-
-**Scale:** Trained on only 3.3B words with 100K steps — modern datasets are 100x larger.
-
-**Efficiency:** 110M parameters are all active for every input. Large memory footprint for deployment.
-
-</div>
-
----
-
 # RoBERTa: robustly optimized BERT
 
 <div class="definition-box" data-title="Key idea: better training = better performance">
@@ -86,23 +70,17 @@ Training procedure matters as much as architecture. RoBERTa shows that BERT was
 
 ---
 
-# Dynamic vs static masking
+# Dynamic masking
 
-<div class="note-box" data-title="How masking patterns are generated">
+<div class="note-box" data-title="RoBERTa's key innovation">
 
-**Static masking (BERT):**
-- Mask tokens once during preprocessing
-- Same masks reused every epoch
-- Example: "My [MASK] is cute" → same every time
-- Risk: model memorizes mask positions
+BERT used **static masking** — the same mask positions every epoch, risking memorization. RoBERTa generates **new masks on-the-fly** during training:
 
-**Dynamic masking (RoBERTa):**
-- Generate new masks on-the-fly during training
-- Different masks each time the same sequence is seen
 - Epoch 1: "My [MASK] is cute"
 - Epoch 2: "My dog [MASK] cute"
 - Epoch 3: "My dog is [MASK]"
-- Result: more diverse training signal, better generalization
+
+More diverse training signal → better generalization. Combined with removing NSP and training longer on more data, this alone accounts for most of RoBERTa's gains.
 
 </div>
 
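For readers following along outside the slides, the dynamic-masking idea in this hunk boils down to re-sampling mask positions on every pass through the data. A minimal PyTorch sketch follows; the `dynamic_mask` helper is hypothetical and simplified (it always substitutes `[MASK]` and omits BERT's 80/10/10 replace/random/keep split).

```python
import torch

def dynamic_mask(input_ids, mask_token_id, mlm_probability=0.15):
    """Sample a fresh mask pattern on every call (RoBERTa-style dynamic masking)."""
    labels = input_ids.clone()
    # ~15% of positions are selected anew each time this runs
    masked = torch.bernoulli(torch.full(input_ids.shape, mlm_probability)).bool()
    labels[~masked] = -100                 # ignore unmasked positions in the MLM loss
    corrupted = input_ids.clone()
    corrupted[masked] = mask_token_id      # simplified: always substitute [MASK]
    return corrupted, labels

# The same sequence receives a different mask each epoch:
ids = torch.tensor([[101, 2026, 3899, 2003, 10140, 102]])  # roughly "my dog is cute"
for epoch in range(3):
    masked_ids, labels = dynamic_mask(ids, mask_token_id=103)
    print(epoch, masked_ids.tolist())
```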
@@ -322,11 +300,7 @@ discriminator(corrupted)
 
 <div class="note-box" data-title="Learning from every token">
 
-**BERT:** Learns from 15% of tokens (masked ones only) — 85% of compute generates no training signal.
-
-**ELECTRA:** Learns from 100% of tokens (all get a real/replaced label) — every position contributes to learning.
-
-**Result:** ELECTRA reaches BERT-level performance with **4× less compute**.
+ELECTRA learns from **100% of tokens** (every position gets a real/replaced label), compared to BERT's 15%. This 6.7× increase in training signal means ELECTRA reaches BERT-level performance with **4× less compute**.
 
 </div>
 
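The "learn from every token" point is easiest to see in the loss: ELECTRA's replaced-token-detection head produces one real/replaced logit per position, so every position contributes to the objective. A toy sketch with made-up shapes; the tensors here are random placeholders, not the actual ELECTRA generator/discriminator.

```python
import torch
import torch.nn.functional as F

# Toy shapes: batch of 2 sequences, 8 tokens, hidden size 16 (placeholders only)
hidden = torch.randn(2, 8, 16)                     # discriminator output states
is_replaced = torch.randint(0, 2, (2, 8)).float()  # 1 where the generator swapped a token

rtd_head = torch.nn.Linear(16, 1)      # one logit per position: real vs. replaced
logits = rtd_head(hidden).squeeze(-1)  # shape (2, 8): every token gets a label

# Binary cross-entropy over ALL positions, unlike MLM's loss over ~15% of them
loss = F.binary_cross_entropy_with_logits(logits, is_replaced)
```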
@@ -482,12 +456,74 @@ for name, model_id in models.items():
 | Model | Year | Key innovation | Quality vs BERT |
 |-------|------|---------------|----------------|
 | BERT | 2018 | MLM + NSP | Baseline |
-| RoBERTa | 2019 | Better training recipe | +3-11 pts |
+| RoBERTa | 2019 | Better training recipe | +3–11 pts |
 | ALBERT | 2019 | Parameter sharing | 89% fewer params |
 | DistilBERT | 2019 | Knowledge distillation | 97% quality, 60% faster |
-| ELECTRA | 2020 | Learn from all tokens | 4x less compute |
+| ELECTRA | 2020 | Learn from all tokens | 4× less compute |
 | DeBERTa | 2020 | Disentangled attention | SOTA on SuperGLUE |
-| [ModernBERT](https://arxiv.org/abs/2412.13663) | 2024 | Modern training + RoPE + Flash Attention | SOTA on GLUE, retrieval |
+| ModernBERT | 2024 | Modern training + RoPE + Flash Attention | SOTA on GLUE, retrieval |
+
+</div>
+
+---
+
+# ModernBERT deep dive
+
+<div class="definition-box" data-title="Warner et al. (Dec 2024): bringing modern LLM techniques to encoders">
+
+[ModernBERT](https://arxiv.org/abs/2412.13663) applies 6 years of decoder innovations to the encoder architecture:
+
+| Innovation | BERT (2018) | ModernBERT (2024) |
+|-----------|-------------|-------------------|
+| Position encoding | Learned absolute (512 max) | **RoPE** (8,192 tokens) |
+| Attention | Full quadratic | **Flash Attention** + alternating global/local |
+| Padding | Processes pad tokens | **Unpadding** (only real tokens) |
+| Training data | 3.3B words | **2 trillion tokens** (600× more) |
+| Code understanding | None | Trained on code corpora |
+
+</div>
+
+<div class="important-box" data-title="Results">
+
+SOTA on GLUE, retrieval (MTEB), and code understanding benchmarks. Available as `answerdotai/ModernBERT-base` and `answerdotai/ModernBERT-large` on HuggingFace. Proves that the encoder architecture still has room to grow when given modern training techniques.
+
+</div>
+
+---
+
+# Gemma Encoder and the encoder renaissance
+
+<div class="note-box" data-title="Google bets on encoders again (2025)">
+
+[Gemma Encoder](https://arxiv.org/abs/2503.02656) (2025): Google's first encoder-only model since BERT, built by repurposing Gemma decoder weights for bidirectional encoding. Competitive with ModernBERT on sentence embedding tasks.
+
+</div>
+
+<div class="tip-box" data-title="Why this matters">
+
+If Google — a company betting heavily on decoder-only models (Gemini) — still releases an encoder model, it signals that encoders serve a purpose decoders can't efficiently fill. The encoder isn't dead; it's being **modernized**.
+
+</div>
+
+---
+
+# Production deployment patterns
+
+<div class="note-box" data-title="Getting encoders from prototype to production">
+
+| Optimization | Speedup | Memory savings | Quality loss |
+|-------------|---------|----------------|-------------|
+| **ONNX Runtime** | 2–5× | Moderate | None |
+| **TensorRT** (NVIDIA) | 5–10× | Moderate | None |
+| **INT8 quantization** | 2–4× | **4× smaller** | <1% |
+| **Distillation** (→ DistilBERT) | 1.6× | 40% smaller | ~3% |
+| **Full pipeline** (distill → quantize) | 10–20× | **8× smaller** | ~4% |
+
+</div>
+
+<div class="important-box" data-title="The cost argument">
+
+DistilBERT + INT8 quantization handles classification at ~$0.001/M tokens and ~2ms per request. GPT-4 costs ~$30/M tokens at ~500ms per request. For high-volume single-task workloads, optimized encoders are **30,000× cheaper** and **250× faster**.
 
 </div>
 
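As a concrete companion to the deployment table, here is a minimal sketch of the INT8 row using PyTorch's dynamic quantization on an off-the-shelf DistilBERT classifier. The checkpoint choice is illustrative; ONNX Runtime or TensorRT export would be the next step, and the same recipe applies to other encoder checkpoints such as `answerdotai/ModernBERT-base` with a task head.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative single-task encoder; any fine-tuned classification checkpoint works
model_id = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id).eval()

# INT8 dynamic quantization of the Linear layers: smaller weights, faster CPU matmuls
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

with torch.no_grad():
    batch = tokenizer("The movie was surprisingly good.", return_tensors="pt")
    label_id = quantized(**batch).logits.argmax(dim=-1).item()
    print(quantized.config.id2label[label_id])
```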
@@ -503,9 +539,9 @@ for name, model_id in models.items():
 - In-context learning eliminates the need for task-specific architectures
 
 **The case for "no, encoders still matter":**
-- [ModernBERT](https://arxiv.org/abs/2412.13663) (Dec 2024): encoder with modern techniques (RoPE, Flash Attention, 8192 context) achieves SOTA on retrieval and classification — faster and cheaper than any decoder
-- [Gemma Encoder](https://arxiv.org/abs/2503.02656) (2025): Google releases encoder-only Gemma, proving the architecture still has legs
-- Encoders are 10-100x cheaper to run than decoder models for classification tasks
+- ModernBERT (2024) achieves SOTA on retrieval and classification — faster and cheaper than any decoder
+- Gemma Encoder (2025) proves Google still sees value in the architecture
+- Encoders are 10–100× cheaper to run than decoder models for classification tasks
 - Most production search and retrieval systems still use encoders (Sentence-BERT, E5, NV-Embed)
 
 **The real answer:** It depends on your constraints. Encoders win on cost and latency. Decoders win on flexibility.
@@ -547,6 +583,8 @@ for name, model_id in models.items():
 
 [**Warner et al. (2024, *arXiv*)**](https://arxiv.org/abs/2412.13663) "ModernBERT" — Modern encoder with RoPE + Flash Attention.
 
+[**Google (2025, *arXiv*)**](https://arxiv.org/abs/2503.02656) "Gemma Encoder" — Encoder-only Gemma for sentence embeddings.
+
 </div>
 
 ---

slides/week6/lecture19.pdf

115 KB
Binary file not shown.
