Commit 51b6d73

Remove redundancies and add new content in lectures 19-26
Lectures 19-26 rewritten to eliminate content that repeated earlier lectures (15-18) and replace with new, substantive material:

- L19: Removed BERT recall slide, added ModernBERT deep dive, Gemma Encoder, production deployment patterns (ONNX/TensorRT/quantization)
- L20: Removed recall slide, added novel domain applications (clinical, legal, financial), Sentence-BERT and modern retrieval (MTEB)
- L21: Removed 4 redundant slides (transformer decoder, causal attention, BERT vs GPT), added BooksCorpus ethics, open-weight decoders (LLaMA revolution), test-time compute, weight tying
- L22: Condensed scaling laws and RLHF recaps, focused on overtraining era and why ChatGPT mattered
- L23: Stripped theory recaps from code slides, added weight tying, gradient accumulation, LR scheduling, mixed precision, nanoGPT comparison
- L24: Condensed CoT and RAG recaps, added OpenAI deep research and Claude Code mentions, focused agent table on agent paradigms
- L25: Removed repeated GPT parameter table, condensed Chinchilla recap
- L26: Replaced RLHF re-explanation with representation engineering and weak-to-strong generalization, added sycophancy bias

All 8 lectures recompiled to HTML and PDF. README updated with new topics.
1 parent 7a57e25 commit 51b6d73

25 files changed (+1006 / -653 lines)

slides/README.md

Lines changed: 14 additions & 4 deletions
@@ -203,23 +203,28 @@ Explore concepts hands-on with our interactive web demos! Each demo runs directl
 - 📊 [Slides PDF](https://contextlab.github.io/llm-course/slides/week6/lecture18.pdf) | 🌐 [Slides HTML](https://contextlab.github.io/llm-course/slides/week6/lecture18.html)
 
 **Wednesday (Lecture 19):** BERT Variants
-- RoBERTa, ALBERT, DistilBERT, ELECTRA, DeBERTa, ModernBERT
+- RoBERTa, ALBERT, DistilBERT, ELECTRA, DeBERTa, ModernBERT, Gemma Encoder
 - Training recipe improvements, parameter sharing, knowledge distillation, replaced token detection
+- ModernBERT deep dive: unpadding, RoPE, Flash Attention, 8192 context
+- Production deployment patterns: ONNX, TensorRT, quantization
 - Are encoder models still relevant in the era of GPT-4?
 - Reading: [Liu et al. (2019)](https://arxiv.org/abs/1907.11692) - RoBERTa
 - Reading: [Sanh et al. (2019)](https://arxiv.org/abs/1910.01108) - DistilBERT
 - Reading: [Clark et al. (2020)](https://arxiv.org/abs/2003.10555) - ELECTRA
 - Reading: [Warner et al. (2024)](https://arxiv.org/abs/2412.13663) - ModernBERT
+- Reading: [Google (2025)](https://arxiv.org/abs/2503.02656) - Gemma Encoder
 - 📓 [Companion Notebook](https://colab.research.google.com/github/ContextLab/llm-course/blob/main/slides/week6/bert_variants_demo.ipynb)
 - 📊 [Slides PDF](https://contextlab.github.io/llm-course/slides/week6/lecture19.pdf) | 🌐 [Slides HTML](https://contextlab.github.io/llm-course/slides/week6/lecture19.html)
 
 **Friday (Lecture 20):** Applications of Encoder Models
-- Classification, NER, question answering, semantic similarity
+- Novel applications: clinical NLP, legal tech, financial NER, scientific literature
+- Sentence-BERT and modern retrieval: from SBERT to E5 to NV-Embed (MTEB)
 - BERT in Google Search, industry adoption ($7.73B NLP market)
 - Brain-LLM alignment: neural encoding with language model representations
 - Systematic bias measurement (SAGED pipeline)
 - Reading: [Caucheteux & King (2022)](https://doi.org/10.1038/s42003-022-03036-1) - Brain-LLM alignment
 - Reading: [Aw et al. (2026)](https://openreview.net/forum?id=PgIlCCNxdB) - The Mind's Transformer (ICLR 2026)
+- Reading: [Reimers & Gurevych (2019)](https://arxiv.org/abs/1908.10084) - Sentence-BERT
 - Reading: [Jiang et al. (2025)](https://aclanthology.org/2025.coling-main.202.pdf) - SAGED bias evaluation
 - 📓 [Companion Notebook](https://colab.research.google.com/github/ContextLab/llm-course/blob/main/slides/week6/encoder_applications_demo.ipynb)
 - 📊 [Slides PDF](https://contextlab.github.io/llm-course/slides/week6/lecture20.pdf) | 🌐 [Slides HTML](https://contextlab.github.io/llm-course/slides/week6/lecture20.html)
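
To make the new "Sentence-BERT and modern retrieval" bullet concrete, here is a minimal retrieval sketch with the `sentence-transformers` library; the `all-MiniLM-L6-v2` checkpoint and the toy documents are illustrative choices, not part of the course materials, and E5 or NV-Embed checkpoints would slot in the same way.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative embedding model (not from the course repo)
model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "BERT is an encoder-only transformer.",
    "GPT models generate text autoregressively.",
    "Paris is the capital of France.",
]
query = "Which model family uses bidirectional attention?"

doc_emb = model.encode(docs, convert_to_tensor=True)      # one vector per document
query_emb = model.encode(query, convert_to_tensor=True)   # one vector for the query

scores = util.cos_sim(query_emb, doc_emb)   # cosine similarity: query vs. each doc
best = docs[int(scores.argmax())]
print(best)
```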
@@ -228,11 +233,15 @@ Explore concepts hands-on with our interactive web demos! Each demo runs directl
 ## Week 7: Decoder Models & GPT
 
 **Monday (Lecture 21):** GPT Architecture
-- Autoregressive generation, causal masking, the decoder stack
+- Generative pre-training paradigm, fine-tuning, weight tying
+- The BooksCorpus controversy: training data ethics
+- Open-weight decoders: the LLaMA revolution
+- Test-time compute and inference scaling
 - Modern decoder innovations: RMSNorm, SwiGLU, RoPE, GQA
 - Multi-token prediction and hybrid architectures (Jamba)
 - Reading: [Radford et al. (2018)](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf) - GPT-1
 - Reading: [Radford et al. (2019)](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) - GPT-2
+- Reading: [Touvron et al. (2023)](https://arxiv.org/abs/2302.13971) - LLaMA
 - Reading: [Gloeckle et al. (2024)](https://arxiv.org/abs/2404.19737) - Multi-token prediction
 - **📝 Assignment 4 Due (Feb 16, 11:59 PM EST)**
 - **Final Project Released:** [Final Project](https://contextlab.github.io/llm-course/assignments/final-project/) (Due: Mar 9, 11:59 PM EST)
@@ -252,8 +261,9 @@ Explore concepts hands-on with our interactive web demos! Each demo runs directl
 **Friday (Lecture 23):** Implementing GPT from Scratch
 - Build a complete mini-GPT (~30M params) in PyTorch
 - Tokenization, embeddings, masked attention, transformer blocks, training loop
+- Weight tying, gradient accumulation, LR scheduling, mixed precision training
 - Text generation: greedy, temperature, top-k, nucleus sampling
-- KV caching and FlashAttention for efficient inference
+- KV caching, FlashAttention, and nanoGPT comparison
 - Tutorial: [Let's build GPT (Karpathy)](https://www.youtube.com/watch?v=kCc8FmEb1nY)
 - Reading: [Dao et al. (2022)](https://arxiv.org/abs/2205.14135) - FlashAttention
 - 📓 [Companion Notebook](https://colab.research.google.com/github/ContextLab/llm-course/blob/main/slides/week7/gpt_from_scratch_demo.ipynb)
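
The new Lecture 23 bullet on weight tying, gradient accumulation, LR scheduling, and mixed precision combines naturally in one training loop. Below is a toy PyTorch sketch under stated assumptions: the stand-in model is an embedding plus a single transformer layer rather than the course's mini-GPT, and the causal mask is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 1000, 64
device = "cuda" if torch.cuda.is_available() else "cpu"

# Stand-in "decoder": embedding + one transformer layer + LM head (no causal mask here)
token_emb = nn.Embedding(vocab_size, d_model).to(device)
body = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True).to(device)
lm_head = nn.Linear(d_model, vocab_size, bias=False).to(device)
lm_head.weight = token_emb.weight   # weight tying: input and output embeddings share one matrix

params = list(token_emb.parameters()) + list(body.parameters())
optimizer = torch.optim.AdamW(params, lr=3e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)  # LR scheduling
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))                # mixed precision
accum_steps = 4   # gradient accumulation: 4 micro-batches per optimizer step

for step in range(8):
    ids = torch.randint(0, vocab_size, (2, 16), device=device)
    amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16
    with torch.autocast(device_type=device, dtype=amp_dtype):
        logits = lm_head(body(token_emb(ids)))
        # Next-token prediction loss, scaled so accumulated micro-batches sum to one batch
        loss = F.cross_entropy(
            logits[:, :-1].reshape(-1, vocab_size), ids[:, 1:].reshape(-1)
        ) / accum_steps
    scaler.scale(loss).backward()
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
        scheduler.step()
```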

slides/week6/lecture19.html

Lines changed: 146 additions & 61 deletions
Large diffs are not rendered by default.

slides/week6/lecture19.md

Lines changed: 81 additions & 43 deletions
@@ -29,17 +29,15 @@ Winter 2026
 
 ---
 
-# Recall: what BERT does
-
-<div class="note-box" data-title="From lectures 12 and 18">
+# What could be improved?
 
-We covered BERT's architecture (MLM + NSP, bidirectional attention) in Lecture 12, and explored what BERT actually *learns* (attention patterns, layer probing, neuroscience connections) in Lecture 18.
+<div class="warning-box" data-title="BERT's key limitations (from Lecture 18)">
 
-</div>
+**Training procedure:** NSP may hurt performance; static masking reuses the same masks every epoch; only 15% of tokens provide training signal.
 
-<div class="tip-box" data-title="Today's focus">
+**Scale:** Trained on only 3.3B words with 100K steps — modern datasets are 100× larger.
 
-Today: how researchers improved on BERT's original recipe, and whether encoder models still matter in 2026.
+**Efficiency:** 110M parameters are all active for every input. Large memory footprint for deployment.
 
 </div>
 
@@ -51,20 +49,6 @@ Today: how researchers improved on BERT's original recipe, and whether encoder m
 
 ---
 
-# What could be improved?
-
-<div class="warning-box" data-title="BERT's key limitations">
-
-**Training procedure:** NSP may hurt performance; static masking reuses the same masks every epoch; only 15% of tokens provide training signal.
-
-**Scale:** Trained on only 3.3B words with 100K steps — modern datasets are 100x larger.
-
-**Efficiency:** 110M parameters are all active for every input. Large memory footprint for deployment.
-
-</div>
-
----
-
 # RoBERTa: robustly optimized BERT
 
 <div class="definition-box" data-title="Key idea: better training = better performance">
@@ -86,23 +70,17 @@ Training procedure matters as much as architecture. RoBERTa shows that BERT was
 
 ---
 
-# Dynamic vs static masking
+# Dynamic masking
 
-<div class="note-box" data-title="How masking patterns are generated">
+<div class="note-box" data-title="RoBERTa's key innovation">
 
-**Static masking (BERT):**
-- Mask tokens once during preprocessing
-- Same masks reused every epoch
-- Example: "My [MASK] is cute" → same every time
-- Risk: model memorizes mask positions
+BERT used **static masking** — the same mask positions every epoch, risking memorization. RoBERTa generates **new masks on-the-fly** during training:
 
-**Dynamic masking (RoBERTa):**
-- Generate new masks on-the-fly during training
-- Different masks each time the same sequence is seen
 - Epoch 1: "My [MASK] is cute"
 - Epoch 2: "My dog [MASK] cute"
 - Epoch 3: "My dog is [MASK]"
-- Result: more diverse training signal, better generalization
+
+More diverse training signal → better generalization. Combined with removing NSP and training longer on more data, this alone accounts for most of RoBERTa's gains.
 
 </div>
 
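For readers following along outside the slides, the dynamic-masking idea in this hunk boils down to re-sampling mask positions on every pass through the data. A minimal PyTorch sketch follows; the `dynamic_mask` helper is hypothetical and simplified (it always substitutes `[MASK]` and omits BERT's 80/10/10 replace/random/keep split).

```python
import torch

def dynamic_mask(input_ids, mask_token_id, mlm_probability=0.15):
    """Sample a fresh mask pattern on every call (RoBERTa-style dynamic masking)."""
    labels = input_ids.clone()
    # ~15% of positions are selected anew each time this runs
    masked = torch.bernoulli(torch.full(input_ids.shape, mlm_probability)).bool()
    labels[~masked] = -100                 # ignore unmasked positions in the MLM loss
    corrupted = input_ids.clone()
    corrupted[masked] = mask_token_id      # simplified: always substitute [MASK]
    return corrupted, labels

# The same sequence receives a different mask each epoch:
ids = torch.tensor([[101, 2026, 3899, 2003, 10140, 102]])  # roughly "my dog is cute"
for epoch in range(3):
    masked_ids, labels = dynamic_mask(ids, mask_token_id=103)
    print(epoch, masked_ids.tolist())
```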
@@ -322,11 +300,7 @@ discriminator(corrupted)
 
 <div class="note-box" data-title="Learning from every token">
 
-**BERT:** Learns from 15% of tokens (masked ones only) — 85% of compute generates no training signal.
-
-**ELECTRA:** Learns from 100% of tokens (all get a real/replaced label) — every position contributes to learning.
-
-**Result:** ELECTRA reaches BERT-level performance with **4× less compute**.
+ELECTRA learns from **100% of tokens** (every position gets a real/replaced label), compared to BERT's 15%. This 6.7× increase in training signal means ELECTRA reaches BERT-level performance with **4× less compute**.
 
 </div>
 
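The "learn from every token" point is easiest to see in the loss: ELECTRA's replaced-token-detection head produces one real/replaced logit per position, so every position contributes to the objective. A toy sketch with made-up shapes; the tensors here are random placeholders, not the actual ELECTRA generator/discriminator.

```python
import torch
import torch.nn.functional as F

# Toy shapes: batch of 2 sequences, 8 tokens, hidden size 16 (placeholders only)
hidden = torch.randn(2, 8, 16)                     # discriminator output states
is_replaced = torch.randint(0, 2, (2, 8)).float()  # 1 where the generator swapped a token

rtd_head = torch.nn.Linear(16, 1)      # one logit per position: real vs. replaced
logits = rtd_head(hidden).squeeze(-1)  # shape (2, 8): every token gets a label

# Binary cross-entropy over ALL positions, unlike MLM's loss over ~15% of them
loss = F.binary_cross_entropy_with_logits(logits, is_replaced)
```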
@@ -482,12 +456,74 @@ for name, model_id in models.items():
 | Model | Year | Key innovation | Quality vs BERT |
 |-------|------|---------------|----------------|
 | BERT | 2018 | MLM + NSP | Baseline |
-| RoBERTa | 2019 | Better training recipe | +3-11 pts |
+| RoBERTa | 2019 | Better training recipe | +3–11 pts |
 | ALBERT | 2019 | Parameter sharing | 89% fewer params |
 | DistilBERT | 2019 | Knowledge distillation | 97% quality, 60% faster |
-| ELECTRA | 2020 | Learn from all tokens | 4x less compute |
+| ELECTRA | 2020 | Learn from all tokens | 4× less compute |
 | DeBERTa | 2020 | Disentangled attention | SOTA on SuperGLUE |
-| [ModernBERT](https://arxiv.org/abs/2412.13663) | 2024 | Modern training + RoPE + Flash Attention | SOTA on GLUE, retrieval |
+| ModernBERT | 2024 | Modern training + RoPE + Flash Attention | SOTA on GLUE, retrieval |
+
+</div>
+
+---
+
+# ModernBERT deep dive
+
+<div class="definition-box" data-title="Warner et al. (Dec 2024): bringing modern LLM techniques to encoders">
+
+[ModernBERT](https://arxiv.org/abs/2412.13663) applies 6 years of decoder innovations to the encoder architecture:
+
+| Innovation | BERT (2018) | ModernBERT (2024) |
+|-----------|-------------|-------------------|
+| Position encoding | Learned absolute (512 max) | **RoPE** (8,192 tokens) |
+| Attention | Full quadratic | **Flash Attention** + alternating global/local |
+| Padding | Processes pad tokens | **Unpadding** (only real tokens) |
+| Training data | 3.3B words | **2 trillion tokens** (600× more) |
+| Code understanding | None | Trained on code corpora |
+
+</div>
+
+<div class="important-box" data-title="Results">
+
+SOTA on GLUE, retrieval (MTEB), and code understanding benchmarks. Available as `answerdotai/ModernBERT-base` and `answerdotai/ModernBERT-large` on HuggingFace. Proves that the encoder architecture still has room to grow when given modern training techniques.
+
+</div>
+
+---
+
+# Gemma Encoder and the encoder renaissance
+
+<div class="note-box" data-title="Google bets on encoders again (2025)">
+
+[Gemma Encoder](https://arxiv.org/abs/2503.02656) (2025): Google's first encoder-only model since BERT, built by repurposing Gemma decoder weights for bidirectional encoding. Competitive with ModernBERT on sentence embedding tasks.
+
+</div>
+
+<div class="tip-box" data-title="Why this matters">
+
+If Google — a company betting heavily on decoder-only models (Gemini) — still releases an encoder model, it signals that encoders serve a purpose decoders can't efficiently fill. The encoder isn't dead; it's being **modernized**.
+
+</div>
+
+---
+
+# Production deployment patterns
+
+<div class="note-box" data-title="Getting encoders from prototype to production">
+
+| Optimization | Speedup | Memory savings | Quality loss |
+|-------------|---------|----------------|-------------|
+| **ONNX Runtime** | 2–5× | Moderate | None |
+| **TensorRT** (NVIDIA) | 5–10× | Moderate | None |
+| **INT8 quantization** | 2–4× | **4× smaller** | <1% |
+| **Distillation** (→ DistilBERT) | 1.6× | 40% smaller | ~3% |
+| **Full pipeline** (distill → quantize) | 10–20× | **8× smaller** | ~4% |
+
+</div>
+
+<div class="important-box" data-title="The cost argument">
+
+DistilBERT + INT8 quantization handles classification at ~$0.001/M tokens and ~2ms per request. GPT-4 costs ~$30/M tokens at ~500ms per request. For high-volume single-task workloads, optimized encoders are **30,000× cheaper** and **250× faster**.
 
 </div>
 
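As a concrete companion to the deployment table, here is a minimal sketch of the INT8 row using PyTorch's dynamic quantization on an off-the-shelf DistilBERT classifier. The checkpoint choice is illustrative; ONNX Runtime or TensorRT export would be the next step, and the same recipe applies to other encoder checkpoints such as `answerdotai/ModernBERT-base` with a task head.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative single-task encoder; any fine-tuned classification checkpoint works
model_id = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id).eval()

# INT8 dynamic quantization of the Linear layers: smaller weights, faster CPU matmuls
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

with torch.no_grad():
    batch = tokenizer("The movie was surprisingly good.", return_tensors="pt")
    label_id = quantized(**batch).logits.argmax(dim=-1).item()
    print(quantized.config.id2label[label_id])
```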
@@ -503,9 +539,9 @@ for name, model_id in models.items():
 - In-context learning eliminates the need for task-specific architectures
 
 **The case for "no, encoders still matter":**
-- [ModernBERT](https://arxiv.org/abs/2412.13663) (Dec 2024): encoder with modern techniques (RoPE, Flash Attention, 8192 context) achieves SOTA on retrieval and classification — faster and cheaper than any decoder
-- [Gemma Encoder](https://arxiv.org/abs/2503.02656) (2025): Google releases encoder-only Gemma, proving the architecture still has legs
-- Encoders are 10-100x cheaper to run than decoder models for classification tasks
+- ModernBERT (2024) achieves SOTA on retrieval and classification — faster and cheaper than any decoder
+- Gemma Encoder (2025) proves Google still sees value in the architecture
+- Encoders are 10–100× cheaper to run than decoder models for classification tasks
 - Most production search and retrieval systems still use encoders (Sentence-BERT, E5, NV-Embed)
 
 **The real answer:** It depends on your constraints. Encoders win on cost and latency. Decoders win on flexibility.
@@ -547,6 +583,8 @@ for name, model_id in models.items():
 
 [**Warner et al. (2024, *arXiv*)**](https://arxiv.org/abs/2412.13663) "ModernBERT" — Modern encoder with RoPE + Flash Attention.
 
+[**Google (2025, *arXiv*)**](https://arxiv.org/abs/2503.02656) "Gemma Encoder" — Encoder-only Gemma for sentence embeddings.
+
 </div>
 
 ---

slides/week6/lecture19.pdf

115 KB
Binary file not shown.
