BERT used **static masking** — the same mask positions every epoch, risking memorization. RoBERTa generates **new masks on-the-fly** during training:

- Epoch 1: "My [MASK] is cute"
- Epoch 2: "My dog [MASK] cute"
- Epoch 3: "My dog is [MASK]"

More diverse training signal → better generalization. Combined with removing NSP and training longer on more data, this alone accounts for most of RoBERTa's gains.

</div>
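
For intuition, here is a minimal sketch of dynamic masking using Hugging Face's `DataCollatorForLanguageModeling`, which re-draws mask positions every time it builds a batch; the `roberta-base` checkpoint and 15% masking rate are illustrative choices, not RoBERTa's exact training pipeline:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Illustrative checkpoint; any BERT/RoBERTa-style tokenizer behaves the same way.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# The collator re-draws mask positions every time it assembles a batch, so the
# same sentence is masked differently on every epoch (dynamic masking).
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

example = tokenizer("My dog is cute")
for epoch in range(3):
    batch = collator([example])
    print(f"epoch {epoch + 1}:", tokenizer.decode(batch["input_ids"][0]))
```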
---
<div class="note-box" data-title="Learning from every token">

ELECTRA learns from **100% of tokens** (every position gets a real/replaced label), compared to BERT's 15%, where the remaining 85% of compute generates no training signal. This 6.7× increase in training signal means ELECTRA reaches BERT-level performance with **4× less compute**.

</div>
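
A minimal sketch of what "a label at every position" looks like in practice, using the publicly released ELECTRA discriminator from `transformers`; the checkpoint name and the planted fake token are illustrative:

```python
import torch
from transformers import ElectraForPreTraining, ElectraTokenizerFast

# Public ELECTRA discriminator checkpoint (illustrative choice).
name = "google/electra-small-discriminator"
tokenizer = ElectraTokenizerFast.from_pretrained(name)
model = ElectraForPreTraining.from_pretrained(name)

# "fake" replaces the original word, as the generator would during pre-training.
sentence = "The quick brown fox fake over the lazy dog"
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # one real/replaced score per token position

predictions = (logits > 0).long()[0].tolist()
for token, pred in zip(tokenizer.tokenize(sentence), predictions[1:-1]):
    print(f"{token:>8}  {'replaced' if pred else 'real'}")
```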
---
| Model | Year | Key innovation | Quality vs BERT |
|-------|------|---------------|----------------|
| BERT | 2018 | MLM + NSP | Baseline |
| RoBERTa | 2019 | Better training recipe | +3–11 pts |
ModernBERT achieves SOTA on GLUE, retrieval (MTEB), and code understanding benchmarks. It is available as `answerdotai/ModernBERT-base` and `answerdotai/ModernBERT-large` on HuggingFace, and proves that the encoder architecture still has room to grow when given modern training techniques.
</div>
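
As a quick way to try it, here is a minimal fill-mask sketch with the `answerdotai/ModernBERT-base` checkpoint mentioned above; it assumes a recent `transformers` release that includes ModernBERT support:

```python
from transformers import pipeline

# ModernBERT is a drop-in masked-language model for the standard pipeline API.
fill_mask = pipeline("fill-mask", model="answerdotai/ModernBERT-base")

for candidate in fill_mask("Encoder models are ideal for [MASK] tasks."):
    print(f"{candidate['token_str']:>15}  {candidate['score']:.3f}")
```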
---
# Gemma Encoder and the encoder renaissance
<div class="note-box" data-title="Google bets on encoders again (2025)">
[Gemma Encoder](https://arxiv.org/abs/2503.02656) (2025): Google's first encoder-only model since BERT, built by repurposing Gemma decoder weights for bidirectional encoding. Competitive with ModernBERT on sentence embedding tasks.
</div>
<div class="tip-box" data-title="Why this matters">
If Google — a company betting heavily on decoder-only models (Gemini) — still releases an encoder model, it signals that encoders serve a purpose decoders can't efficiently fill. The encoder isn't dead; it's being **modernized**.
</div>
---
# Production deployment patterns
<div class="note-box" data-title="Getting encoders from prototype to production">
DistilBERT + INT8 quantization handles classification at ~$0.001/M tokens and ~2ms per request. GPT-4 costs ~$30/M tokens at ~500ms per request. For high-volume single-task workloads, optimized encoders are **30,000× cheaper** and **250× faster**.
</div>
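
The pattern in that box can be sketched with PyTorch's dynamic INT8 quantization for CPU inference; the checkpoint is a public example chosen for illustration, and the cost/latency figures above are not reproduced by this snippet:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Public sentiment checkpoint, used here only as an illustrative single-task classifier.
model_id = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

# Dynamic quantization converts Linear layer weights to INT8 for cheaper CPU inference.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

inputs = tokenizer("The new encoder stack cut our inference bill dramatically.",
                   return_tensors="pt")
with torch.no_grad():
    logits = quantized(**inputs).logits
print(quantized.config.id2label[int(logits.argmax(dim=-1))])
```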
---
**The case for "yes, encoders are dead":**
- In-context learning eliminates the need for task-specific architectures
**The case for "no, encoders still matter":**
- [ModernBERT](https://arxiv.org/abs/2412.13663) (2024) achieves SOTA on retrieval and classification — faster and cheaper than any decoder
- [Gemma Encoder](https://arxiv.org/abs/2503.02656) (2025) proves Google still sees value in the architecture
- Encoders are 10–100× cheaper to run than decoder models for classification tasks
- Most production search and retrieval systems still use encoders (Sentence-BERT, E5, NV-Embed)
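
To make the last point concrete, here is a minimal retrieval sketch in the Sentence-BERT style, using the `sentence-transformers` library with a small public checkpoint as an illustrative choice (not necessarily what any particular production system runs):

```python
from sentence_transformers import SentenceTransformer, util

# Small public embedding model (illustrative choice).
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

documents = [
    "RoBERTa improves BERT with dynamic masking and more data.",
    "ELECTRA trains a discriminator on replaced-token detection.",
    "Decoder-only models generate text autoregressively.",
]
query = "How does ELECTRA's pre-training objective work?"

# One encoder forward pass per text; document embeddings can be cached and reused.
doc_embeddings = model.encode(documents, normalize_embeddings=True)
query_embedding = model.encode(query, normalize_embeddings=True)

scores = util.cos_sim(query_embedding, doc_embeddings)[0]
best = int(scores.argmax())
print(f"Best match ({float(scores[best]):.2f}): {documents[best]}")
```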
**The real answer:** It depends on your constraints. Encoders win on cost and latency. Decoders win on flexibility.
---
[**Warner et al. (2024, *arXiv*)**](https://arxiv.org/abs/2412.13663) "ModernBERT" — Modern encoder with RoPE + Flash Attention.