Commit a782e84

docs: Integrate RLM training stack into Craftsman Ultra ADR/DDD
Add Phase 0.5: RLM Post-Quantization Refinement, a $0 Mac Studio approach that uses the existing RLM stack (MicroLoRA, GRPO, EWC++, ContrastiveTrainer, MemoryDistiller, PolicyStore) to refine the Phase 0 PTQ model by training only FP16 components (~1-2% of params).

ADR-017 changes:
- Added Phase 0.5 to phased decision: A(0C) → RLM Refinement → D → C → B
- Added AD-19: RLM Post-Quantization Refinement architecture
- Frozen ternary weights + trainable FP16 (LoRA, router, scales)
- ~200-400M trainable params (1-2% of 30B), 100-500M training tokens
- 100% RLM code reuse, 0% new training code
- 2-12 days on Mac Studio Metal, $0 cost
- Expected quality: ~70-80% of FP16 (up from 55-65% Phase 0 PTQ)
- Full pipeline diagram: Router repair → MicroLoRA injection → Scale opt
- Memory budget analysis: ~12-20 GB active RAM (fits any Mac Studio)
- Training schedule: 3-14 days total wall time
- Added Phase 0.5 exit criteria (11 items)
- Updated infrastructure table with Phase 0.5 row
- Updated consequences with RLM refinement benefits

DDD v2.2 changes:
- Added Section 3.8.1: Phase 0.5 RLM Refinement Mode
- Added 5 ubiquitous language terms (RLM Refinement, Frozen Ternary, LoRA Correction, Router Repair)
- Added 3 open questions (LoRA rank, GGUF persistence, Phase continuity)

Key insight: RLM trains ~1% of parameters → needs ~0.05-0.25% of the data (100-500M vs 200B tokens) → Mac Studio Metal is sufficient → $0 cost.

https://claude.ai/code/session_011nTcGcn49b8YKJRVoh4TaK
1 parent 0bb8ac6 commit a782e84

2 files changed

Lines changed: 213 additions & 6 deletions

File tree

docs/adr/ADR-017-craftsman-ultra-30b-1bit-bitnet-integration.md

Lines changed: 172 additions & 4 deletions
@@ -364,7 +364,7 @@ Keep GLM-4.7-Flash structure but replace only the expert MLP layers with BitLine
364364

365365
## Decision
366366

367-
**Phased approach: A(0C) → D → C → B**
367+
**Phased approach: A(0C) → RLM Refinement → D → C → B**
368368

369369
### Phase 0: PTQ Rapid Prototype (Option A, Sub-option 0C)
370370
- **Timeline**: 1-2 weeks
@@ -382,13 +382,37 @@ Keep GLM-4.7-Flash structure but replace only the expert MLP layers with BitLine
382382
- **Why Mac Studio works**: Phase 0 is PTQ (no training loop) — just load FP16 weights via mmap, compute absmean per block, round to ternary, export. The absmean computation is trivial math; the bottleneck is memory bandwidth, not compute. Calibration forward pass uses Metal GPU acceleration via existing Candle integration.
383383
- **Optional upgrade (0D)**: If 0C quality is too low for meaningful testing, apply BitDistill Lite (10B tokens, ~$300 cloud or ~$0 on Mac Studio over several weeks) to reach ~90-95% quality
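The PTQ step described above (absmean per block, then round to ternary) can be sketched as follows. This is a minimal illustration, assuming one FP32 slice per block and a single scale per block; the function name `absmean_ternary_block` and its signature are hypothetical and are not the repository's `absmean_ternary()`.

```rust
/// Hypothetical illustration of absmean ternary quantization for one block.
/// Name and signature are assumptions, not the project's actual API.
pub fn absmean_ternary_block(weights: &[f32]) -> (Vec<i8>, f32) {
    if weights.is_empty() {
        return (Vec::new(), 0.0);
    }
    // Block scale: mean absolute value of the floating-point weights.
    let scale = weights.iter().map(|w| w.abs()).sum::<f32>() / weights.len() as f32;
    if scale == 0.0 {
        // All-zero block: every ternary value is 0.
        return (vec![0; weights.len()], 0.0);
    }
    // Divide by the scale, round to the nearest integer, clamp to {-1, 0, +1}.
    let ternary: Vec<i8> = weights
        .iter()
        .map(|w| (w / scale).round().clamp(-1.0, 1.0) as i8)
        .collect();
    (ternary, scale)
}

/// Dequantize a block: each ternary value times the block scale.
pub fn dequantize_block(ternary: &[i8], scale: f32) -> Vec<f32> {
    ternary.iter().map(|&t| t as f32 * scale).collect()
}
```

For the block `[0.9, -0.1, 0.5, -1.3]` the scale is the mean magnitude 0.7, so the small weight collapses to 0 and the large ones saturate at ±1. That rounding error is what the later phases try to recover.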
384384

385+
### Phase 0.5: RLM Post-Quantization Refinement (NEW — Mac Studio, $0)
386+
- **Timeline**: 1-3 weeks (overlaps with Phase 0 kernel development)
387+
- **Cost**: **$0** (runs on Mac Studio, ~2-12 days training wall time)
388+
- **Platform**: Mac Studio (same as Phase 0)
389+
- **Goal**: Improve Phase 0 PTQ quality from ~55-65% to ~70-80% by training only the small FP16 components using the existing RLM stack — **no traditional distillation, no cloud GPU**
390+
- **Approach**: Freeze ternary weights, train FP16 corrections using RLM components:
391+
1. **MicroLoRA adapters** (rank 1-2) on each expert FFN — adds small FP16 correction: `Y = BitLinear(X) + LoRA_B @ LoRA_A @ X`
392+
2. **Router fine-tuning** via ContrastiveTrainer — corrects misrouting caused by PTQ weight changes
393+
3. **Scale factor optimization** via GRPO rewards — per-block FP16 absmean scales are differentiable
394+
4. **EWC++ regularization** — prevents router fix from breaking already-good routing paths
395+
5. **Quality tracking** via MemoryDistiller — identifies worst-degraded experts for focused training
396+
6. **Policy persistence** via PolicyStore — stores optimized per-layer configurations
397+
- **Trainable parameters**: ~200-400M (1-2% of 30B total) — router (~30M), MicroLoRA adapters (~50-100M), LM head (~150M), scale factors (~0.1M)
398+
- **Training data**: 100M-500M tokens (sufficient for <400M trainable params)
399+
- **Throughput**: 100M-500M tokens ÷ ~500-1000 tok/s (Metal) ≈ 1.2-11.6 days, budgeted as **2-12 days on Mac Studio**
400+
- **Deliverables**:
401+
- RLM-refined GGUF with ternary experts + optimized FP16 components
402+
- MicroLoRA adapter weights (exportable, ~20-100 MB)
403+
- Optimized router weights and scale factors
404+
- Quality benchmarks showing improvement over Phase 0 baseline
405+
- **Expected quality**: **~70-80% of GLM-4.7-Flash** (up from ~55-65% Phase 0 PTQ)
406+
- **Key value**: Gets a usable model on Mac Studio at $0 before committing to cloud GPU. If 70-80% quality is sufficient for the use case, Phase 1 cloud distillation may be deferred or skipped entirely.
407+
- **100% RLM code reuse**: MicroLoRA, TrainingPipeline, EwcRegularizer, GrpoOptimizer, ContrastiveTrainer, MemoryDistiller, PolicyStore — all production-tested, zero new training code needed
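The correction in item 1, `Y = BitLinear(X) + LoRA_B @ LoRA_A @ X`, can be sketched for a single input vector as below. This is a hedged illustration only: the function names, the `Vec<Vec<_>>` layout, and the use of one scale for the whole matrix (rather than per-block scales) are assumptions made for brevity, not the MicroLoRA or BitLinear APIs.

```rust
/// Frozen ternary layer: y[i] = scale * sum_j t[i][j] * x[j].
/// Because t is in {-1, 0, +1}, the inner sum needs only add/subtract.
pub fn bitlinear_forward(ternary: &[Vec<i8>], scale: f32, x: &[f32]) -> Vec<f32> {
    ternary
        .iter()
        .map(|row| {
            let mut acc = 0.0f32;
            for (&t, &xj) in row.iter().zip(x.iter()) {
                match t {
                    1 => acc += xj,
                    -1 => acc -= xj,
                    _ => {}
                }
            }
            acc * scale
        })
        .collect()
}

/// Dense matrix-vector product used for the small FP adapter matrices.
fn matvec(m: &[Vec<f32>], v: &[f32]) -> Vec<f32> {
    m.iter()
        .map(|row| row.iter().zip(v.iter()).map(|(a, b)| a * b).sum::<f32>())
        .collect()
}

/// Y = BitLinear(X) + LoRA_B @ (LoRA_A @ X).
/// lora_a is (rank x in_dim), lora_b is (out_dim x rank); only these are trainable.
pub fn forward_with_lora(
    ternary: &[Vec<i8>],
    scale: f32,
    lora_a: &[Vec<f32>],
    lora_b: &[Vec<f32>],
    x: &[f32],
) -> Vec<f32> {
    let base = bitlinear_forward(ternary, scale, x);
    let down = matvec(lora_a, x); // project to rank-r space
    let up = matvec(lora_b, &down); // project back to out_dim
    base.iter().zip(up.iter()).map(|(b, u)| b + u).collect()
}
```

With `lora_b` initialized to zeros the layer reproduces the frozen ternary output exactly, so training starts from the Phase 0 behaviour and only adds a low-rank correction on top.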
408+
385409
### Phase 1: BitNet Expert Replacement (Option D)
386410
- **Timeline**: 3-4 months
387411
- **Cost**: ~$1,300-$2,000 (4× A100 spot, ~46 days)
388-
- **Goal**: Full-quality ternary experts via distillation, validated against Phase 0 baseline
412+
- **Goal**: Full-quality ternary experts via distillation, validated against Phase 0/0.5 baselines
389413
- **Deliverables**: Working Craftsman Ultra 30b 1bit (mixed: ternary experts, FP16 attention)
390414
- **Expected quality**: ~90-95% of GLM-4.7-Flash on coding benchmarks
391-
- **Prerequisites**: Phase 0 validates inference pipeline works end-to-end
415+
- **Prerequisites**: Phase 0 validates inference pipeline; Phase 0.5 provides quality baseline
392416

393417
### Phase 2: Full BitNet Distillation (Option C)
394418
- **Timeline**: 4-6 months after Phase 1
@@ -848,6 +872,7 @@ let expert_results: Vec<DistillResult> = experts
848872
|-------|----------|----------|----------------|----------|
849873
| **Phase 0 (PTQ)** | **Mac Studio (M4 Max/M3 Ultra)** | **1-4 hours** | **$0** | **Mmap FP16 weights → absmean quantize → export GGUF; Metal GPU for calibration pass** |
850874
| Phase 0D (BitDistill Lite, 10B tok) | Mac Studio Metal or 1× A100 spot | 2-4 weeks (local) / 1-2 days (cloud) | $0 (local) / ~$300 (cloud) | Optional quality upgrade if Phase 0C too degraded |
875+
| **Phase 0.5 (RLM refinement, 100-500M tok)** | **Mac Studio (Metal)** | **3-14 days** | **$0** | **MicroLoRA + router fix + scale opt using existing RLM stack** |
851876
| Phase 1 (expert FFN, 200B tok) | 4× A100 80GB spot (GCP) | ~46 days | $1,300-$2,000 | Per-expert sequential with EWC++; each expert fits 1 GPU |
852877
| Phase 1 (router validation) | Mac Studio Metal or 1× A100 | ~2-4 hours | $0 (local) / <$10 (cloud) | Contrastive training on router only (~2B params) |
853878
| Phase 2 (full ternary, 500B tok) | 4× H100 (DataCrunch) | ~16-32 days | $2,500-$5,000 | All layers; model-parallel across GPUs |
@@ -1002,6 +1027,134 @@ pub struct PtBitnetConfig {
10021027
**Reused**: GGUF parser, tensor metadata, `GgufQuantType` enum, export pipeline.
10031028
**New**: `PtBitnetQuantizer`, `absmean_ternary()`, `BITNET_T158` dequantization kernel.
10041029

1030+
### AD-19: Phase 0.5 — RLM Post-Quantization Refinement (No Traditional Training)
1031+
1032+
**Decision**: Use the existing RLM training stack to refine the Phase 0 PTQ model on Mac Studio by training only the small FP16 components (~1-2% of parameters), freezing ternary weights. This replaces traditional distillation for the rapid prototype phase.
1033+
1034+
**Rationale**: Traditional knowledge distillation (Phase 1) requires shadow weights, a straight-through estimator, and GPU-scale compute to modify the ternary weights themselves. However, the Phase 0 PTQ model already has ternary weights; its quality loss comes from:
1035+
1. Sub-optimal per-block scale factors (absmean is a rough approximation)
1036+
2. MoE router misrouting tokens to wrong experts (expert output distributions changed)
1037+
3. No adaptation to ternary output characteristics
1038+
1039+
All three can be addressed by training only the FP16 components using the existing RLM stack, without touching the ternary weights.
1040+
1041+
**What gets trained (FP16, differentiable) vs frozen (ternary, not differentiable):**
1042+
1043+
| Component | Params | Size | Trainable? | Training Method |
1044+
|-----------|--------|------|------------|----------------|
1045+
| Expert FFN ternary weights | ~28B | ~5.5 GB | **Frozen** | N/A — {-1,0,+1} not differentiable |
1046+
| MicroLoRA adapters (rank-2, per expert FFN) | ~50-100M | ~100-200 MB | **Yes** | `TrainingPipeline` + `EwcRegularizer` |
1047+
| MoE router gating weights | ~30M | ~60 MB | **Yes** | `ContrastiveTrainer` (triplet + InfoNCE) |
1048+
| Per-block absmean scale factors | ~0.1M | ~200 KB | **Yes** | GRPO reward-guided optimization |
1049+
| LM head (output projection) | ~150M | ~300 MB | **Yes (optional)** | Standard fine-tuning |
1050+
| Attention Q/K/V/O (FP16) | ~2B | ~4 GB | **Optional** | Can add LoRA here too if budget allows |
1051+
| **Total trainable** | **~200-400M** | **~400-800 MB** | | **~1-2% of 30B total** |
1052+
1053+
**Why RLM works here (vs traditional distillation):**
1054+
1055+
| Property | Traditional KD (Phase 1) | RLM Refinement (Phase 0.5) |
1056+
|----------|--------------------------|----------------------------|
1057+
| Modifies ternary weights | Yes (shadow weights + STE) | No (frozen) |
1058+
| Trainable params | ~28B (all expert weights) | ~200-400M (1-2%) |
1059+
| Training tokens needed | 200B | 100M-500M (400-2000x fewer) |
1060+
| GPU requirement | 4× A100 ($1,300+) | Mac Studio Metal ($0) |
1061+
| Training time | ~46 days (cloud) | **2-12 days (local)** |
1062+
| Quality target | ~90-95% of FP16 | ~70-80% of FP16 |
1063+
| New code required | ~15,000 lines (BitLinear, STE, orchestrator) | **~200-300 lines** (thin `RlmRefiner` orchestrator; no new training code) |
1064+
1065+
**RLM component mapping:**
1066+
1067+
```
1068+
┌──────────────────────────────────────────────────────────────────┐
1069+
│ Phase 0.5: RLM Refinement Pipeline │
1070+
│ (100% existing RLM code, 0% new training code) │
1071+
│ │
1072+
│ Frozen Ternary Model (Phase 0 PTQ output) │
1073+
│ ┌────────────────────────────────────────────┐ │
1074+
│ │ Expert FFNs: {-1,0,+1} weights (FROZEN) │ │
1075+
│ │ Router: FP16 gating (TRAINABLE) │ │
1076+
│ │ Attention: FP16 (TRAINABLE via LoRA opt.) │ │
1077+
│ │ Scales: FP16 per-block (TRAINABLE) │ │
1078+
│ └────────────────────────────────────────────┘ │
1079+
│ │ │
1080+
│ ┌─────▼──────────────────────────────────────────┐ │
1081+
│ │ Step 1: Router Repair │ │
1082+
│ │ ContrastiveTrainer (REUSED, contrastive.rs) │ │
1083+
│ │ • Generate triplets: anchor=hidden, +correct │ │
1084+
│ │ expert, -wrong expert │ │
1085+
│ │ • Triplet + InfoNCE loss on FP16 router │ │
1086+
│ │ • Fix misrouting from PTQ weight changes │ │
1087+
│ │ Training: ~10M tokens, ~3-6 hours (Metal) │ │
1088+
│ └─────┬──────────────────────────────────────────┘ │
1089+
│ │ │
1090+
│ ┌─────▼──────────────────────────────────────────┐ │
1091+
│ │ Step 2: MicroLoRA Injection + Training │ │
1092+
│ │ TrainingPipeline + MicroLoRA (REUSED, │ │
1093+
│ │ lora/training.rs + lora/micro_lora.rs) │ │
1094+
│ │ • Rank-2 LoRA per expert FFN: Y = BitLinear(X) │ │
1095+
│ │ + LoRA_B @ LoRA_A @ X │ │
1096+
│ │ • Loss: MSE(teacher_output, student+LoRA) │ │
1097+
│ │ • EWC++ across expert phases │ │
1098+
│ │ Training: ~100-500M tokens, ~2-12 days (Metal) │ │
1099+
│ └─────┬──────────────────────────────────────────┘ │
1100+
│ │ │
1101+
│ ┌─────▼──────────────────────────────────────────┐ │
1102+
│ │ Step 3: Scale Factor + Quality Optimization │ │
1103+
│ │ GrpoOptimizer (REUSED, grpo.rs) │ │
1104+
│ │ • Per-expert output quality → reward signal │ │
1105+
│ │ • Optimize FP16 scale factors to maximize │ │
1106+
│ │ cosine similarity with teacher output │ │
1107+
│ │ • Adaptive KL prevents over-correction │ │
1108+
│ │ Training: concurrent with Step 2 │ │
1109+
│ └─────┬──────────────────────────────────────────┘ │
1110+
│ │ │
1111+
│ ┌─────▼──────────────────────────────────────────┐ │
1112+
│ │ Feedback Loop │ │
1113+
│ │ MemoryDistiller → KeyLessons (REUSED) │ │
1114+
│ │ PolicyStore → TernaryScale policies (REUSED) │ │
1115+
│ │ • Track which experts improve most │ │
1116+
│ │ • Store optimized configs for reproducibility │ │
1117+
│ └────────────────────────────────────────────────┘ │
1118+
└──────────────────────────────────────────────────────────────────┘
1119+
```
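The loss and reward terms named in the diagram can be sketched in a few lines. The functions below are illustrative assumptions, not the signatures in `contrastive.rs`, `grpo.rs`, or the `EwcRegularizer`: a hinge-style triplet loss on Euclidean distance (Step 1, InfoNCE omitted), a cosine-similarity reward against the teacher output (Step 3), and a plain EWC quadratic penalty weighted by a Fisher diagonal (the online EWC++ bookkeeping is not shown).

```rust
/// Euclidean distance between two equal-length vectors.
fn euclidean(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b.iter())
        .map(|(x, y)| (x - y) * (x - y))
        .sum::<f32>()
        .sqrt()
}

/// Triplet loss: pull the anchor (hidden state) towards the correct expert
/// embedding and push it away from the wrong one by at least `margin`.
pub fn triplet_loss(anchor: &[f32], positive: &[f32], negative: &[f32], margin: f32) -> f32 {
    (euclidean(anchor, positive) - euclidean(anchor, negative) + margin).max(0.0)
}

/// Cosine similarity between student and teacher outputs, usable as a reward.
/// Returns 0.0 if either vector has zero norm.
pub fn cosine_reward(student: &[f32], teacher: &[f32]) -> f32 {
    let dot: f32 = student.iter().zip(teacher.iter()).map(|(s, t)| s * t).sum();
    let ns = student.iter().map(|s| s * s).sum::<f32>().sqrt();
    let nt = teacher.iter().map(|t| t * t).sum::<f32>().sqrt();
    if ns == 0.0 || nt == 0.0 {
        return 0.0;
    }
    dot / (ns * nt)
}

/// EWC penalty: 0.5 * lambda * sum_i F_i * (theta_i - theta_star_i)^2,
/// where F is the Fisher diagonal and theta_star the previously learned weights.
pub fn ewc_penalty(params: &[f32], anchor_params: &[f32], fisher: &[f32], lambda: f32) -> f32 {
    let sum: f32 = params
        .iter()
        .zip(anchor_params.iter())
        .zip(fisher.iter())
        .map(|((p, a), f)| f * (p - a) * (p - a))
        .sum();
    0.5 * lambda * sum
}
```

In this sketch a triplet that is already separated by more than the margin contributes zero loss, which is the property that lets router repair leave already-correct routing decisions mostly untouched.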
1120+
1121+
**Memory budget on Mac Studio during Phase 0.5 training:**
1122+
1123+
| Component | Size | Notes |
1124+
|-----------|------|-------|
1125+
| PTQ ternary model (mmap) | ~7 GB disk / ~3-7 GB RAM | Demand-paged; only active expert pages in RAM |
1126+
| Teacher FP16 model (mmap) | ~60 GB disk / ~4-8 GB RAM | Only forward pass activations; demand-paged |
1127+
| MicroLoRA adapters (rank-2) | ~200 MB | All experts in RAM |
1128+
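| LoRA gradients + optimizer (AdamW 2×FP32) | ~1.5 GB | Sized for the ~100M MicroLoRA params (FP32 gradient + 2 FP32 moments ≈ 12 bytes/param); optimizing all ~400M trainable params at once would need roughly 4.8 GB |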
1129+
| EWC++ Fisher diagonal | ~200 MB | Per-expert accumulated |
1130+
| KV cache + activations | ~2 GB | Calibration/training forward pass |
1131+
| **Total active RAM** | **~12-20 GB** | **Fits in any Mac Studio config** |
1132+
1133+
**Key insight**: The teacher model is only needed for forward pass (no gradients), so it can be mmap'd and demand-paged. The ternary student is similarly mmap'd. Only the ~400M trainable parameters and their optimizer state need to be fully in RAM (~2 GB), which fits comfortably in even the 36GB M4 Max.
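The size of the in-RAM training state can be estimated with simple arithmetic. The sketch below assumes an FP32 gradient plus two FP32 AdamW moments per trainable parameter (12 bytes in total); these byte widths and the helper names are assumptions, and the actual `TrainingPipeline` may store state differently.

```rust
/// Assumed bytes per trainable parameter: one FP32 gradient + two FP32 AdamW moments.
pub const GRAD_PLUS_ADAMW_BYTES: u64 = 4 + 2 * 4;

/// Training-state size in decimal gigabytes for a given parameter count.
pub fn training_state_gb(trainable_params: u64, bytes_per_param: u64) -> f64 {
    (trainable_params * bytes_per_param) as f64 / 1e9
}
```

Under these assumptions roughly 100M adapter parameters need about 1.2 GB of gradient and optimizer state, and 400M parameters about 4.8 GB. Either figure is small next to the mmap'd teacher and student, which is the point the memory budget relies on.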
1134+
1135+
**Training schedule on Mac Studio M4 Max 128GB:**
1136+
1137+
| Step | Tokens | Wall Time | What Changes |
1138+
|------|--------|-----------|-------------|
1139+
| Router repair | ~10M | ~3-6 hours | FP16 router gating weights |
1140+
| LoRA training (per-expert, sequential) | ~100-500M | 2-12 days | MicroLoRA A/B matrices per expert FFN |
1141+
| Scale optimization | ~10M | ~3-6 hours | Per-block FP16 absmean scales |
1142+
| Validation + export | n/a | ~1-2 hours | Benchmark + GGUF re-export |
1143+
| **Total** | **~120-520M** | **~3-14 days** | |
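The wall-time figures follow from dividing token counts by throughput. A minimal sketch, assuming the ~500-1000 tok/s Metal throughput quoted for Phase 0.5 holds for every step; the helper names are hypothetical.

```rust
const SECONDS_PER_HOUR: f64 = 3_600.0;
const SECONDS_PER_DAY: f64 = 86_400.0;

/// Wall time in hours to process `tokens` at `tokens_per_sec`.
pub fn wall_time_hours(tokens: f64, tokens_per_sec: f64) -> f64 {
    tokens / tokens_per_sec / SECONDS_PER_HOUR
}

/// Wall time in days to process `tokens` at `tokens_per_sec`.
pub fn wall_time_days(tokens: f64, tokens_per_sec: f64) -> f64 {
    tokens / tokens_per_sec / SECONDS_PER_DAY
}
```

For example, `wall_time_days(500e6, 500.0)` is about 11.6 days (the slow end of LoRA training) and `wall_time_hours(10e6, 1000.0)` is about 2.8 hours (the fast end of router repair), which is consistent with the ranges in the table once some scheduling slack is added.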
1144+
1145+
**Expected quality improvement:**
1146+
1147+
| Benchmark | Phase 0 PTQ | Phase 0.5 RLM | Phase 1 Distill | FP16 Baseline |
1148+
|-----------|------------|--------------|----------------|---------------|
1149+
| HumanEval pass@1 | ~35-45% | **~45-55%** | ~55-60% | ~65% |
1150+
| MMLU | ~45-55% | **~55-65%** | ~65-70% | ~75% |
1151+
| SWE-bench Verified | ~25-35% | **~35-45%** | ~50-55% | 59.2% |
1152+
1153+
**The question "can I use RLM rather than traditional training" is answered YES** — with the critical caveat that RLM refinement trains the FP16 corrections around frozen ternary weights, not the ternary weights themselves. This is fundamentally different from traditional distillation but achieves meaningful quality recovery (estimated +10-15 percentage points) at zero cost.
1154+
1155+
**Reused (100%)**: `MicroLoRA`, `TrainingPipeline`, `EwcRegularizer`, `GrpoOptimizer`, `ContrastiveTrainer`, `MemoryDistiller`, `PolicyStore`, `TrainingConfig`, LR schedules, GGUF export.
1156+
**New (0%)**: No new training code. The only new code is a thin `RlmRefiner` orchestrator (~200-300 lines) that wires the existing components together for the Phase 0.5 pipeline.
1157+
10051158
---
10061159

10071160
## Consequences
@@ -1014,14 +1167,16 @@ pub struct PtBitnetConfig {
10141167
4. **Multiplication-free expert GEMM**: Integer addition only in expert forward passes
10151168
5. **SONA compatibility**: MicroLoRA adaptation preserves per-session learning
10161169
6. **GGUF ecosystem**: Compatible with existing model distribution infrastructure
1017-
7. **Incremental path**: Phase 0 validates at ~$100; Phase 1 delivers quality; Phases 2-3 optimize
1170+
7. **Incremental path**: Phase 0 ($0) validates pipeline; Phase 0.5 ($0) adds RLM quality boost; Phase 1 ($1,300-$2,000) delivers production quality; Phases 2-3 optimize
10181171
8. **~70% RLM code reuse**: GRPO, EWC++, ContrastiveTrainer, MemoryDistiller, PolicyStore are production-tested — only BitLinear layer and orchestrator are net-new
10191172
9. **Adaptive distillation**: GRPO reward scaling dynamically focuses compute on hard-to-distill experts
10201173
10. **Cross-expert stability**: EWC++ Fisher diagonal prevents catastrophic forgetting during sequential expert distillation
10211174
11. **Learned quantization policies**: PolicyStore persists per-layer ternary scale distributions for reproducible future distillation runs
10221175
12. **Expert-parallel distillation**: Independent expert FFNs enable rayon-parallel distillation across CPU cores
10231176
13. **Phase 0 de-risks Phase 1 at zero cost**: Mac Studio PTQ prototype validates entire inference pipeline (GGUF → dequant → kernel → MoE → generation) for $0 before committing $1,300+ to cloud GPU distillation
10241177
14. **Existing GGUF ecosystem**: Community-published GLM-4.7-Flash GGUFs (bartowski, unsloth) available as comparison baselines
1178+
15. **Phase 0.5 RLM refinement at $0**: Existing MicroLoRA + GRPO + EWC++ + ContrastiveTrainer stack provides ~10-15 percentage point quality recovery over raw PTQ with zero new training code, running entirely on Mac Studio
1179+
16. **100% RLM reuse for Phase 0.5**: No new training infrastructure needed — all 7 RLM components are production-tested and wire together directly
10251180

10261181
### Negative
10271182

@@ -1064,6 +1219,19 @@ pub struct PtBitnetConfig {
10641219
- [ ] Baseline quality benchmarks recorded (HumanEval, MMLU) as Phase 1 improvement target
10651220
- [ ] Total Phase 0 cost = $0 (local Mac Studio execution)
10661221

1222+
### Phase 0.5 Exit Criteria
1223+
- [ ] MicroLoRA adapters (rank-2) attached to all expert FFN layers
1224+
- [ ] Router fine-tuning via ContrastiveTrainer restores >=90% routing accuracy vs teacher
1225+
- [ ] GRPO reward signal shows positive quality improvement over Phase 0 baseline
1226+
- [ ] EWC++ prevents router fix from degrading already-correct routing paths (Fisher delta < 5%)
1227+
- [ ] HumanEval pass@1 >= 45% (up from Phase 0 baseline of ~35-45%)
1228+
- [ ] MicroLoRA + ternary inference produces coherent code completions
1229+
- [ ] Training completes on Mac Studio within 14 days
1230+
- [ ] MemoryDistiller has extracted KeyLessons identifying worst-degraded experts
1231+
- [ ] PolicyStore contains optimized TernaryScale entries for all refined layers
1232+
- [ ] Total Phase 0.5 cost = $0 (local Mac Studio execution)
1233+
- [ ] GGUF re-exported with optimized router, scale factors, and LoRA adapter weights
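The ">=90% routing accuracy vs teacher" criterion needs a concrete metric. The ADR does not define one, so the sketch below is one possible reading, stated as an assumption: the fraction of tokens for which the student router selects the same top-k expert set as the teacher router. Function names and the logits layout (one `Vec<f32>` of expert logits per token) are hypothetical.

```rust
use std::cmp::Ordering;

/// Indices of the k largest logits, returned in ascending index order.
/// Ties are broken in favour of the lower expert index (stable sort).
pub fn top_k_experts(logits: &[f32], k: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..logits.len()).collect();
    idx.sort_by(|&a, &b| logits[b].partial_cmp(&logits[a]).unwrap_or(Ordering::Equal));
    idx.truncate(k);
    idx.sort_unstable();
    idx
}

/// Fraction of tokens whose student top-k expert set equals the teacher's.
/// Returns 0.0 when there are no tokens to compare.
pub fn routing_agreement(teacher: &[Vec<f32>], student: &[Vec<f32>], k: usize) -> f32 {
    let n = teacher.len().min(student.len());
    if n == 0 {
        return 0.0;
    }
    let mut matches = 0usize;
    for (t, s) in teacher.iter().zip(student.iter()) {
        if top_k_experts(t, k) == top_k_experts(s, k) {
            matches += 1;
        }
    }
    matches as f32 / n as f32
}
```

Exact set equality is a strict choice. A softer variant could count the overlap between the two top-k sets per token, which would score partial agreement instead of treating it as a miss.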
1234+
10671235
### Phase 1 Exit Criteria
10681236
- [ ] BitNet backend loads GGUF with ternary expert weights
10691237
- [ ] TL1 kernel produces bit-exact output vs reference float implementation
