docs: Integrate RLM training stack into Craftsman Ultra ADR/DDD
Add Phase 0.5: RLM Post-Quantization Refinement — a $0 Mac Studio
approach that uses the existing RLM stack (MicroLoRA, GRPO, EWC++,
ContrastiveTrainer, MemoryDistiller, PolicyStore) to refine the
Phase 0 PTQ model by training only FP16 components (~1-2% of params).
ADR-017 changes:
- Added Phase 0.5 to phased decision: A(0C) → RLM Refinement → D → C → B
- Added AD-19: RLM Post-Quantization Refinement architecture
- Frozen ternary weights + trainable FP16 (LoRA, router, scales)
- ~200-400M trainable params (1-2% of 30B), 100-500M training tokens
- 100% RLM code reuse, 0% new training code
- 2-12 days on Mac Studio Metal, $0 cost
- Expected quality: ~70-80% of FP16 (up from 55-65% Phase 0 PTQ)
- Full pipeline diagram: Router repair → MicroLoRA injection → Scale opt
- Memory budget analysis: ~12-20 GB active RAM (fits any Mac Studio)
- Training schedule: 3-14 days total wall time
- Added Phase 0.5 exit criteria (11 items)
- Updated infrastructure table with Phase 0.5 row
- Updated consequences with RLM refinement benefits
DDD v2.2 changes:
- Added Section 3.8.1: Phase 0.5 RLM Refinement Mode
- Added 5 ubiquitous language terms (including RLM Refinement, Frozen
Ternary, LoRA Correction, Router Repair)
- Added 3 open questions (LoRA rank, GGUF persistence, Phase continuity)
Key insight: RLM trains ~1% of parameters → needs ~0.05-0.25% of the data
(100-500M vs 200B tokens) → Mac Studio Metal is sufficient → $0 cost.
https://claude.ai/code/session_011nTcGcn49b8YKJRVoh4TaK
docs/adr/ADR-017-craftsman-ultra-30b-1bit-bitnet-integration.md: 172 additions & 4 deletions
@@ -364,7 +364,7 @@ Keep GLM-4.7-Flash structure but replace only the expert MLP layers with BitLine
## Decision

-**Phased approach: A(0C) → D → C → B**
+**Phased approach: A(0C) → RLM Refinement → D → C → B**

### Phase 0: PTQ Rapid Prototype (Option A, Sub-option 0C)
- **Timeline**: 1-2 weeks
@@ -382,13 +382,37 @@ Keep GLM-4.7-Flash structure but replace only the expert MLP layers with BitLine
- **Why Mac Studio works**: Phase 0 is PTQ with no training loop: load FP16 weights via mmap, compute absmean per block, round to ternary, and export. The absmean computation is trivial math; the bottleneck is memory bandwidth, not compute. The calibration forward pass uses Metal GPU acceleration via the existing Candle integration.
- **Optional upgrade (0D)**: If 0C quality is too low for meaningful testing, apply BitDistill Lite (10B tokens, ~$300 cloud or ~$0 on Mac Studio over several weeks) to reach ~90-95% quality
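
The "compute absmean per block, round to ternary" step described for Phase 0 can be sketched as follows. This is a minimal illustration in plain Python, not the project's code: the function names, the `eps` guard, and the round-then-clip rule are assumptions modelled on the usual absmean ternary scheme, and the block values are made up.

```python
def absmean_ternary(block, eps=1e-8):
    """Quantize one block of FP weights to codes in {-1, 0, +1} plus one scale.

    Hypothetical helper: scale is the mean absolute value of the block, and
    each weight is divided by it, rounded, then clipped to the ternary range.
    """
    scale = sum(abs(w) for w in block) / len(block)
    codes = [max(-1, min(1, round(w / (scale + eps)))) for w in block]
    return codes, scale


def dequantize(codes, scale):
    """Reconstruct approximate FP weights from ternary codes and the block scale."""
    return [c * scale for c in codes]


block = [0.9, -0.05, -1.1, 0.4]          # illustrative weights only
codes, scale = absmean_ternary(block)
approx = dequantize(codes, scale)
print(codes, scale, approx)
```

Every weight in the block collapses to `-scale`, `0`, or `+scale`, which is why a single per-block scale is a coarse approximation and a natural target for later refinement.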

+### Phase 0.5: RLM Post-Quantization Refinement (NEW: Mac Studio, $0)
+- **Timeline**: 1-3 weeks (overlaps with Phase 0 kernel development)
+- **Cost**: **$0** (runs on Mac Studio, ~2-12 days training wall time)
+- **Platform**: Mac Studio (same as Phase 0)
+- **Goal**: Improve Phase 0 PTQ quality from ~55-65% to ~70-80% by training only the small FP16 components using the existing RLM stack, with **no traditional distillation, no cloud GPU**
+- **Approach**: Freeze ternary weights, train FP16 corrections using RLM components:
+  1. **MicroLoRA adapters** (rank 1-2) on each expert FFN: adds a small FP16 correction, `Y = BitLinear(X) + LoRA_B @ LoRA_A @ X`
+  2. **Router fine-tuning** via ContrastiveTrainer: corrects misrouting caused by PTQ weight changes
+  3. **Scale factor optimization** via GRPO rewards: per-block FP16 absmean scales are differentiable
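
The MicroLoRA correction `Y = BitLinear(X) + LoRA_B @ LoRA_A @ X` from the approach list above can be illustrated with a small sketch. It assumes a single input vector, a single per-layer scale, and zero-initialised `LoRA_B` (the conventional LoRA starting point); the helper names and all numbers are hypothetical and do not come from the RLM stack.

```python
def matvec(m, v):
    """Plain matrix-vector product for lists of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def bitlinear(codes, scale, x):
    """Frozen ternary layer: weights are scale * codes, with codes in {-1, 0, +1}."""
    return [scale * y for y in matvec(codes, x)]


def refined_forward(codes, scale, lora_a, lora_b, x):
    """Y = BitLinear(X) + LoRA_B @ (LoRA_A @ X); only lora_a and lora_b would train."""
    base = bitlinear(codes, scale, x)
    correction = matvec(lora_b, matvec(lora_a, x))
    return [b + c for b, c in zip(base, correction)]


codes = [[1, 0, -1], [0, 1, 1]]   # 2x3 frozen ternary weights (illustrative)
scale = 0.5                        # FP16 scale in the real model, float here
lora_a = [[0.1, 0.0, 0.2]]         # rank-1 down-projection, shape (1, 3)
x = [1.0, 2.0, 3.0]

# With LoRA_B at zero the adapter is a no-op, so refinement starts from the PTQ output.
y_start = refined_forward(codes, scale, lora_a, [[0.0], [0.0]], x)

# After some hypothetical training, LoRA_B is non-zero and shifts the output.
y_tuned = refined_forward(codes, scale, lora_a, [[1.0], [-1.0]], x)
print(y_start, y_tuned)
```

A rank-r adapter on a layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` trainable values, which is why rank 1-2 keeps the trainable share small.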
- Quality benchmarks showing improvement over Phase 0 baseline
+- **Expected quality**: **~70-80% of GLM-4.7-Flash** (up from ~55-65% Phase 0 PTQ)
+- **Key value**: Gets a usable model on Mac Studio at $0 before committing to cloud GPU. If 70-80% quality is sufficient for the use case, Phase 1 cloud distillation may be deferred or skipped entirely.
+- **100% RLM code reuse**: MicroLoRA, TrainingPipeline, EwcRegularizer, GrpoOptimizer, ContrastiveTrainer, MemoryDistiller, PolicyStore are all production-tested; zero new training code needed
| **Phase 0 (PTQ)** | **Mac Studio (M4 Max/M3 Ultra)** | **1-4 hours** | **$0** | **Mmap FP16 weights → absmean quantize → export GGUF; Metal GPU for calibration pass** |
| Phase 0D (BitDistill Lite, 10B tok) | Mac Studio Metal or 1× A100 spot | 2-4 weeks (local) / 1-2 days (cloud) | $0 (local) / ~$300 (cloud) | Optional quality upgrade if Phase 0C too degraded |
+| **Phase 0.5 (RLM refinement, 100-500M tok)** | **Mac Studio (Metal)** | **3-14 days** | **$0** | **MicroLoRA + router fix + scale opt using existing RLM stack** |
| Phase 1 (expert FFN, 200B tok) | 4× A100 80GB spot (GCP) | ~46 days | $1,300-$2,000 | Per-expert sequential with EWC++; each expert fits 1 GPU |
| Phase 1 (router validation) | Mac Studio Metal or 1× A100 | ~2-4 hours | $0 (local) / <$10 (cloud) | Contrastive training on router only (~2B params) |
| Phase 2 (full ternary, 500B tok) | 4× H100 (DataCrunch) | ~16-32 days | $2,500-$5,000 | All layers; model-parallel across GPUs |
### AD-19: Phase 0.5 — RLM Post-Quantization Refinement (No Traditional Training)
+
+**Decision**: Use the existing RLM training stack to refine the Phase 0 PTQ model on Mac Studio by training only the small FP16 components (~1-2% of parameters), freezing ternary weights. This replaces traditional distillation for the rapid prototype phase.
+
+**Rationale**: Traditional knowledge distillation (Phase 1) requires shadow weights, a straight-through estimator, and GPU-scale compute to modify the ternary weights themselves. However, the Phase 0 PTQ model already has ternary weights; the quality loss comes from:
+1. Sub-optimal per-block scale factors (absmean is a rough approximation)
|**Total active RAM**|**~12-20 GB**|**Fits in any Mac Studio config**|
+
+**Key insight**: The teacher model is only needed for the forward pass (no gradients), so it can be mmap'd and demand-paged. The ternary student is similarly mmap'd. Only the ~400M trainable parameters and their optimizer state need to be fully in RAM (~2 GB), which fits comfortably in even the 36GB M4 Max.
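
As a rough sanity check on the resident-memory figure above, the arithmetic can be written out. The helper name and the use of 1 GB = 10^9 bytes are assumptions, and the diff does not say which optimizer or precision backs the ~2 GB estimate, so the sketch only derives what that figure implies per parameter.

```python
def state_gb(n_params, bytes_per_param):
    """Resident size in GB (10**9 bytes) for n_params at a given per-parameter cost."""
    return n_params * bytes_per_param / 1e9


n_trainable = 400_000_000                  # "~400M trainable parameters" from the text

weights_fp16_gb = state_gb(n_trainable, 2)  # FP16 weights alone: 2 bytes each
implied_bytes_per_param = 2e9 / n_trainable # what a ~2 GB total works out to per parameter

print(weights_fp16_gb, implied_bytes_per_param)
```

Under these assumptions the FP16 weights alone take about 0.8 GB, and a ~2 GB total corresponds to roughly 5 bytes per trainable parameter across weights, gradients, and optimizer state combined.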
**The question "can I use RLM rather than traditional training" is answered YES** — with the critical caveat that RLM refinement trains the FP16 corrections around frozen ternary weights, not the ternary weights themselves. This is fundamentally different from traditional distillation but achieves meaningful quality recovery (estimated +10-15 percentage points) at zero cost.
**New (0%)**: No new training code. The only new code is a thin `RlmRefiner` orchestrator (~200-300 lines) that wires the existing components together for the Phase 0.5 pipeline.
8. **~70% RLM code reuse**: GRPO, EWC++, ContrastiveTrainer, MemoryDistiller, PolicyStore are production-tested; only BitLinear layer and orchestrator are net-new
12. **Expert-parallel distillation**: Independent expert FFNs enable rayon-parallel distillation across CPU cores
13. **Phase 0 de-risks Phase 1 at zero cost**: Mac Studio PTQ prototype validates entire inference pipeline (GGUF → dequant → kernel → MoE → generation) for $0 before committing $1,300+ to cloud GPU distillation
14. **Existing GGUF ecosystem**: Community-published GLM-4.7-Flash GGUFs (bartowski, unsloth) available as comparison baselines
+15. **Phase 0.5 RLM refinement at $0**: Existing MicroLoRA + GRPO + EWC++ + ContrastiveTrainer stack provides ~10-15 percentage point quality recovery over raw PTQ with zero new training code, running entirely on Mac Studio
+16. **100% RLM reuse for Phase 0.5**: No new training infrastructure needed; all 7 RLM components are production-tested and wire together directly