Commit 3d77866
feat: Integrate ExpertPredictor prefetch, CompressedMlaCache, and E2E tests

- Wire ExpertPredictor into MoE forward path: predicts likely-next experts
  from routing history and issues software prefetch hints (volatile read of
  first cache line of predicted expert gate_proj weights) before routing runs
- Rebuild predictor every 16 tokens from routing history (amortized cost)
- Fix routing history tracking to target first MoE layer (config.first_k_dense_replace)
  instead of hardcoded layer_idx==0 (layer 0 is Dense in GLM-4.7-Flash)
- Integrate CompressedMlaCache as configurable mode (set_compressed_kv):
  stores only c_kv + k_pe (576 dims) instead of full K/V (10240 dims) per
  position (~17.8x memory reduction), recomputing K_nope and V during attention
- Add mla_caches field initialized per-layer in load_gguf(), cleared in reset_cache()
- Add 13 new tests (216 total, all passing):
  - E2E: forward produces logits, forward_token with KV cache, determinism,
    different tokens give different logits, expert predictor builds from inference,
    cache reset, compressed KV toggle, scratch pool allocation
  - Benchmarks: forward_token throughput, TL1 GEMV dispatch, RMSNorm, softmax,
    expert_forward performance
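
The prefetch hint described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: `ExpertWeights` and `prefetch_experts` are hypothetical names, the weights are modeled as a plain byte buffer, and the hint is a single volatile load of the first element of each predicted expert's gate_proj buffer. A volatile read is an ordinary load that the compiler may not elide, so it brings the first cache line in as a side effect rather than through a dedicated prefetch instruction.

```rust
use std::ptr;

/// Hypothetical stand-in for one expert's gate_proj weights.
struct ExpertWeights {
    gate_proj: Vec<u8>,
}

/// Touch the first byte of each predicted expert's gate_proj buffer.
/// Returns how many experts were actually touched.
fn prefetch_experts(experts: &[ExpertWeights], predicted: &[usize]) -> usize {
    let mut touched = 0;
    for &idx in predicted {
        // Skip out-of-range predictions and empty buffers.
        if let Some(first) = experts.get(idx).and_then(|e| e.gate_proj.first()) {
            // SAFETY: `first` is a valid, aligned reference into a live Vec.
            // The volatile read keeps the compiler from dropping the unused load.
            let _ = unsafe { ptr::read_volatile(first as *const u8) };
            touched += 1;
        }
    }
    touched
}

fn main() {
    // Four hypothetical experts with 64-byte weight buffers.
    let experts: Vec<ExpertWeights> = (0..4)
        .map(|i| ExpertWeights { gate_proj: vec![i as u8; 64] })
        .collect();
    // Index 9 is out of range and is ignored.
    let touched = prefetch_experts(&experts, &[1, 3, 9]);
    println!("touched {touched} experts");
}
```

Because the hint is only a read, a wrong prediction costs one wasted load and does not affect the routing result.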

https://claude.ai/code/session_011nTcGcn49b8YKJRVoh4TaK1

parent fc43023, commit 3d77866
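
The compressed KV figures in the commit message (576 dims versus 10240 dims per position) can be checked with a short calculation. The sketch below uses only those two numbers from the message; the helper names, the 4096-position context, and the 4-byte element size are assumptions chosen for illustration, not values taken from the code.

```rust
/// Per-position cache widths, in dimensions, as stated in the commit message.
const COMPRESSED_DIMS: usize = 576; // c_kv + k_pe
const FULL_DIMS: usize = 10240; // full K/V

/// Ratio of full cache width to compressed cache width.
fn reduction_factor(full: usize, compressed: usize) -> f64 {
    full as f64 / compressed as f64
}

/// Cache size in bytes for one layer (hypothetical helper).
fn cache_bytes(dims: usize, positions: usize, bytes_per_elem: usize) -> usize {
    dims * positions * bytes_per_elem
}

fn main() {
    let ratio = reduction_factor(FULL_DIMS, COMPRESSED_DIMS);
    println!("reduction: {ratio:.1}x");

    // Assumed workload: 4096 cached positions, 4-byte elements, one layer.
    let full = cache_bytes(FULL_DIMS, 4096, 4);
    let compressed = cache_bytes(COMPRESSED_DIMS, 4096, 4);
    println!("per layer: {full} bytes full, {compressed} bytes compressed");
}
```

The ratio 10240 / 576 is about 17.78, which matches the ~17.8x quoted in the message. The saving is paid for by recomputing K_nope and V during attention, as the message notes.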
1 file changed
Lines changed: 599 additions & 100 deletions