Commit 3d77866

feat: Integrate ExpertPredictor prefetch, CompressedMlaCache, and E2E tests
- Wire ExpertPredictor into MoE forward path: predicts likely-next experts from routing history and issues software prefetch hints (volatile read of first cache line of predicted expert gate_proj weights) before routing runs
- Rebuild predictor every 16 tokens from routing history (amortized cost)
- Fix routing history tracking to target first MoE layer (config.first_k_dense_replace) instead of hardcoded layer_idx==0 (layer 0 is Dense in GLM-4.7-Flash)
- Integrate CompressedMlaCache as configurable mode (set_compressed_kv): stores only c_kv + k_pe (576 dims) instead of full K/V (10240 dims) per position (~17.8x memory reduction), recomputing K_nope and V during attention
- Add mla_caches field initialized per-layer in load_gguf(), cleared in reset_cache()
- Add 13 new tests (216 total, all passing):
  - E2E: forward produces logits, forward_token with KV cache, determinism, different tokens give different logits, expert predictor builds from inference, cache reset, compressed KV toggle, scratch pool allocation
  - Benchmarks: forward_token throughput, TL1 GEMV dispatch, RMSNorm, softmax, expert_forward performance

https://claude.ai/code/session_011nTcGcn49b8YKJRVoh4TaK
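
The two numeric claims in the message (a volatile read used as a prefetch hint, and the ~17.8x cache reduction) can be illustrated with a minimal Rust sketch. This is not the code from this commit: `prefetch_hint` and `compression_ratio` are hypothetical names, and the `u8` weight element type is an assumption. Only the 576 and 10240 dimension counts come from the message above.

```rust
use std::ptr;

/// Hypothetical helper (not from the commit): touch the first element of a
/// weight slice with a volatile read. The volatile access keeps the compiler
/// from eliding the load, so the containing cache line is pulled in early.
fn prefetch_hint(weights: &[u8]) -> u8 {
    match weights.first() {
        // SAFETY: `first` is a valid, aligned reference into `weights`.
        Some(first) => unsafe { ptr::read_volatile(first as *const u8) },
        None => 0,
    }
}

/// Ratio of cached dimensions per position: full K/V versus c_kv + k_pe.
fn compression_ratio(full_dims: usize, compressed_dims: usize) -> f64 {
    full_dims as f64 / compressed_dims as f64
}

fn main() {
    // Stand-in for one expert's gate_proj weights (assumed layout).
    let gate_proj = vec![7u8; 256];
    let touched = prefetch_hint(&gate_proj);

    // Dimension counts as stated in the commit message.
    let ratio = compression_ratio(10240, 576);
    println!("touched first byte = {touched}, ratio = {ratio:.1}x");
}
```

A volatile read is a demand load rather than a dedicated prefetch instruction, so it stalls if the line is not yet cached. The sketch only mirrors the mechanism the message describes and does not measure whether it helps.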
1 parent fc43023 commit 3d77866

1 file changed

Lines changed: 599 additions & 100 deletions
