Commit 98cc1af
feat: Add streaming generation, predictive expert prefetcher, and compressed MLA KV cache
- Streaming generation API (generate_streaming) with a per-token callback,
  early stopping, and GenerationStats for throughput metrics
- ExpertPredictor: transition-matrix-based predictor that learns from
  routing history and predicts the next experts, using Laplace smoothing
- CompressedMlaCache: stores compressed latents (c_kv + k_pe) instead
  of full K/V, achieving ~17.8x memory reduction for GLM-4.7-Flash
- 15 new tests (203 total bitnet tests, all passing)
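
The ExpertPredictor bullet describes a transition matrix learned from routing history with Laplace smoothing. A minimal Python sketch of that idea follows. It is not the commit's code: the class layout, the method names (`observe`, `probabilities`, `predict`), and the `alpha` parameter are all hypothetical, and the sketch assumes a first-order model in which each expert active at one step votes for the experts likely to follow it.

```python
class ExpertPredictor:
    """Hypothetical first-order transition-count predictor over expert ids."""

    def __init__(self, num_experts, alpha=1.0):
        self.num_experts = num_experts
        self.alpha = alpha  # Laplace (add-alpha) smoothing constant
        # counts[prev][nxt]: times expert `nxt` was routed right after `prev`
        self.counts = [[0] * num_experts for _ in range(num_experts)]

    def observe(self, prev_experts, next_experts):
        """Record one routing step: every prev -> next pair gets one count."""
        for p in prev_experts:
            for n in next_experts:
                self.counts[p][n] += 1

    def probabilities(self, prev_expert):
        """Smoothed P(next | prev); unseen transitions keep non-zero mass."""
        row = self.counts[prev_expert]
        total = sum(row) + self.alpha * self.num_experts
        return [(c + self.alpha) / total for c in row]

    def predict(self, current_experts, top_k=2):
        """Sum smoothed rows of the active experts and return the top_k ids."""
        scores = [0.0] * self.num_experts
        for p in current_experts:
            for e, prob in enumerate(self.probabilities(p)):
                scores[e] += prob
        # Ties are broken by the lower expert id so results are deterministic.
        ranked = sorted(range(self.num_experts), key=lambda e: (-scores[e], e))
        return ranked[:top_k]


# Toy routing history with one active expert per step (illustrative data only).
predictor = ExpertPredictor(num_experts=4)
history = [[0], [1], [0], [1], [0], [2]]
for prev, nxt in zip(history, history[1:]):
    predictor.observe(prev, nxt)

candidates = predictor.predict([0], top_k=2)  # experts to prefetch after expert 0
```

With smoothing, an expert that has never appeared in the history still gets a uniform distribution rather than a division by zero, which is the usual reason to add the pseudo-counts.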
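
The ~17.8x figure in the CompressedMlaCache bullet can be sanity-checked with per-token arithmetic: a full cache stores K and V for every head, while the compressed cache stores one shared latent (c_kv) plus one rotary key (k_pe). The sketch below does that arithmetic. The dimensions are assumptions chosen for illustration, not values taken from the commit or confirmed for GLM-4.7-Flash, and the function names are hypothetical.

```python
def full_kv_elements(num_heads, qk_head_dim, v_head_dim):
    """Elements cached per token per layer when full K and V are stored."""
    return num_heads * (qk_head_dim + v_head_dim)


def compressed_kv_elements(kv_lora_rank, rope_dim):
    """Elements cached per token per layer when only c_kv and k_pe are stored."""
    return kv_lora_rank + rope_dim


# Assumed, illustrative dimensions (not confirmed by the source).
NUM_HEADS = 20
QK_NOPE_DIM = 192    # non-rotary part of each key head
QK_ROPE_DIM = 64     # rotary part, shared across heads as k_pe
V_HEAD_DIM = 256
KV_LORA_RANK = 512   # width of the compressed latent c_kv

full = full_kv_elements(NUM_HEADS, QK_NOPE_DIM + QK_ROPE_DIM, V_HEAD_DIM)
compressed = compressed_kv_elements(KV_LORA_RANK, QK_ROPE_DIM)
ratio = full / compressed

print(f"full={full}, compressed={compressed}, ratio={ratio:.1f}x")
```

Under these assumed dimensions the ratio is 10240 / 576, which rounds to 17.8. The ratio does not depend on element width as long as both caches use the same dtype.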
https://claude.ai/code/session_011nTcGcn49b8YKJRVoh4TaK1
parent 8093376, commit 98cc1af
2 files changed
Lines changed: 548 additions & 2 deletions