Commit 98cc1af

feat: Add streaming generation, predictive expert prefetcher, and compressed MLA KV cache
- Streaming generation API (generate_streaming) with per-token callback, early stopping, and GenerationStats for throughput metrics
- ExpertPredictor: transition-matrix based predictor that learns from routing history to predict next experts with Laplace smoothing
- CompressedMlaCache: stores compressed latents (c_kv + k_pe) instead of full K/V, achieving ~17.8x memory reduction for GLM-4.7-Flash
- 15 new tests (203 total bitnet tests, all passing)

https://claude.ai/code/session_011nTcGcn49b8YKJRVoh4TaK
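
For readers unfamiliar with the idea behind the ExpertPredictor item, here is a minimal sketch of a transition-count predictor with Laplace smoothing. It is written in Python for brevity and is not the code from this commit: the class name `TransitionPredictor`, its methods, the first-order (previous step to next step) model, and the smoothing constant `alpha = 1.0` are all assumptions made for illustration.

```python
class TransitionPredictor:
    """Hypothetical sketch: counts how often expert `n` follows expert `p`."""

    def __init__(self, num_experts: int, alpha: float = 1.0):
        self.num_experts = num_experts
        self.alpha = alpha  # Laplace smoothing constant (assumed value)
        # counts[p][n] = number of observed transitions from expert p to expert n
        self.counts = [[0] * num_experts for _ in range(num_experts)]

    def observe(self, prev_experts, next_experts):
        """Record one routing step: every previous expert 'votes' for every next expert."""
        for p in prev_experts:
            for n in next_experts:
                self.counts[p][n] += 1

    def probability(self, prev: int, nxt: int) -> float:
        """Smoothed estimate of P(next = nxt | previous = prev)."""
        row = self.counts[prev]
        return (row[nxt] + self.alpha) / (sum(row) + self.alpha * self.num_experts)

    def predict(self, current_experts, top_k: int):
        """Return the top_k expert ids most likely to be routed to next."""
        scores = [0.0] * self.num_experts
        for p in current_experts:
            for n in range(self.num_experts):
                scores[n] += self.probability(p, n)
        # Sort by descending score; break ties by lower expert id.
        ranked = sorted(range(self.num_experts), key=lambda n: (-scores[n], n))
        return ranked[:top_k]


# Usage: learn from a short, made-up routing history over 4 experts.
predictor = TransitionPredictor(num_experts=4)
history = [[0], [1], [0], [1], [0], [2]]
for prev, nxt in zip(history, history[1:]):
    predictor.observe(prev, nxt)

prefetch = predictor.predict([0], top_k=2)  # -> [1, 2]
```

Smoothing keeps never-observed transitions at a small nonzero probability, so a prefetcher built on such a predictor would not permanently rule out experts it has not yet seen follow the current one.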
1 parent 8093376 commit 98cc1af

2 files changed

Lines changed: 548 additions & 2 deletions
