Highlights
1. DeepSeek V4 Flash-oriented integration: HMA + FAWA
- HMA with FAWA: Strengthens HMA for large-model deployments (including DeepSeek V4 Flash–class setups). #953
2. Performance: Layerwise & Cache Store
- Layerwise KV load patch (vLLM-Ascend v0.18.0 / DSA-oriented paths). #911
- Pipeline-friendly shard task submission in CacheStore. #888
- Buffer reservation to mitigate load starvation for high-priority traffic. #895
- Full-graph replay for GSA sparse attention scenarios. #907
- Backend-only KV cache load mode. #951
- KV cache storage threshold and related storage controls. #925
3. Compress Store
- UCM store compression module & compression config. #940
- Compress Store logging & UX improvements. #949
4. SGLang
- SGLang + UCM integration via dynamic backend loading. #886
- SGLang quickstart docs & Dockerfile updates. #891
5. Other (stability / observability / tooling / engineering)
- Profiling switch for performance testing. #875
- DP fix: non-DP0 ranks now update file hotness correctly. #884
- Connector metadata assert fix. #897
- Duplicate metrics registration fix. #901
- Hit-rate recording and follow-ups (1TP, partial-hit dump, etc.). #917 · #926 · #927
- Trace generation support. #912
- KV cache calculator: new model support. #899
- TransBuffer / dump edge cases (reservation & ring bounds). #924
- KVCacheLayout row-count fix. #920
- MTP layer duplicate wait_for_layer_load fix. #932
- Stricter save/load exception handling & cleanup. #929
- vLLM 0.18.x–related patches (e.g. finalize_kv_cache, local_cache_hit). #934 · #941
- Multi-node DP / multi-process GC fix. #937
- Docs: distributed PD disaggregation on Ascend. #921
- CI / packaging: Dockerfile refactor, metrics config in package, long e2e removed from PR gate. #902 · #916 · #947
- Logging rate-limit switch, TPOT test + config updates. #931 · #944
- Docs: quickstart & supported compute platforms. #890 · #896
- Release housekeeping: 0.5.0rc2 & bump to 0.5.0. #915 · #956
- Packaging (__init__.py, etc.). #935 · #943
Full Changelog: v0.5.0rc1...v0.5.0
What's Changed
- [Feat]: add profiling switch for performance test by @harrisonyhq in #875
- [BugFix] Fix the bug where non-DP0 processes fail to update file hotness in DP scenarios. by @UESTC-AHao in #884
- [Perf] Pipeline-friendly shard task submission in CacheStore by @mag1c-h in #888
- [doc] Update quickstart and calculator config by @yuanzhg078 in #890
- [doc] Update supported compute platforms by @yuanzhg078 in #896
- [Bugfix] Fix the bug of connector metadata assert failure by @sumingZero in #897
- [Fix] Duplicate metrics registration by @flesher0813 in #901
- [opt] Prevent Load starvation by reserving buffers for high-priority operations by @mag1c-h in #895
- [CI] refactor dockerfile by @dante159753 in #902
- [opt] Support full-graph replay for GSA-enabled sparse attention by @wangwenxin0312 in #907
- feat(kv_cache_calculator): Add New Model Supports by @Potterluo in #899
- [doc] Update SGLang quickstart docs & Update SGLang Dockerfile by @pyxyzc in #891
- [Feature] Add SGLang UCM integration via dynamic backend loading by @pyxyzc in #886
- [Opt] Support generate traces by @flesher0813 in #912
- release 0.5.0rc2 by @qyh111 in #915
- [CI] add metrics config into package by @dante159753 in #916
- [Feat] add layerwise KV load patch for vllm-ascend v0.18.0 on DSA model by @sumingZero in #911
- [Feat] Support recording hit rate by @flesher0813 in #917
- [Fix]TransBuffer: shared freeHead let Dump bypass the reserved limit and ring bounds by @qyh111 in #924
- [Opt] Support setting a threshold for kv_cache storage by @flesher0813 in #925
- [Fix] Not print hit rate when using 1tp by @flesher0813 in #926
- [Fix] Fix dump err when 0 < len(load_blocks) < fully hit by @flesher0813 in #927
- [Fix] Update KVCacheLayout to set row counts. by @hmy98213 in #920
- [bugfix]fix mtp_layer calls wait_for_layer_load multiple times by @qyh111 in #932
- [Fix] Strengthen save/load exception check and remove unused codes by @flesher0813 in #929
- [fix]Add patch for finalize_kv_cache in 0.18.0rc1 by @qyh111 in #934
- Add __init__.py by @qyh111 in #935
- [BugFix] Fix the issue of multiple processes enabling GC in multi-node DP scen… by @UESTC-AHao in #937
- [feat] Add switch to log rate limit by @dante159753 in #931
- [BugFix] Add patch for local_cache_hit calculation in vllm 0.18.0 met… by @sumingZero in #941
- [Doc] Distributed PD Disaggregation on Ascend by @sumingZero in #921
- [Bugfix] Add __init__.py by @sumingZero in #943
- [CI] Remove long-running e2e tests from PR gate by @mag1c-h in #947
- feat(test): Add TPOT metric calculation and update test configuration by @Potterluo in #944
- [Feat] Add UCM store compression module and compression config parameters (Feature store compress) by @fanzhust in #940
- feat: Optimize CompressStore log printing and add Compress Store user… by @Fencyee in #949
- [Feat] Add backend-only cache load mode by @dante159753 in #951
- support hma using FAWA by @wuhuxiao in #953
- switch version to 0.5.0 by @qyh111 in #956