v0.5.0

@qyh111 released this 13 May 09:35
· 20 commits to develop since this release
52b0f54

Highlights

1. DeepSeek V4 Flash-oriented integration: HMA + FAWA

  • HMA with FAWA: Strengthens HMA for large-model deployments (including DeepSeek V4 Flash–class setups). #953

2. Performance: Layerwise & Cache Store

  • Layerwise KV load patch (vLLM-Ascend v0.18.0 / DSA-oriented paths). #911
  • Pipeline-friendly shard task submission in CacheStore. #888
  • Buffer reservation to mitigate load starvation for high-priority traffic. #895
  • Full-graph replay for GSA sparse attention scenarios. #907
  • Backend-only KV cache load mode. #951
  • KV cache storage threshold and related storage controls. #925
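The buffer-reservation item (#895) can be pictured with a minimal sketch. This is not UCM's implementation: the class name `ReservedBufferPool`, its fields, and the `try_acquire` / `release` methods are hypothetical and introduced only for illustration. The sketch assumes that loads are the high-priority operation and that lower-priority work (such as dumps) may never dip into the reserved buffers.

```python
from dataclasses import dataclass


@dataclass
class ReservedBufferPool:
    """Hypothetical pool that holds back `reserved` buffers for high-priority ops."""

    total: int       # total number of buffers in the pool
    reserved: int    # buffers only high-priority callers may take
    in_use: int = 0  # buffers currently handed out

    def try_acquire(self, high_priority: bool) -> bool:
        """Take one buffer if allowed; return False instead of blocking."""
        free = self.total - self.in_use
        # Low-priority callers must leave the reserve untouched,
        # so they need strictly more free buffers than the reserve.
        floor = 0 if high_priority else self.reserved
        if free > floor:
            self.in_use += 1
            return True
        return False

    def release(self) -> None:
        """Return one buffer to the pool."""
        if self.in_use == 0:
            raise RuntimeError("release() without a matching acquire")
        self.in_use -= 1


pool = ReservedBufferPool(total=4, reserved=1)
dump_results = [pool.try_acquire(high_priority=False) for _ in range(4)]
load_result = pool.try_acquire(high_priority=True)
```

In this sketch the low-priority caller is refused once only the reserved buffer remains, while the high-priority caller can still obtain it, which is the starvation-avoidance idea the item describes.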

3. Compress Store

  • UCM store compression module & compression config. #940
  • Compress Store logging & UX improvements. #949

4. SGLang

  • SGLang + UCM integration via dynamic backend loading. #886
  • SGLang quickstart docs & Dockerfile updates. #891

5. Other (stability / observability / tooling / engineering)

  • Profiling switch for performance testing. #875
  • DP fix: non-DP0 ranks updating file hotness correctly. #884
  • Connector metadata assert fix. #897
  • Duplicate metrics registration fix. #901
  • Hit-rate recording and follow-ups (1TP, partial-hit dump, etc.). #917 · #926 · #927
  • Trace generation support. #912
  • KV cache calculator: new model support. #899
  • TransBuffer / dump edge cases (reservation & ring bounds). #924
  • KVCacheLayout row-count fix. #920
  • MTP layer duplicate wait_for_layer_load fix. #932
  • Stricter save/load exception handling & cleanup. #929
  • vLLM 0.18.x–related patches (e.g. finalize_kv_cache, local_cache_hit). #934 · #941
  • Multi-node DP / multi-process GC fix. #937
  • Docs: distributed PD disaggregation on Ascend. #921
  • CI / packaging: Dockerfile refactor, metrics config in package, long e2e removed from PR gate. #902 · #916 · #947
  • Logging rate-limit switch, TPOT test + config updates. #931 · #944
  • Docs: quickstart & supported compute platforms. #890 · #896
  • Release housekeeping: 0.5.0rc2 & bump to 0.5.0. #915 · #956
  • Packaging (__init__.py, etc.). #935 · #943

Full Changelog: v0.5.0rc1...v0.5.0

What's Changed

  • [Feat]: add profiling switch for performance test by @harrisonyhq in #875
  • [BugFix] Fix the bug where non-DP0 processes fail to update file hotness in DP scenarios. by @UESTC-AHao in #884
  • [Perf] Pipeline-friendly shard task submission in CacheStore by @mag1c-h in #888
  • [doc] Update quickstart and calculator config by @yuanzhg078 in #890
  • [doc] Update supported compute platforms by @yuanzhg078 in #896
  • [Bugfix] Fix the bug of connector metadata assert failure by @sumingZero in #897
  • [Fix] Duplicate metrics registration by @flesher0813 in #901
  • [opt] Prevent Load starvation by reserving buffers for high-priority operations by @mag1c-h in #895
  • [CI] refactor dockerfile by @dante159753 in #902
  • [opt] Support full-graph replay for GSA-enabled sparse attention by @wangwenxin0312 in #907
  • feat(kv_cache_calculator): Add New Model Supports by @Potterluo in #899
  • [doc] Update SGLang quickstart docs & Update SGLang Dockerfile by @pyxyzc in #891
  • [Feature] Add SGLang UCM integration via dynamic backend loading by @pyxyzc in #886
  • [Opt] Support generate traces by @flesher0813 in #912
  • release 0.5.0rc2 by @qyh111 in #915
  • [CI] add metrics config into package by @dante159753 in #916
  • [Feat] add layerwise KV load patch for vllm-ascend v0.18.0 on DSA model by @sumingZero in #911
  • [Feat] Support recording hit rate by @flesher0813 in #917
  • [Fix]TransBuffer: shared freeHead let Dump bypass the reserved limit and ring bounds by @qyh111 in #924
  • [Opt] Support setting a threshold for kv_cache storage by @flesher0813 in #925
  • [Fix] Not print hit rate when using 1tp by @flesher0813 in #926
  • [Fix] Fix dump err when 0 < len(load_blocks) < fully hit by @flesher0813 in #927
  • [Fix] Update KVCacheLayout to set row counts. by @hmy98213 in #920
  • [bugfix]fix mtp_layer calls wait_for_layer_load multiple times by @qyh111 in #932
  • [Fix] Strengthen save/load exception check and remove unused codes by @flesher0813 in #929
  • [fix]Add patch for finalize_kv_cache in 0.18.0rc1 by @qyh111 in #934
  • Add __init__.py by @qyh111 in #935
  • [BugFix] Fix the issue of multiple processes enabling GC in multi-node DP scen… by @UESTC-AHao in #937
  • [feat] Add switch to log rate limit by @dante159753 in #931
  • [BugFix] Add patch for local_cache_hit calculation in vllm 0.18.0 met… by @sumingZero in #941
  • [Doc] Distributed PD Disaggregation on Ascend by @sumingZero in #921
  • [Bugfix] Add __init__.py by @sumingZero in #943
  • [CI] Remove long-running e2e tests from PR gate by @mag1c-h in #947
  • feat(test): Add TPOT metric calculation and update test configuration by @Potterluo in #944
  • [Feat] Add UCM store compression module and compression config parameters by @fanzhust in #940
  • feat: Optimize CompressStore log printing and add Compress Store user… by @Fencyee in #949
  • [Feat] Add backend-only cache load mode by @dante159753 in #951
  • support hma using FAWA by @wuhuxiao in #953
  • switch version to 0.5.0 by @qyh111 in #956
