Release v0.5.0 · ModelEngine-Group/unified-cache-management

Highlights

1. DeepSeek V4 Flash-oriented integration: HMA + FAWA

HMA with FAWA: Adds FAWA-based support to HMA for large-model deployments (including DeepSeek V4 Flash-class setups). #953

2. Performance: Layerwise & Cache Store

Layerwise KV load patch (vLLM-Ascend v0.18.0 / DSA-oriented paths). #911
Pipeline-friendly shard task submission in CacheStore. #888
Buffer reservation for high-priority operations to prevent load starvation. #895
Full-graph replay for GSA sparse attention scenarios. #907
Backend-only KV cache load mode. #951
Configurable threshold for KV cache storage. #925
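The buffer-reservation idea behind #895 (and the reservation/ring-bound fix in #924) can be illustrated with a small sketch. This is not UCM's CacheStore code: the class `ReservedBufferPool`, the integer buffer IDs, and the assumption that loads are high-priority while dumps are low-priority are all hypothetical. Low-priority callers may only take a buffer while more than `reserved` buffers remain free, so a burst of dumps cannot drain the last buffers a load needs.

```python
import threading
from typing import List, Optional


class ReservedBufferPool:
    """Hypothetical pool that keeps `reserved` buffers back for high-priority callers."""

    def __init__(self, capacity: int, reserved: int) -> None:
        if not 0 <= reserved <= capacity:
            raise ValueError("reserved must be in [0, capacity]")
        self._free: List[int] = list(range(capacity))  # buffer IDs currently free
        self._reserved = reserved
        self._lock = threading.Lock()

    def try_acquire(self, high_priority: bool) -> Optional[int]:
        """Return a buffer ID, or None if this priority class may not take one now."""
        with self._lock:
            # Low-priority requests (assumed: dumps) must leave `reserved` buffers free;
            # high-priority requests (assumed: loads) may use any free buffer.
            floor = 0 if high_priority else self._reserved
            if len(self._free) > floor:
                return self._free.pop()
            return None

    def release(self, buf: int) -> None:
        """Return a buffer to the pool."""
        with self._lock:
            self._free.append(buf)

    def available(self) -> int:
        """Number of free buffers, including reserved ones."""
        with self._lock:
            return len(self._free)
```

With `ReservedBufferPool(4, 1)`, three low-priority acquisitions succeed, the fourth returns `None`, and a high-priority request still gets the last buffer. Checking the floor and popping under one lock keeps concurrent low-priority callers from jointly slipping past the reserve.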

3. Compress Store

UCM store compression module & compression config. #940
Compress Store logging & UX improvements. #949

4. SGLang

SGLang + UCM integration via dynamic backend loading. #886
SGLang quickstart docs & Dockerfile updates. #891
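"Dynamic backend loading" generally means resolving an implementation from a configuration string at runtime instead of importing it statically. The sketch below shows that general pattern with the standard-library `importlib`; it is not SGLang's or UCM's actual loader, and the `module:Class` spec format and the helper names `load_backend_class` and `create_backend` are hypothetical.

```python
import importlib
from typing import Any


def load_backend_class(spec: str) -> type:
    """Resolve a hypothetical "package.module:ClassName" spec to a class object."""
    module_path, sep, attr = spec.partition(":")
    if not sep or not module_path or not attr:
        raise ValueError(f"expected 'module:Class', got {spec!r}")
    module = importlib.import_module(module_path)  # imported only when requested
    cls = getattr(module, attr)
    if not isinstance(cls, type):
        raise TypeError(f"{spec!r} does not name a class")
    return cls


def create_backend(spec: str, **kwargs: Any) -> Any:
    """Load the class named by `spec` and instantiate it with `kwargs`."""
    return load_backend_class(spec)(**kwargs)
```

For example, `create_backend("collections:OrderedDict")` returns an empty `OrderedDict`; a real integration would pass its own connector class path, so the serving engine never needs a hard import dependency on the cache backend.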

5. Other (stability / observability / tooling / engineering)

Profiling switch for performance testing. #875
DP fix: non-DP0 ranks updating file hotness correctly. #884
Connector metadata assert fix. #897
Duplicate metrics registration fix. #901
Hit-rate recording and follow-ups (no hit-rate printing with TP=1, partial-hit dump fix). #917 · #926 · #927
Trace generation support. #912
KV cache calculator: new model support. #899
TransBuffer / dump edge cases (reservation & ring bounds). #924
KVCacheLayout row-count fix. #920
MTP layer duplicate wait_for_layer_load fix. #932
Stricter save/load exception handling & cleanup. #929
vLLM 0.18.x-related patches (finalize_kv_cache, local_cache_hit calculation). #934 · #941
Multi-node DP / multi-process GC fix. #937
Docs: distributed PD disaggregation on Ascend. #921
CI / packaging: Dockerfile refactor, metrics config included in the package, long-running e2e tests removed from the PR gate. #902 · #916 · #947
Logging rate-limit switch; TPOT metric calculation and test config updates. #931 · #944
Docs: quickstart & supported compute platforms. #890 · #896
Release housekeeping: 0.5.0rc2 & bump to 0.5.0. #915 · #956
Packaging: add missing __init__.py files. #935 · #943
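#944 adds TPOT (time per output token) to the tests. A common definition, used for example by vLLM's serving benchmark, excludes the first token: TPOT = (end-to-end latency - TTFT) / (output tokens - 1). The sketch below computes it per request and averages over requests; the `RequestTiming` record and its field names are hypothetical, and UCM's test harness may define or aggregate the metric differently.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence


@dataclass
class RequestTiming:
    ttft_s: float         # time to first token, in seconds
    e2e_latency_s: float  # request start to last token, in seconds
    output_tokens: int    # number of generated tokens


def tpot(r: RequestTiming) -> Optional[float]:
    """Time per output token after the first; undefined for fewer than 2 tokens."""
    if r.output_tokens < 2:
        return None
    return (r.e2e_latency_s - r.ttft_s) / (r.output_tokens - 1)


def mean_tpot(reqs: Sequence[RequestTiming]) -> Optional[float]:
    """Average TPOT over requests where it is defined."""
    values = [v for v in (tpot(r) for r in reqs) if v is not None]
    return mean(values) if values else None
```

Dividing by `output_tokens - 1` keeps prefill time (already captured by TTFT) out of the decode metric, which is why single-token requests are skipped rather than reported as zero.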

Full Changelog: v0.5.0rc1...v0.5.0

What's Changed

[Feat]: add profiling switch for performance test by @harrisonyhq in #875
[BugFix] Fix the bug where non-DP0 processes fail to update file hotness in DP scenarios. by @UESTC-AHao in #884
[Perf] Pipeline-friendly shard task submission in CacheStore by @mag1c-h in #888
[doc] Update quickstart and calculator config by @yuanzhg078 in #890
[doc] Update supported compute platforms by @yuanzhg078 in #896
[Bugfix] Fix the bug of connector metadata assert failure by @sumingZero in #897
[Fix] Duplicate metrics registration by @flesher0813 in #901
[opt] Prevent Load starvation by reserving buffers for high-priority operations by @mag1c-h in #895
[CI] refactor dockerfile by @dante159753 in #902
[opt] Support full-graph replay for GSA-enabled sparse attention by @wangwenxin0312 in #907
feat(kv_cache_calculator): Add New Model Support by @Potterluo in #899
[doc] Update SGLang quickstart docs & Update SGLang Dockerfile by @pyxyzc in #891
[Feature] Add SGLang UCM integration via dynamic backend loading by @pyxyzc in #886
[Opt] Support generate traces by @flesher0813 in #912
release 0.5.0rc2 by @qyh111 in #915
[CI] add metrics config into package by @dante159753 in #916
[Feat] add layerwise KV load patch for vllm-ascend v0.18.0 on DSA model by @sumingZero in #911
[Feat] Support recording hit rate by @flesher0813 in #917
[Fix] TransBuffer: shared freeHead let Dump bypass the reserved limit and ring bounds by @qyh111 in #924
[Opt] Support setting a threshold for kv_cache storage by @flesher0813 in #925
[Fix] Not print hit rate when using 1tp by @flesher0813 in #926
[Fix] Fix dump err when 0 < len(load_blocks) < fully hit by @flesher0813 in #927
[Fix] Update KVCacheLayout to set row counts. by @hmy98213 in #920
[bugfix] fix mtp_layer calls wait_for_layer_load multiple times by @qyh111 in #932
[Fix] Strengthen save/load exception check and remove unused codes by @flesher0813 in #929
[fix]Add patch for finalize_kv_cache in 0.18.0rc1 by @qyh111 in #934
Add __init__.py by @qyh111 in #935
[BugFix] Fix the issue of multiple processes enabling GC in multi-node DP scen… by @UESTC-AHao in #937
[feat] Add switch to log rate limit by @dante159753 in #931
[BugFix] Add patch for local_cache_hit calculation in vllm 0.18.0 met… by @sumingZero in #941
[Doc] Distributed PD Disaggregation on Ascend by @sumingZero in #921
[Bugfix] Add __init__.py by @sumingZero in #943
[CI] Remove long-running e2e tests from PR gate by @mag1c-h in #947
feat(test): Add TPOT metric calculation and update test configuration by @Potterluo in #944
[Feat] Add UCM store compression module and compression config parameters (Feature store compress) by @fanzhust in #940
feat: Optimize CompressStore log printing and add Compress Store user… by @Fencyee in #949
[Feat] Add backend-only cache load mode by @dante159753 in #951
support hma using FAWA by @wuhuxiao in #953
switch version to 0.5.0 by @qyh111 in #956

New Contributors

@fanzhust made their first contribution in #940
@Fencyee made their first contribution in #949
