Releases: ModelEngine-Group/unified-cache-management

v0.5.0

13 May 09:35
52b0f54

Highlights

1. DeepSeek V4 Flash-oriented integration: HMA + FAWA

  • HMA with FAWA: Strengthens HMA for large-model deployments (including DeepSeek V4 Flash–class setups). #953

2. Performance: Layerwise & Cache Store

  • Layerwise KV load patch (vLLM-Ascend v0.18.0 / DSA-oriented paths). #911
  • Pipeline-friendly shard task submission in CacheStore. #888
  • Buffer reservation to mitigate load starvation for high-priority traffic. #895
  • Full-graph replay for GSA sparse attention scenarios. #907
  • Backend-only KV cache load mode. #951
  • KV cache storage threshold and related storage controls. #925
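
The buffer reservation item (#895) can be illustrated with a minimal sketch. It assumes a fixed pool of transfer buffers in which a configurable number is held back for high-priority loads, so that lower-priority dumps can never drain the pool completely. The class name `ReservedBufferPool` and its methods are hypothetical and are not taken from the UCM code base; the sketch is also single-threaded for brevity.

```python
class ReservedBufferPool:
    """Hypothetical fixed-size buffer pool with slots reserved for loads."""

    def __init__(self, total: int, reserved_for_load: int) -> None:
        if not 0 <= reserved_for_load <= total:
            raise ValueError("reserved_for_load must be within [0, total]")
        self.total = total
        self.reserved = reserved_for_load
        self.free = total

    def try_acquire(self, high_priority: bool) -> bool:
        # Low-priority callers must leave `reserved` buffers untouched;
        # high-priority callers may take the pool down to zero.
        floor = 0 if high_priority else self.reserved
        if self.free > floor:
            self.free -= 1
            return True
        return False

    def release(self) -> None:
        if self.free >= self.total:
            raise RuntimeError("release() called more often than acquire")
        self.free += 1


pool = ReservedBufferPool(total=4, reserved_for_load=1)
dump_grants = [pool.try_acquire(high_priority=False) for _ in range(4)]
load_grant = pool.try_acquire(high_priority=True)
```

In this sketch the fourth dump request is refused while the load request still succeeds, which is the starvation-avoidance property the release note describes.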

3. Compress Store

  • UCM store compression module & compression config. #940
  • Compress Store logging & UX improvements. #949

4. SGLang

  • SGLang + UCM integration via dynamic backend loading. #886
  • SGLang quickstart docs & Dockerfile updates. #891

5. Other (stability / observability / tooling / engineering)

  • Profiling switch for performance testing. #875
  • DP fix: non-DP0 ranks updating file hotness correctly. #884
  • Connector metadata assert fix. #897
  • Duplicate metrics registration fix. #901
  • Hit-rate recording and follow-ups (1TP, partial-hit dump, etc.). #917 · #926 · #927
  • Trace generation support. #912
  • KV cache calculator: new model support. #899
  • TransBuffer / dump edge cases (reservation & ring bounds). #924
  • KVCacheLayout row-count fix. #920
  • MTP layer duplicate wait_for_layer_load fix. #932
  • Stricter save/load exception handling & cleanup. #929
  • vLLM 0.18.x–related patches (e.g. finalize_kv_cache, local_cache_hit). #934 · #941
  • Multi-node DP / multi-process GC fix. #937
  • Docs: distributed PD disaggregation on Ascend. #921
  • CI / packaging: Dockerfile refactor, metrics config in package, long e2e removed from PR gate. #902 · #916 · #947
  • Logging rate-limit switch, TPOT test + config updates. #931 · #944
  • Docs: quickstart & supported compute platforms. #890 · #896
  • Release housekeeping: 0.5.0rc2 & bump to 0.5.0. #915 · #956
  • Packaging (__init__.py, etc.). #935 · #943

Full Changelog: v0.5.0rc1...v0.5.0

What's Changed

  • [Feat]: add profiling switch for performance test by @harrisonyhq in #875
  • [BugFix] Fix the bug where non-DP0 processes fail to update file hotness in DP scenarios. by @UESTC-AHao in #884
  • [Perf] Pipeline-friendly shard task submission in CacheStore by @mag1c-h in #888
  • [doc] Update quickstart and calculator config by @yuanzhg078 in #890
  • [doc] Update supported compute platforms by @yuanzhg078 in #896
  • [Bugfix] Fix the bug of connector metadata assert failure by @sumingZero in #897
  • [Fix] Duplicate metrics registration by @flesher0813 in #901
  • [opt] Prevent Load starvation by reserving buffers for high-priority operations by @mag1c-h in #895
  • [CI] refactor dockerfile by @dante159753 in #902
  • [opt] Support full-graph replay for GSA-enabled sparse attention by @wangwenxin0312 in #907
  • feat(kv_cache_calculator): Add New Model Supports by @Potterluo in #899
  • [doc] Update SGLang quickstart docs & Update SGLang Dockerfile by @pyxyzc in #891
  • [Feature] Add SGLang UCM integration via dynamic backend loading by @pyxyzc in #886
  • [Opt] Support generate traces by @flesher0813 in #912
  • release 0.5.0rc2 by @qyh111 in #915
  • [CI] add metrics config into package by @dante159753 in #916
  • [Feat] add layerwise KV load patch for vllm-ascend v0.18.0 on DSA model by @sumingZero in #911
  • [Feat] Support recording hit rate by @flesher0813 in #917
  • [Fix]TransBuffer: shared freeHead let Dump bypass the reserved limit and ring bounds by @qyh111 in #924
  • [Opt] Support setting a threshold for kv_cache storage by @flesher0813 in #925
  • [Fix] Not print hit rate when using 1tp by @flesher0813 in #926
  • [Fix] Fix dump err when 0 < len(load_blocks) < fully hit by @flesher0813 in #927
  • [Fix] Update KVCacheLayout to set row counts. by @hmy98213 in #920
  • [bugfix]fix mtp_layer calls wait_for_layer_load multiple times by @qyh111 in #932
  • [Fix] Strengthen save/load exception check and remove unused codes by @flesher0813 in #929
  • [fix]Add patch for finalize_kv_cache in 0.18.0rc1 by @qyh111 in #934
  • Add __init__.py by @qyh111 in #935
  • [BugFix] Fix the issue of multiple processes enabling GC in multi-node DP scen… by @UESTC-AHao in #937
  • [feat] Add switch to log rate limit by @dante159753 in #931
  • [BugFix] Add patch for local_cache_hit calculation in vllm 0.18.0 met… by @sumingZero in #941
  • [Doc] Distributed PD Disaggregation on Ascend by @sumingZero in #921
  • [Bugfix] Add __init__.py by @sumingZero in #943
  • [CI] Remove long-running e2e tests from PR gate by @mag1c-h in #947
  • feat(test): Add TPOT metric calculation and update test configuration by @Potterluo in #944
  • [Feat] Add UCM store compression module and compression config parametersF...

v0.5.0rc1

28 Apr 08:05
82467ce

Pre-release

Full Changelog: v0.4.0...v0.5.0rc1

v0.4.0

20 Mar 10:13
bff8e9b

Highlights

Support SGLang

Refactor PipelineStore for Scalability and Performance

  • Refactor PipelineStore into a modular, plugin-based architecture with automatic registration and runtime loading (#689)
  • Improve overall performance through optimized Store implementations (e.g., cache store, posix store) and execution flow (#722, #744, #787)

UCM Connector

  • UCM now additionally supports advanced parallel paradigms, including PCP / DCP and PP, enabling more flexible and scalable distributed execution (#750)
  • Improve UCM connector performance by introducing optional event synchronization control (#768)

Inference Enhancement Features

  • GSAOnDevice sparse attention algorithm has been upgraded with improved performance and accuracy, now fully supporting vLLM / vLLM-Ascend 0.11.0 (#659, #746, #729)
  • Add support for Rerope in vLLM version 0.11.0 (#686)
  • Enhance UCM logger compatibility (#760)

Document

What's Changed

  • [fix] Adapt ESA to the LayerWiseConnector by @wangwenxin0312 in #681
  • [doc] Add Code of Conduct by @yuanzhg078 in #684
  • [Opt] GsaOnDevice cuda bugfix & optimization by @wangwenxin0312 in #659
  • [CI] Modify pull request template by @yuanzhg078 in #687
  • [Feature] rewrite logger module by @Lijiachen1018 in #608
  • [refactor] Rename global rank and remove broadcast function by @harrisonyhq in #685
  • [fix] clean code by @Lijiachen1018 in #688
  • [opt] Refactor PipelineStore for Enhanced Scalability by @mag1c-h in #689
  • [fix] Fix logger by @Lijiachen1018 in #690
  • [doc] How to extend UCM Store by @mag1c-h in #692
  • [CI] logger use zlibstatic by @Lijiachen1018 in #698
  • [Bugfix] Cherry-pick modify worker_id to distinguish diff workers(#691) by @flesher0813 in #701
  • [bugfix] rm unavailable lib and fix doc and update patch by @wuhuxiao in #699
  • [feat] rerope feature for vllm0.11.0 by @xinSky00 in #686
  • [perf] Reduce directory lock conflicts during batch dumps in PosixStore by @mag1c-h in #707
  • [bugfix] fix debug log printing by @Lijiachen1018 in #706
  • [bugfix] Fixed the issue of invalid LocalBuffer pointers in PCStore by @mag1c-h in #715
  • [bugfix] rerope feature for vllm0.9.2 and git apply merging by @xinSky00 in #708
  • [CICD] run e2e test in docker by @dante159753 in #712
  • [Feature] Add readme and dataset in performance and evaluation test by @zzycode1005 in #721
  • [bugfix] Adaptive modification of llmperf by @Menglths in #719
  • [Feat] sparse patch for vllm-ascend v0.11.0 by @Infinite666 in #718
  • [Bugfix]Fix garbled output when tp > 1 by @qyh111 in #716
  • [perf] Copy Bandwidth Optimize: Multi-Stream parallelism supported in CacheStore by @mag1c-h in #722
  • [Feat] sparse patch for gsa on device(GQA) va0.11.0rc1 by @Infinite666 in #726
  • [Feat]Add layerwise and log_path config in run.sh by @qyh111 in #724
  • [opt] Default depth of the waiting queue needs to be increased by nShard times for layer-wise by @mag1c-h in #731
  • [Feat] Reuse-aware layer skipping under dynamic KV sparsification by @tedi20 in #725
  • [opt] Increase the default running queue depth to support greater concurrent requests. by @qyh111 in #733
  • [Feat]: Monkey patch framework for vllm 0.11.0, fix graph mode + UCM bugs by @NaganooMei in #735
  • [Feat] Add csrc/ascend NPU custom ops for GSA by @leideng in #729
  • [feat] Variable length IO supported in CacheStore by @mag1c-h in #734
  • [Opt] Enable concurrent prefix lookup for posixstore by @sumingZero in #739
  • [CI] refine docker file to use in yellow field by @dante159753 in #741
  • [Feat]: Implement load failure recovery via monkey patch by @NaganooMei in #738
  • [Opt]Split the thread pool into separate load and dump pools to prevent them from interfering with each other. by @qyh111 in #744
  • [opt] Print TaskId in the CacheStore Error Log by @mag1c-h in #742
  • [opt]Adapt variable io size by @qyh111 in #745
  • [Opt] Add log timestamp in run_vllm.sh by @qyh111 in #747
  • [bugfix & opt] gsaOnDevice for CUDA Graph mode by @wangwenxin0312 in #732
  • [test & bugfix] fix low dump performance in posixstore e2e test by @NaganooMei in #751
  • [Fix] Modify the config files of gsaondevice. by @AooooooA-C in #749
  • [Test] Remove memory manager abstraction in PosixStore e2e test by @NaganooMei in #753
  • [opt] CUDA Hamming Distance Kernel Optimization for GQA by @wangwenxin0312 in #755
  • [fix] fix zlib gitcode url by @Lijiachen1018 in #758
  • [Feature] Integrate UnifiedCache (UCM) into SGLang for Multi-Level Caching System by @pyxyzc in #757
  • chore(test): Ensure that unnecessary import failures do not affect test execution by @Potterluo in #754
  • [feat] GSAOnDevice for MLA Models Like DeepSeek V2/V3 in Ascend NPU by @leideng in #746
  • [Feat] sparse patch for gsa on device(MLA) va0.11.0 by @Infinite666 in #761
  • [Fix] fix save_speed core dump and loaded blocks num when task failed by @flesher0813 in #763
  • Fix batch_size_for_hamming bug when slice is disabled (vllm-ascend 0.11.0) by @leideng in #765
  • [Feat] adapt dcp&pcp by @flesher0813 in #750
  • [Fix] Add __init__.py for rerope. by @AooooooA-C in #769
  • [Refactor]monkey patch sparse feature in v0.11.0 by @ayaka836 in #743
  • [Opt] update deepseek r1 config by @leideng in #770
  • [feat] Introduce platform-specific sparse trigger thresholds for GPU and NPU by @wangwenxin0312 in #762
  • [opt] Define UCM_ROOT_DIR to ensure safety when used UCM as a sub-repository by @mag1c-h in #772
  • [opt] enable Ascend register pin optimization by @mag1c-h in #775
  • [fix] remove imports that specific to platform by @dante159753 in #771
  • [opt] supports lo...

v0.3.0

30 Jan 08:47
8dd98d1

Highlights

  • Refinement of PipelineStore Architecture and Enhancement of Core Capabilities #653 #711
  • Now supports 3FS for scalable and efficient storage backends #622
  • Features the new GSAOnDevice sparse attention algorithm, enabling high-performance HBM utilization across both CUDA and Ascend platforms. #647 #638
  • Aligned CacheBlend with the new UCM storage and sparse engine updates to support vLLM 0.9.2. #664

Known Issues

  • Layerwise is not supported with vLLM 0.11.0
    • Installing with pip install uc-manager does not currently support vLLM 0.11.0.
    • To use vLLM 0.11.0+ with UCM layerwise, refer to vllm-project/vllm#26675 for the required modifications.

Full Changelog: v0.2.0...v0.3.0

v0.2.0

05 Jan 12:28
39d46c7

Highlights

  • Support model window extrapolation with Rectified Rotary Position Embeddings (ReRoPE). (#497)
  • Support sparse attention algorithms in HBM on both CUDA GPUs and Ascend NPUs. Attention is sparsified by hashing KV states and selecting the Top-K entries by Hamming distance. (#559)
  • Add Pipeline Store, composed of Cache Store and POSIX Store. (#553)
  • Improve KV cache transfer performance for NfsStore. (#393)
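
The hashing plus Hamming-distance Top-K selection mentioned for the sparse attention feature can be sketched as follows. This is a minimal NumPy illustration, not the UCM kernel: the release notes do not say which hash function is used, so sign-of-random-projection hashing is an assumption here, and the names `hash_signs` and `hamming_topk` are hypothetical.

```python
import numpy as np


def hash_signs(x: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Map float vectors (n, d) to binary codes (n, bits) via projection signs."""
    return (x @ proj > 0).astype(np.uint8)


def hamming_topk(query_code: np.ndarray, key_codes: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k keys closest to the query in Hamming distance."""
    # Hamming distance = number of bit positions where the codes differ.
    dist = np.count_nonzero(key_codes != query_code, axis=1)
    k = min(k, len(dist))
    # A stable sort resolves ties in favour of the lower index.
    return np.argsort(dist, kind="stable")[:k]


rng = np.random.default_rng(0)
d, bits, n_keys = 64, 128, 1024
proj = rng.standard_normal((d, bits))             # assumed random projection
keys = rng.standard_normal((n_keys, d))           # stand-in for cached key states
query = keys[42] + 0.01 * rng.standard_normal(d)  # query close to key 42

key_codes = hash_signs(keys, proj)
query_code = hash_signs(query[None, :], proj)[0]
selected = hamming_topk(query_code, key_codes, k=8)
```

Only the selected indices would then take part in the attention computation. Comparing short binary codes is far cheaper than computing full dot products against every cached key, which is the motivation for this style of selection.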

Known Issues

  • Sparse is not supported when installing via pip
    • Currently, installing with pip install uc-manager does not support Sparse.
    • Before installing via pip, please make sure to set the platform explicitly:
      export PLATFORM=xxx
    • To use Sparse, please install via the Docker image or build from source.

Full Changelog: v0.1.2...v0.2.0

v0.2.0rc1

13 Dec 13:43
bad9354

Pre-release

Highlights

  • Improved Prefix Cache offload/load performance.
  • Support Cache Blend.

Core:

  • Support Cache Blend (#467)
  • Add V1 Store Interface (#510, #518)

Known Issues

  • When using the Ascend platform:
    • Broadcasting is not supported.
    • load_only_first_rank must be set to false in the configuration.
  • When compiling from source, make sure to set the PLATFORM environment variable.

Full Changelog: v0.1.2...v0.2.0rc1

v0.1.2

10 Dec 07:56
aa31619

Some small fixes in this release.

  • [Docs] Documents are now easier to read.
  • [Docs] PD disaggregation documentation update: the --enforce-eager argument has been removed from the vLLM service startup command, so graph mode is enabled by default at startup.
  • [Feat] Completely remove UCconnector; use UCMConnector from now on.
  • [Feat] UCM supports recovery from load failure: the get_block_ids_with_load_errors interface of the KVConnectorBase_V1 class is implemented, enabling vLLM to re-execute inference for requests whose KV cache failed to load from UCM.
  • [Build] Install with pip install uc-manager==0.1.2; the install builds from source for both vllm and vllm-ascend.
  • [Build] The Sparse module is now built and used only if the environment variable is set with export ENABLE_SPARSE=TRUE.
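
The load-failure recovery item can be pictured with a small standalone sketch. It assumes that the connector records the IDs of blocks whose load failed and hands them to the engine as a set, so that the affected requests can be recomputed. Only the method name `get_block_ids_with_load_errors` comes from the release note; the `LoadErrorTracker` class, the `record_load_result` method, and the clear-on-read behaviour are hypothetical and do not reproduce the UCM or vLLM implementation.

```python
class LoadErrorTracker:
    """Hypothetical helper that collects IDs of blocks whose KV load failed."""

    def __init__(self) -> None:
        self._failed: set[int] = set()

    def record_load_result(self, block_ids: list[int], ok: bool) -> None:
        # Remember every block that belonged to a failed load task.
        if not ok:
            self._failed.update(block_ids)

    def get_block_ids_with_load_errors(self) -> set[int]:
        # Hand the failures over once, then start from a clean slate
        # (clear-on-read is an assumption of this sketch).
        failed, self._failed = self._failed, set()
        return failed


tracker = LoadErrorTracker()
tracker.record_load_result([3, 4], ok=False)
tracker.record_load_result([5], ok=True)
first_report = tracker.get_block_ids_with_load_errors()
second_report = tracker.get_block_ids_with_load_errors()
```

Here the first report contains blocks 3 and 4, and the second report is empty because the failures have already been handed over.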

Full Changelog: v0.1.0...v0.1.2

v0.1.0

02 Dec 08:42
5ba2684

We are excited to announce the first official release of Unified Cache Manager.

Highlights

  • Offload Prefix Cache to storage.
  • Homogeneous/heterogeneous PD disaggregation.
  • Training-free sparsity for accelerating inference (vllm==0.9.2, vllm-ascend==0.9.2rc1) in #199, #335, #190, #451

Core:

  • Garbage collection for store in #315 and #312
  • Adapt to vllm and vllm-ascend in #13, #292, #415 and #362
  • UCM supports online metrics display via Grafana and Prometheus in #414 and docs in #416

Known Issues

When using the Ascend platform, be aware of the following:

  • Broadcasting is not supported.
  • load_only_first_rank must be set to false in the configuration.

Others

  • Update documents
  • Tools for performance tuning and hyperparameter optimization in #418

Full Changelog: v0.1.0rc4...v0.1.0

v0.1.0rc4

22 Nov 10:16
5779ce9

Pre-release

Full Changelog: v0.1.0rc2...v0.1.0rc4

v0.1.0rc2

19 Nov 08:01
16ed5da

Pre-release

Full Changelog: v0.1.0rc1...v0.1.0rc2