[PD Disaggregation] Prefill and decode support cache storage by juncaipeng · Pull Request #6768 · PaddlePaddle/FastDeploy

juncaipeng · 2026-03-10T13:48:52Z

Motivation

Prefill and decode support cache storage

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

PrefixCacheManager
ResourceManager

Usage or Command

Refer to examples/cache_storage/run_03b_pd.sh

Accuracy Tests

None

Checklist

Add at least one tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot · 2026-03-10T13:49:02Z

Thanks for your contribution!

codecov-commenter · 2026-03-10T18:07:17Z

Codecov Report

❌ Patch coverage is 47.05882% with 27 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@b0fd242).

Files with missing lines	Patch %	Lines
fastdeploy/cache_manager/prefix_cache_manager.py	35.29%	21 Missing and 1 partial ⚠️
fastdeploy/engine/sched/resource_manager_v1.py	70.58%	0 Missing and 5 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             develop    #6768   +/-   ##
==========================================
  Coverage           ?   72.28%           
==========================================
  Files              ?      394           
  Lines              ?    54297           
  Branches           ?     8508           
==========================================
  Hits               ?    39248           
  Misses             ?    12241           
  Partials           ?     2808

Flag	Coverage Δ
GPU	`72.28% <47.05%> (?)`

Copilot

Pull request overview

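This PR adds the ability to write KV cache back to an external storage backend in PD Disaggregation (separate Prefill/Decode deployment) scenarios. In particular, it lets Decode instances persist cache without relying on the Radix Tree, so that cache can be reused across instances and across turns.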

Changes:

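Adds a simplified write-back method for the Decode scenario, write_cache_to_storage_decode(), to PrefixCacheManager. It computes chained hash keys directly from token_ids and writes them to storage.
Adds guards keyed on the splitwise role in ResourceManagerV1 so that Decode instances skip the prefix/output cache update and release logic that depends on the Radix Tree, and calls the decode write-back method when a request finishes.
Removes the argument post-processing logic that forced enable_prefix_caching off for the decode role, and adds an example script for PD + storage.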
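
The chained-key idea in the first item can be illustrated with a minimal sketch. This is not the PR's implementation: `hash_block` is a hypothetical stand-in for FastDeploy's `get_hash_str`, whose exact signature and serialization are not shown here, and `chained_block_keys` is an illustrative name. The sketch only shows how each full block's key can depend on the key of the block before it, and how an incomplete trailing block is skipped.

```python
import hashlib
from typing import List, Optional


def hash_block(block_token_ids: List[int], parent_key: Optional[str]) -> str:
    """Hypothetical stand-in for get_hash_str: hash a block with its parent key."""
    payload = repr((tuple(block_token_ids), parent_key)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def chained_block_keys(token_ids: List[int], block_size: int) -> List[str]:
    """Return one key per full block; each key depends on all preceding blocks."""
    keys: List[str] = []
    parent_key: Optional[str] = None
    for start in range(0, len(token_ids), block_size):
        block = token_ids[start : start + block_size]
        if len(block) < block_size:
            break  # an incomplete trailing block is not cached
        parent_key = hash_block(block, parent_key)
        keys.append(parent_key)
    return keys


# 10 tokens with block_size=4 give two full blocks; the last 2 tokens are skipped.
keys = chained_block_keys(list(range(10)), block_size=4)
print(len(keys))
```

Because every key folds in its parent key, two requests share a key for block N only if all tokens up to and including block N are identical, which is what lets a reader match a stored prefix without walking a Radix Tree.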

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.

File	Description
fastdeploy/engine/sched/resource_manager_v1.py	Adds role guards around cache-related logic for splitwise decode, and distinguishes the P and D write-back paths on finish
fastdeploy/engine/args_utils.py	Removes the logic that forced enable_prefix_caching off for the decode role, so that the decode side can enable a storage backend
fastdeploy/cache_manager/prefix_cache_manager.py	Adds `write_cache_to_storage_decode()`, so that the decode side can generate keys and write back to storage without relying on the Radix Tree
examples/cache_storage/run_03b_pd.sh	Adds an end-to-end example script for PD disaggregation + Mooncake storage

Copilot · 2026-03-11T09:23:49Z

fastdeploy/cache_manager/prefix_cache_manager.py

+        for i in range(0, len(token_ids), block_size):
+            block_token_ids = token_ids[i : i + block_size]
+            if len(block_token_ids) < block_size:
+                break  # Do not cache incomplete block
+
+            # Calculate hash key for current block
+            key = get_hash_str(block_token_ids, prefix_block_key)


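In write_cache_to_storage_decode, key generation uses only the chained hash (prefix_block_key), but in multimodal scenarios PrefixCacheManager folds extra_keys such as mm_hashes into the hash via get_block_hash_extra_keys() (see mm_build_path/mm_match_block). Because extra_keys are not handled here, the storage keys that the Decode side writes for multimodal requests will not match the keys that the Prefill side reads or matches, so the cache cannot be hit. Consider reusing the get_block_hash_extra_keys logic and maintaining mm_idx/prefix_block_key so that key generation stays consistent with mm_build_path.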

Suggested change

-        for i in range(0, len(token_ids), block_size):
-            block_token_ids = token_ids[i : i + block_size]
-            if len(block_token_ids) < block_size:
-                break  # Do not cache incomplete block
-
-            # Calculate hash key for current block
-            key = get_hash_str(block_token_ids, prefix_block_key)
+        # Try to reuse multimodal extra keys for hash, keeping compatibility
+        extra_keys_map = {}
+        if hasattr(self, "get_block_hash_extra_keys"):
+            try:
+                extra_keys = self.get_block_hash_extra_keys(request)
+                if isinstance(extra_keys, dict):
+                    extra_keys_map = extra_keys
+                elif isinstance(extra_keys, (list, tuple)):
+                    extra_keys_map = {idx: v for idx, v in enumerate(extra_keys)}
+            except TypeError:
+                # Backward compatibility: ignore extra keys if signature mismatch
+                extra_keys_map = {}
+        for i in range(0, len(token_ids), block_size):
+            block_token_ids = token_ids[i : i + block_size]
+            if len(block_token_ids) < block_size:
+                break  # Do not cache incomplete block
+
+            # Calculate hash key for current block, including extra keys if any
+            block_idx = i // block_size
+            block_extra_keys = extra_keys_map.get(block_idx)
+            if block_extra_keys is None:
+                block_extra_keys = []
+            key_prefix = prefix_block_key + list(block_extra_keys)
+            key = get_hash_str(block_token_ids, key_prefix)
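
The mismatch described in this comment can be reproduced with a small standalone sketch. It makes several assumptions: `block_key` is a hypothetical helper, not FastDeploy's `get_hash_str`; the token ids and the `image_hash` value are made up; and the way extra keys are folded into the hash is only illustrative. The point is that a writer that omits the extra keys produces a different key from a reader that includes them, even for identical tokens.

```python
import hashlib
from typing import Optional, Sequence


def block_key(
    block_token_ids: Sequence[int],
    parent_key: Optional[str],
    extra_keys: Sequence[str] = (),
) -> str:
    """Hypothetical block hash that folds in optional multimodal extra keys."""
    payload = repr((tuple(block_token_ids), parent_key, tuple(extra_keys)))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


block = [101, 102, 103, 104]  # made-up token ids for one full block
image_hash = "img-abc123"     # made-up multimodal hash covering this block

# Reader side (Prefill in the comment above) includes the extra key.
reader_key = block_key(block, None, [image_hash])

# Writer side without extra keys: same tokens, different key, so no cache hit.
writer_key_without = block_key(block, None)

# Writer side with the same extra key: keys agree again.
writer_key_with = block_key(block, None, [image_hash])

print(reader_key == writer_key_without)
print(reader_key == writer_key_with)
```

For text-only requests the extra keys are empty on both sides, so the two paths would agree; the divergence only appears once a block carries multimodal content.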

fastdeploy/cache_manager/prefix_cache_manager.py

Copilot · 2026-03-11T09:23:50Z