
[PD Disaggregation] Prefill and decode support cache storage#6768

Open
juncaipeng wants to merge 2 commits into PaddlePaddle:develop from juncaipeng:pd_cache_storage

@juncaipeng
Collaborator

Motivation

Prefill and decode support cache storage

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

PrefixCacheManager
ResourceManager

Usage or Command

Refer to examples/cache_storage/run_03b_pd.sh

Accuracy Tests

None

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot
paddle-bot bot commented Mar 10, 2026

Thanks for your contribution!

@codecov-commenter
codecov-commenter commented Mar 10, 2026

Codecov Report

❌ Patch coverage is 47.05882% with 27 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@b0fd242). Learn more about missing BASE report.

Files with missing lines | Patch % | Lines
fastdeploy/cache_manager/prefix_cache_manager.py | 35.29% | 21 Missing and 1 partial ⚠️
fastdeploy/engine/sched/resource_manager_v1.py | 70.58% | 0 Missing and 5 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #6768   +/-   ##
==========================================
  Coverage           ?   72.28%           
==========================================
  Files              ?      394           
  Lines              ?    54297           
  Branches           ?     8508           
==========================================
  Hits               ?    39248           
  Misses             ?    12241           
  Partials           ?     2808           
Flag | Coverage Δ
GPU | 72.28% <47.05%> (?)

Contributor

Copilot AI left a comment

Pull request overview

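This PR adds the ability to write the KV cache back to an external storage backend in the PD Disaggregation (separate Prefill/Decode deployment) scenario. In particular, it lets Decode instances persist their cache without relying on the Radix Tree, so that cached data can be reused across instances and across turns.

Changes:

  • PrefixCacheManager gains a simplified write-back method for the Decode scenario, write_cache_to_storage_decode(), which computes chained hash keys directly from token_ids and writes them to storage.
  • ResourceManagerV1 adds guard conditions based on the splitwise role so that Decode instances skip the prefix/output cache update and release logic that depends on the Radix Tree, and it calls the decode write-back method when a request finishes.
  • Removes the argument post-processing that forcibly disabled enable_prefix_caching for the decode role, and adds an example script for PD + storage.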
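
The chained-key write-back described in the first bullet can be sketched as follows. This is a minimal illustration, not the PR's implementation: SHA-256 stands in for FastDeploy's get_hash_str (whose internals are not shown here), and chained_block_keys is a hypothetical name.

```python
import hashlib
from typing import List, Sequence


def chained_block_keys(token_ids: Sequence[int], block_size: int) -> List[str]:
    """Return one key per full block; each key also depends on the previous key."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    keys: List[str] = []
    prev_key = ""  # the first block has no parent key
    for start in range(0, len(token_ids), block_size):
        block = token_ids[start : start + block_size]
        if len(block) < block_size:
            break  # an incomplete trailing block is not cached
        # Assumption: SHA-256 over "parent|tokens" replaces the real hash helper.
        payload = prev_key + "|" + ",".join(str(t) for t in block)
        prev_key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        keys.append(prev_key)
    return keys


keys_a = chained_block_keys([1, 2, 3, 4, 5, 6, 7], block_size=3)
keys_b = chained_block_keys([1, 2, 3, 9, 9, 9], block_size=3)
```

Because each key folds in its parent, keys_a and keys_b share their first key (same first block) but differ on the second, which is what allows prefix reuse without a tree structure.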

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.

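File | Description
fastdeploy/engine/sched/resource_manager_v1.py | Adds role guards around cache-related logic for splitwise decode, and distinguishes the P and D write-back paths on finish
fastdeploy/engine/args_utils.py | Removes the logic that forcibly disabled enable_prefix_caching for the decode role, so the decode side can enable a storage backend
fastdeploy/cache_manager/prefix_cache_manager.py | Adds write_cache_to_storage_decode(), which lets the decode side generate keys and write back to storage without relying on the Radix Tree
examples/cache_storage/run_03b_pd.sh | Adds an end-to-end example script for PD disaggregation + Mooncake storage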

Comment on lines +1154 to +1160
for i in range(0, len(token_ids), block_size):
    block_token_ids = token_ids[i : i + block_size]
    if len(block_token_ids) < block_size:
        break  # Do not cache incomplete block

    # Calculate hash key for current block
    key = get_hash_str(block_token_ids, prefix_block_key)

Copilot AI Mar 11, 2026

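In write_cache_to_storage_decode, key generation only applies the chained hash (prefix_block_key). In multimodal scenarios, however, PrefixCacheManager folds extra_keys such as mm_hashes into the hash via get_block_hash_extra_keys() (see mm_build_path/mm_match_block). Not handling extra_keys here means that, for multimodal requests, the storage keys written on the Decode side will not match those read/matched on the Prefill side, so the cache cannot be hit. Consider reusing the get_block_hash_extra_keys logic and maintaining mm_idx/prefix_block_key so that key generation stays consistent with mm_build_path.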

Suggested change
- for i in range(0, len(token_ids), block_size):
-     block_token_ids = token_ids[i : i + block_size]
-     if len(block_token_ids) < block_size:
-         break  # Do not cache incomplete block
-     # Calculate hash key for current block
-     key = get_hash_str(block_token_ids, prefix_block_key)
+ # Try to reuse multimodal extra keys for hash, keeping compatibility
+ extra_keys_map = {}
+ if hasattr(self, "get_block_hash_extra_keys"):
+     try:
+         extra_keys = self.get_block_hash_extra_keys(request)
+         if isinstance(extra_keys, dict):
+             extra_keys_map = extra_keys
+         elif isinstance(extra_keys, (list, tuple)):
+             extra_keys_map = {idx: v for idx, v in enumerate(extra_keys)}
+     except TypeError:
+         # Backward compatibility: ignore extra keys if signature mismatch
+         extra_keys_map = {}
+ for i in range(0, len(token_ids), block_size):
+     block_token_ids = token_ids[i : i + block_size]
+     if len(block_token_ids) < block_size:
+         break  # Do not cache incomplete block
+     # Calculate hash key for current block, including extra keys if any
+     block_idx = i // block_size
+     block_extra_keys = extra_keys_map.get(block_idx)
+     if block_extra_keys is None:
+         block_extra_keys = []
+     key_prefix = prefix_block_key + list(block_extra_keys)
+     key = get_hash_str(block_token_ids, key_prefix)
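
To make the mismatch in this comment concrete, here is a small sketch of what happens when one side mixes extra keys into a block hash and the other does not. Everything in it is illustrative: block_key is a hypothetical helper, SHA-256 stands in for get_hash_str, and "img-abc" is a made-up multimodal hash.

```python
import hashlib
from typing import Sequence


def block_key(block: Sequence[int], parent_key: str, extra_keys: Sequence[str] = ()) -> str:
    """Hash a block together with its parent key and any extra (e.g. multimodal) keys."""
    parts = [parent_key, ",".join(map(str, block)), ",".join(extra_keys)]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


block = [11, 12, 13, 14]
prefill_key = block_key(block, "", extra_keys=["img-abc"])       # prefill includes the image hash
decode_key_plain = block_key(block, "")                          # decode ignores extra keys
decode_key_fixed = block_key(block, "", extra_keys=["img-abc"])  # decode mirrors prefill

# What a decode-side write without extra keys would leave in storage.
storage = {decode_key_plain: "kv-bytes"}
prefill_hits_plain = prefill_key in storage          # the prefill-side lookup misses
keys_agree_when_fixed = prefill_key == decode_key_fixed
```

The lookup only succeeds when both sides feed identical inputs into the hash, which is why the decode path needs the same extra-key handling as the prefill path.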

Copilot uses AI. Check for mistakes.

# ======================== Cleanup and preparation ========================
unset http_proxy && unset https_proxy
rm -rf log_*
find /dev/shm -type f -print0 2>/dev/null | xargs -0 rm -f 2>/dev/null || true

Copilot AI Mar 11, 2026

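Cleaning up every file under /dev/shm here will also delete shared-memory files unrelated to FastDeploy/Mooncake, which is very risky on shared machines or when other services run on the same host. Consider restricting the cleanup to the prefixes or explicit file patterns that this script creates/uses, or providing an optional switch so users can decide whether to run the cleanup.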

Suggested change
- find /dev/shm -type f -print0 2>/dev/null | xargs -0 rm -f 2>/dev/null || true
+ # Only clean shared memory files created by this service by default.
+ # Set CLEAN_ALL_DEV_SHM=1 if you really want to remove all files under /dev/shm (use with caution).
+ if [[ "${CLEAN_ALL_DEV_SHM:-0}" == "1" ]]; then
+     find /dev/shm -type f -print0 2>/dev/null | xargs -0 rm -f 2>/dev/null || true
+ else
+     find /dev/shm -maxdepth 1 -type f \( -name "mooncake_*" -o -name "fastdeploy_*" \) -print0 2>/dev/null | xargs -0 rm -f 2>/dev/null || true
+ fi
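
The scoped-cleanup idea can also be sketched in Python. The patterns "mooncake_*" and "fastdeploy_*" are taken from the suggestion above; whether FastDeploy or Mooncake actually name their shared-memory files this way is not confirmed here. remove_matching_files is a hypothetical helper, and the demo runs on a throwaway directory rather than the real /dev/shm.

```python
import fnmatch
import os
import tempfile
from typing import Iterable, List


def remove_matching_files(directory: str, patterns: Iterable[str]) -> List[str]:
    """Delete regular files directly under `directory` whose names match any pattern."""
    patterns = list(patterns)
    with os.scandir(directory) as it:
        entries = list(it)  # snapshot first, so deletion does not disturb iteration
    removed: List[str] = []
    for entry in entries:
        if not entry.is_file(follow_symlinks=False):
            continue  # skip directories and symlinks
        if any(fnmatch.fnmatchcase(entry.name, p) for p in patterns):
            os.remove(entry.path)
            removed.append(entry.name)
    return sorted(removed)


with tempfile.TemporaryDirectory() as tmp:
    for name in ("mooncake_seg0", "fastdeploy_ipc", "other_service.lock"):
        open(os.path.join(tmp, name), "w").close()
    removed = remove_matching_files(tmp, ["mooncake_*", "fastdeploy_*"])
    remaining = sorted(os.listdir(tmp))
```

Only the two files matching the patterns are deleted; the unrelated file is left in place.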

Comment on lines +33 to +39
# ======================== Cleanup and preparation ========================
unset http_proxy && unset https_proxy
rm -rf log_*
find /dev/shm -type f -print0 2>/dev/null | xargs -0 rm -f 2>/dev/null || true
bash stop.sh 2>/dev/null || true

source ./utils.sh

Copilot AI Mar 11, 2026

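The script calls stop.sh / utils.sh via relative paths (bash stop.sh, source ./utils.sh), which requires it to be run from a specific working directory. However, the current directory does not contain stop.sh (stop.sh lives in the examples/splitwise or fastdeploy directory), so the example can easily fail to run. Consider resolving paths from the script's own directory (rather than depending on the cwd), and either pointing to the correct stop.sh location or providing the corresponding script in this directory.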

Suggested change
- # ======================== Cleanup and preparation ========================
- unset http_proxy && unset https_proxy
- rm -rf log_*
- find /dev/shm -type f -print0 2>/dev/null | xargs -0 rm -f 2>/dev/null || true
- bash stop.sh 2>/dev/null || true
- source ./utils.sh
+ # Resolve relative paths based on the script's own directory
+ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+ # ======================== Cleanup and preparation ========================
+ unset http_proxy && unset https_proxy
+ rm -rf log_*
+ find /dev/shm -type f -print0 2>/dev/null | xargs -0 rm -f 2>/dev/null || true
+ if [ -x "${SCRIPT_DIR}/stop.sh" ]; then
+     bash "${SCRIPT_DIR}/stop.sh" 2>/dev/null || true
+ fi
+ if [ -f "${SCRIPT_DIR}/utils.sh" ]; then
+     # shellcheck source=/dev/null
+     source "${SCRIPT_DIR}/utils.sh"
+ else
+     echo "utils.sh not found in ${SCRIPT_DIR}, aborting."
+     exit 1
+ fi
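
The same "resolve relative to the script, not the cwd" pattern looks like this in Python. sibling_of is a hypothetical helper, and the file names are created in a temporary directory purely for the demo; in a real script the first argument would typically be `__file__`.

```python
import tempfile
from pathlib import Path
from typing import Optional


def sibling_of(script_path: str, name: str) -> Optional[Path]:
    """Return the path of `name` next to `script_path`, or None if it is not a file."""
    candidate = Path(script_path).resolve().parent / name
    return candidate if candidate.is_file() else None


with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "run_03b_pd.sh"
    script.touch()
    (Path(tmp) / "utils.sh").touch()  # present next to the script
    found_utils = sibling_of(str(script), "utils.sh")
    found_stop = sibling_of(str(script), "stop.sh")  # deliberately not created
    utils_found = found_utils is not None
    stop_found = found_stop is not None
```

Returning None for a missing helper lets the caller decide whether to abort (as for utils.sh) or continue (as for stop.sh), independent of the working directory.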

# =============================================================================

# ======================== Environment variable configuration ========================
export MODEL_NAME="/work/models/PaddlePaddle/ERNIE-4.5-0.3B-Paddle"

Copilot AI Mar 11, 2026

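MODEL_NAME is hard-coded to the local absolute path /work/models/..., which hurts portability and is inconsistent with how run.sh in the same directory identifies the model. Consider using the same default as run.sh (e.g. PaddlePaddle/ERNIE-4.5-0.3B-Paddle), or allowing it to be overridden via an environment variable/argument and documenting that in the script comments.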

Suggested change
- export MODEL_NAME="/work/models/PaddlePaddle/ERNIE-4.5-0.3B-Paddle"
+ # MODEL_NAME can be overridden by environment variable before running this script
+ : "${MODEL_NAME:=PaddlePaddle/ERNIE-4.5-0.3B-Paddle}"
+ export MODEL_NAME
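
For comparison, the "use the environment value if set and non-empty, otherwise fall back to a default" behaviour of the shell's `:=` expansion can be sketched in Python. model_name is a hypothetical helper; the default string is the one from the suggestion above.

```python
import os
from typing import Mapping

DEFAULT_MODEL = "PaddlePaddle/ERNIE-4.5-0.3B-Paddle"


def model_name(env: Mapping[str, str] = os.environ) -> str:
    """Return MODEL_NAME from `env`, falling back to the default when unset or empty."""
    return env.get("MODEL_NAME") or DEFAULT_MODEL


from_default = model_name({})
from_empty = model_name({"MODEL_NAME": ""})
from_override = model_name({"MODEL_NAME": "/work/models/custom"})
```

Passing the environment as a mapping keeps the helper easy to test without mutating the real process environment.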

Comment on lines +4 to +8
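# =============================================================================
# PD 分离 + 全局 Cache 池化测试脚本 [PD disaggregation + global cache pooling test script]
# 参考: start_v1_tp1.sh (PD 分离) + run.sh (Mooncake Cache 池化) [Reference: start_v1_tp1.sh (PD disaggregation) + run.sh (Mooncake cache pooling)]
# 注意修改:PD实例的CUDA_VISIBLE_DEVICES环境变量 [Note: adjust the CUDA_VISIBLE_DEVICES environment variable of the PD instances]
# =============================================================================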

Copilot AI Mar 11, 2026

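This example script adds many Chinese comments (for example, the file header description). The README.md and run.sh in the same directory are written mainly in English; to make the script readable for a wider range of users/maintainers, consider switching the key comments to English (keeping some Chinese where necessary is fine, but English should be primary).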

Comment on lines +112 to +115
--kvcache-storage-backend mooncake \
2>&1 > ${FD_LOG_DIR}/nohup &

# --kvcache-storage-backend mooncake \

Copilot AI Mar 11, 2026

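The Decode instance enables --kvcache-storage-backend here without explicitly passing --enable-prefix-caching; it currently works mainly because EngineArgs.enable_prefix_caching defaults to True. To keep the example script from failing outright if that default changes (args_utils validates that a storage backend requires enable_prefix_caching=True), consider adding --enable-prefix-caching explicitly to the Decode launch command and noting in a comment that on the Decode side it is only used to enable storage write-back.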

Suggested change
- --kvcache-storage-backend mooncake \
- 2>&1 > ${FD_LOG_DIR}/nohup &
- # --kvcache-storage-backend mooncake \
+ --enable-prefix-caching \
+ --kvcache-storage-backend mooncake \
+ 2>&1 > ${FD_LOG_DIR}/nohup &
+ # Note: enable_prefix_caching is required when kvcache storage backend is enabled.
+ # On Decode side, prefix caching is only used to enable storage write-back.
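
The dependency this comment relies on (a storage backend requires prefix caching to be enabled) can be sketched as a small validation step. CacheArgs and validate are hypothetical stand-ins for illustration; they are not the actual EngineArgs or args_utils code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheArgs:
    """Hypothetical stand-in for the two engine arguments discussed above."""

    enable_prefix_caching: bool = True
    kvcache_storage_backend: Optional[str] = None


def validate(args: CacheArgs) -> None:
    """Reject a storage backend when prefix caching is disabled."""
    if args.kvcache_storage_backend is not None and not args.enable_prefix_caching:
        raise ValueError("kvcache_storage_backend requires enable_prefix_caching=True")


# Works today only because the default happens to be True.
validate(CacheArgs(kvcache_storage_backend="mooncake"))

# If the default ever flipped, the same launch configuration would be rejected.
try:
    validate(CacheArgs(enable_prefix_caching=False, kvcache_storage_backend="mooncake"))
    rejected = False
except ValueError:
    rejected = True
```

Passing the flag explicitly in the launch command removes the reliance on the default value.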
