[PD Disaggregation] Prefill and decode support cache storage#6768
juncaipeng wants to merge 2 commits into PaddlePaddle:develop from …
Thanks for your contribution!
Codecov Report
❌ Patch coverage is …

Coverage Diff (develop vs. #6768):

| | develop | #6768 |
|---|---|---|
| Coverage | ? | 72.28% |
| Files | ? | 394 |
| Lines | ? | 54297 |
| Branches | ? | 8508 |
| Hits | ? | 39248 |
| Misses | ? | 12241 |
| Partials | ? | 2808 |

Flags with carried forward coverage won't be shown.
Pull request overview
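This PR adds the ability to write the KV cache back to external storage (a storage backend) in PD Disaggregation (Prefill/Decode disaggregated deployment) scenarios. In particular, it lets Decode instances persist their cache to storage without relying on the Radix Tree, so cached KV can be reused across instances and across conversation turns.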
Changes:
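- Adds `write_cache_to_storage_decode()` to `PrefixCacheManager`, a simplified write-back method for the Decode scenario that computes chained hash keys directly from `token_ids` and writes the blocks to storage.
- Adds splitwise-role guards in `ResourceManagerV1` so that Decode instances skip the prefix/output cache update and release logic that depends on the Radix Tree, and calls the decode write-back method when a request finishes.
- Removes the argument post-processing that forced `enable_prefix_caching` off for the decode role, and adds an example script for PD + storage.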
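The chained-key scheme described in the first bullet (one key per full block, each key derived from the block before it) can be sketched as follows. This is a minimal illustration under assumptions: `get_hash_str_sketch` is a hypothetical stand-in for FastDeploy's `get_hash_str`, and the real hash function, key format, and chaining details may differ.

```python
import hashlib
from typing import List, Optional


def get_hash_str_sketch(block_token_ids: List[int], prefix_key: Optional[str]) -> str:
    """Hypothetical stand-in for get_hash_str: SHA-256 over the previous
    block's key and this block's token IDs."""
    h = hashlib.sha256()
    h.update((prefix_key or "").encode("utf-8"))
    h.update(b"|")  # separator so the prefix and the tokens cannot run together
    h.update(",".join(str(t) for t in block_token_ids).encode("utf-8"))
    return h.hexdigest()


def chained_block_keys(token_ids: List[int], block_size: int) -> List[str]:
    """One key per *full* block; each key depends on every block before it."""
    keys: List[str] = []
    prefix_key: Optional[str] = None
    for i in range(0, len(token_ids), block_size):
        block = token_ids[i : i + block_size]
        if len(block) < block_size:
            break  # the incomplete trailing block is not cached
        key = get_hash_str_sketch(block, prefix_key)
        keys.append(key)
        prefix_key = key  # chain: the next key is derived from this one
    return keys
```

Because the keys are chained, a later request whose token IDs share the same leading full blocks produces identical keys for those blocks, which is what lets a lookup find entries written earlier; a block with the same tokens but a different history gets a different key.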
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| fastdeploy/engine/sched/resource_manager_v1.py | Adds role guards around cache-related logic for splitwise decode, and separates the P and D write-back paths when a request finishes |
| fastdeploy/engine/args_utils.py | Removes the logic that forced `enable_prefix_caching` off for the decode role, so the decode side can enable a storage backend |
| fastdeploy/cache_manager/prefix_cache_manager.py | Adds `write_cache_to_storage_decode()`, letting the decode side generate keys and write back to storage without the Radix Tree |
| examples/cache_storage/run_03b_pd.sh | Adds an end-to-end example script for PD disaggregation + Mooncake storage |
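The role-based split in `resource_manager_v1.py` can be pictured with the minimal sketch below. It only records which steps each role would take. All class names, the role strings, and the step labels are hypothetical; only `write_cache_to_storage_decode` comes from this PR, and the real `ResourceManagerV1` logic is more involved.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FinishedRequest:
    # Hypothetical minimal request object: only what this sketch needs.
    request_id: str
    token_ids: List[int]


@dataclass
class FinishPathRouter:
    """Hypothetical dispatcher that picks a cache write-back path by role.
    It records the steps it would take instead of touching a real cache."""

    splitwise_role: str  # assumed values: "mixed", "prefill", "decode"
    steps: List[str] = field(default_factory=list)

    def on_request_finished(self, req: FinishedRequest) -> None:
        if self.splitwise_role == "decode":
            # Decode keeps no Radix Tree: skip tree updates/releases and
            # derive storage keys straight from the request's token IDs.
            self.steps.append(f"write_cache_to_storage_decode:{req.request_id}")
        else:
            # Prefill / mixed: the Radix Tree is maintained, so update it
            # and write back through the tree-based path.
            self.steps.append(f"update_radix_tree:{req.request_id}")
            self.steps.append(f"radix_tree_writeback:{req.request_id}")
```

Guarding on the role at this single decision point keeps the Decode path from ever touching tree state it does not own.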

`fastdeploy/cache_manager/prefix_cache_manager.py`:

    for i in range(0, len(token_ids), block_size):
        block_token_ids = token_ids[i : i + block_size]
        if len(block_token_ids) < block_size:
            break  # Do not cache incomplete block
        # Calculate hash key for current block
        key = get_hash_str(block_token_ids, prefix_block_key)

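In `write_cache_to_storage_decode`, key generation only applies a chained hash (`prefix_block_key`), but in multimodal scenarios `PrefixCacheManager` folds `mm_hashes` and other extra_keys into the hash via `get_block_hash_extra_keys()` (see `mm_build_path`/`mm_match_block`). Not handling extra_keys here means the storage keys that the Decode side writes for multimodal requests will not match the keys the Prefill side reads and matches, so the cache can never be hit. Suggest reusing the `get_block_hash_extra_keys` logic and maintaining `mm_idx`/`prefix_block_key` so that key generation stays consistent with `mm_build_path`.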

Suggested change:

    -for i in range(0, len(token_ids), block_size):
    -    block_token_ids = token_ids[i : i + block_size]
    -    if len(block_token_ids) < block_size:
    -        break  # Do not cache incomplete block
    -    # Calculate hash key for current block
    -    key = get_hash_str(block_token_ids, prefix_block_key)
    +# Try to reuse multimodal extra keys for hash, keeping compatibility
    +extra_keys_map = {}
    +if hasattr(self, "get_block_hash_extra_keys"):
    +    try:
    +        extra_keys = self.get_block_hash_extra_keys(request)
    +        if isinstance(extra_keys, dict):
    +            extra_keys_map = extra_keys
    +        elif isinstance(extra_keys, (list, tuple)):
    +            extra_keys_map = {idx: v for idx, v in enumerate(extra_keys)}
    +    except TypeError:
    +        # Backward compatibility: ignore extra keys if signature mismatch
    +        extra_keys_map = {}
    +for i in range(0, len(token_ids), block_size):
    +    block_token_ids = token_ids[i : i + block_size]
    +    if len(block_token_ids) < block_size:
    +        break  # Do not cache incomplete block
    +    # Calculate hash key for current block, including extra keys if any
    +    block_idx = i // block_size
    +    block_extra_keys = extra_keys_map.get(block_idx)
    +    if block_extra_keys is None:
    +        block_extra_keys = []
    +    key_prefix = prefix_block_key + list(block_extra_keys)
    +    key = get_hash_str(block_token_ids, key_prefix)

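To see why omitting extra keys breaks cache hits, consider a hypothetical block hash that mixes per-block extra keys (standing in for `mm_hashes`) into the key. The function, the placeholder token, and the hash strings below are illustrative assumptions, not FastDeploy's `get_hash_str` or `get_block_hash_extra_keys`.

```python
import hashlib
from typing import Optional, Sequence


def block_key(block_token_ids: Sequence[int], prefix_key: Optional[str],
              extra_keys: Sequence[str] = ()) -> str:
    """Hypothetical block hash: previous key + token IDs + optional extra keys
    (e.g. hashes of the multimodal content covered by this block)."""
    h = hashlib.sha256()
    h.update((prefix_key or "").encode())
    h.update(b"|")
    h.update(",".join(map(str, block_token_ids)).encode())
    for k in extra_keys:
        h.update(b"|")
        h.update(k.encode())
    return h.hexdigest()


# Two requests whose blocks hold identical placeholder tokens but different images.
tokens = [101, 0, 0, 0]  # 0 = hypothetical image placeholder token
prefill_key_img_a = block_key(tokens, None, ["mm_hash_a"])
prefill_key_img_b = block_key(tokens, None, ["mm_hash_b"])
decode_key_no_extra = block_key(tokens, None)  # Decode side omits extra keys
```

The Decode key matches neither Prefill key, so a Prefill lookup misses every block Decode wrote. The extra keys are also what keep the two images apart: without them both requests would hash to the same key despite carrying different content.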

`examples/cache_storage/run_03b_pd.sh`:

    # ======================== 清理和准备 (cleanup and preparation) ========================
    unset http_proxy && unset https_proxy
    rm -rf log_*
    find /dev/shm -type f -print0 2>/dev/null | xargs -0 rm -f 2>/dev/null || true

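Deleting every file under `/dev/shm` here can wipe shared-memory files unrelated to FastDeploy/Mooncake, which is very risky on shared machines or when other services run on the same host. Suggest restricting the cleanup to the prefixes or explicit file patterns that this script creates or uses, or providing an opt-in switch so users decide whether to run it.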

Suggested change:

    -find /dev/shm -type f -print0 2>/dev/null | xargs -0 rm -f 2>/dev/null || true
    +# Only clean shared memory files created by this service by default.
    +# Set CLEAN_ALL_DEV_SHM=1 if you really want to remove all files under /dev/shm (use with caution).
    +if [[ "${CLEAN_ALL_DEV_SHM:-0}" == "1" ]]; then
    +    find /dev/shm -type f -print0 2>/dev/null | xargs -0 rm -f 2>/dev/null || true
    +else
    +    find /dev/shm -maxdepth 1 -type f \( -name "mooncake_*" -o -name "fastdeploy_*" \) -print0 2>/dev/null | xargs -0 rm -f 2>/dev/null || true
    +fi

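For a launcher written in Python rather than bash, the same opt-in, pattern-restricted cleanup could look like the sketch below. `clean_shm` is a hypothetical helper, and the `mooncake_*`/`fastdeploy_*` patterns are taken from the suggestion, so they are assumptions about which files these services actually create.

```python
import fnmatch
import os
from typing import Iterable, List


def clean_shm(shm_dir: str, patterns: Iterable[str], clean_all: bool = False) -> List[str]:
    """Remove regular files directly under `shm_dir` whose names match any of
    `patterns`; remove every regular file only when `clean_all` is True.
    Returns the sorted names of the files that were removed."""
    pats = list(patterns)
    # Collect targets first (top level only, like `find -maxdepth 1`),
    # so the directory is not modified while it is being iterated.
    with os.scandir(shm_dir) as entries:
        targets = [
            e for e in entries
            if e.is_file(follow_symlinks=False)  # skip directories and symlinks
            and (clean_all or any(fnmatch.fnmatchcase(e.name, p) for p in pats))
        ]
    removed: List[str] = []
    for e in targets:
        try:
            os.remove(e.path)
            removed.append(e.name)
        except OSError:
            pass  # like `|| true`: ignore files that cannot be deleted
    return sorted(removed)
```

Defaulting `clean_all` to False mirrors the suggestion's design: the destructive behavior exists, but the caller has to ask for it explicitly.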

`examples/cache_storage/run_03b_pd.sh`:

    # ======================== 清理和准备 (cleanup and preparation) ========================
    unset http_proxy && unset https_proxy
    rm -rf log_*
    find /dev/shm -type f -print0 2>/dev/null | xargs -0 rm -f 2>/dev/null || true
    bash stop.sh 2>/dev/null || true
    source ./utils.sh

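The script calls `stop.sh` and `utils.sh` through relative paths (`bash stop.sh`, `source ./utils.sh`), so it only works when run from one specific working directory. However, there is no `stop.sh` in this directory (it lives in `examples/splitwise` or the `fastdeploy` directory), so the example can easily fail to run. Suggest resolving paths relative to the script's own directory instead of the current working directory, and pointing to the correct `stop.sh` location or providing the corresponding script in this directory.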

Suggested change:

    -# ======================== 清理和准备 (cleanup and preparation) ========================
    -unset http_proxy && unset https_proxy
    -rm -rf log_*
    -find /dev/shm -type f -print0 2>/dev/null | xargs -0 rm -f 2>/dev/null || true
    -bash stop.sh 2>/dev/null || true
    -source ./utils.sh
    +# Resolve relative paths based on the script's own directory
    +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
    +# ======================== 清理和准备 (cleanup and preparation) ========================
    +unset http_proxy && unset https_proxy
    +rm -rf log_*
    +find /dev/shm -type f -print0 2>/dev/null | xargs -0 rm -f 2>/dev/null || true
    +if [ -x "${SCRIPT_DIR}/stop.sh" ]; then
    +    bash "${SCRIPT_DIR}/stop.sh" 2>/dev/null || true
    +fi
    +if [ -f "${SCRIPT_DIR}/utils.sh" ]; then
    +    # shellcheck source=/dev/null
    +    source "${SCRIPT_DIR}/utils.sh"
    +else
    +    echo "utils.sh not found in ${SCRIPT_DIR}, aborting."
    +    exit 1
    +fi

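In Python, the equivalent of the suggested `SCRIPT_DIR` idiom is `Path(__file__).resolve().parent`. The hypothetical helper below mirrors the suggestion's existence checks; pass it `__file__` from the calling script. One difference from `cd "$(dirname "${BASH_SOURCE[0]}")" && pwd`: `Path.resolve()` follows symlinks, including a symlinked script file, whereas the bash idiom uses the directory of the path as invoked.

```python
from pathlib import Path
from typing import Optional


def resolve_sibling(script_path: str, name: str) -> Optional[Path]:
    """Locate `name` in the directory containing `script_path`, independent of
    the current working directory. Returns None when the file does not exist."""
    script_dir = Path(script_path).resolve().parent
    candidate = script_dir / name
    return candidate if candidate.is_file() else None
```

A caller can then abort with a clear message when `resolve_sibling(__file__, "utils.sh")` returns None, just as the suggested bash version does.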

`examples/cache_storage/run_03b_pd.sh`:

    # =============================================================================
    # ======================== 环境变量配置 (environment variable configuration) ========================
    export MODEL_NAME="/work/models/PaddlePaddle/ERNIE-4.5-0.3B-Paddle"

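`MODEL_NAME` is hard-coded to the local absolute path `/work/models/...`, which hurts portability and is inconsistent with how `run.sh` in the same directory specifies the model. Suggest using the same default as `run.sh` (for example `PaddlePaddle/ERNIE-4.5-0.3B-Paddle`), or allowing it to be overridden through an environment variable or argument and documenting that in a script comment.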

Suggested change:

    -export MODEL_NAME="/work/models/PaddlePaddle/ERNIE-4.5-0.3B-Paddle"
    +# MODEL_NAME can be overridden by environment variable before running this script
    +: "${MODEL_NAME:=PaddlePaddle/ERNIE-4.5-0.3B-Paddle}"
    +export MODEL_NAME

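In bash, `: "${MODEL_NAME:=default}"` assigns the default when `MODEL_NAME` is unset *or empty*. A Python launcher wanting the same override behavior could use something like the hypothetical helper below. A plain `os.environ.get("MODEL_NAME", default)` would behave differently, because it returns an empty string when the variable is set but empty.

```python
import os
from typing import Mapping, Optional

# Default taken from the suggested change; override via the MODEL_NAME variable.
DEFAULT_MODEL_NAME = "PaddlePaddle/ERNIE-4.5-0.3B-Paddle"


def resolve_model_name(env: Optional[Mapping[str, str]] = None) -> str:
    """Mimic bash `: "${MODEL_NAME:=default}"`: fall back to the default when
    MODEL_NAME is unset or set to an empty string."""
    env = os.environ if env is None else env
    value = env.get("MODEL_NAME", "")
    return value if value else DEFAULT_MODEL_NAME
```

Accepting an optional mapping keeps the helper testable without mutating the real process environment.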
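
`examples/cache_storage/run_03b_pd.sh`:

    # =============================================================================
    # PD 分离 + 全局 Cache 池化测试脚本 (PD disaggregation + global cache pooling test script)
    # 参考: start_v1_tp1.sh (PD 分离) + run.sh (Mooncake Cache 池化) (Reference: start_v1_tp1.sh for PD disaggregation + run.sh for Mooncake cache pooling)
    # 注意修改:PD实例的CUDA_VISIBLE_DEVICES环境变量 (Note: adjust CUDA_VISIBLE_DEVICES for the P and D instances)
    # =============================================================================
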
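This example script adds a large number of Chinese comments (for example, the file header). The `README.md` and `run.sh` in the same directory mainly use English, so to keep the script readable for a wider range of users and maintainers, suggest writing the key comments in English (keeping some Chinese where necessary is fine, but English should be primary).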

`examples/cache_storage/run_03b_pd.sh`:

    --kvcache-storage-backend mooncake \
    2>&1 > ${FD_LOG_DIR}/nohup &
    # --kvcache-storage-backend mooncake \

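The Decode instance here enables `--kvcache-storage-backend` but does not explicitly pass `--enable-prefix-caching`; it currently works only because `EngineArgs.enable_prefix_caching` defaults to `True`. Since `args_utils` validates that a storage backend requires `enable_prefix_caching=True`, a change to that default would make the example script fail outright. Suggest explicitly adding `--enable-prefix-caching` to the Decode launch command, with a comment explaining that on the Decode side it only serves to enable storage write-back.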

Suggested change:

    ---kvcache-storage-backend mooncake \
    -2>&1 > ${FD_LOG_DIR}/nohup &
    -# --kvcache-storage-backend mooncake \
    +--enable-prefix-caching \
    +--kvcache-storage-backend mooncake \
    +2>&1 > ${FD_LOG_DIR}/nohup &
    +# Note: enable_prefix_caching is required when kvcache storage backend is enabled.
    +# On Decode side, prefix caching is only used to enable storage write-back.

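The rule the reviewer relies on (a storage backend requires `enable_prefix_caching=True`) can be sketched as the check below. `EngineArgsSketch` and its error message are hypothetical; FastDeploy's `args_utils` may phrase and place this validation differently. The sketch shows why passing the flag explicitly matters: if the default of `enable_prefix_caching` ever became `False`, a Decode launch that relied on the default would fail validation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EngineArgsSketch:
    # Hypothetical subset of engine arguments; field names mirror the CLI flags.
    splitwise_role: str = "mixed"
    enable_prefix_caching: bool = True
    kvcache_storage_backend: Optional[str] = None

    def validate(self) -> None:
        # A storage backend is only usable when prefix caching is enabled.
        if self.kvcache_storage_backend and not self.enable_prefix_caching:
            raise ValueError(
                "kvcache_storage_backend requires enable_prefix_caching=True"
            )
```

With the PR applied, the decode role no longer forces `enable_prefix_caching` off, so a Decode instance that sets both flags passes this check.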
Motivation
Enable both prefill and decode instances to write KV cache to cache storage.
Modifications
- `PrefixCacheManager`
- `ResourceManager`
Usage or Command
Refer to examples/cache_storage/run_03b_pd.sh
Accuracy Tests
None
Checklist
- … [[FDConfig],[APIServer],[Engine],[Scheduler],[PD Disaggregation],[Executor],[Graph Optimization],[Speculative Decoding],[RL],[Models],[Quantization],[Loader],[OP],[KVCache],[DataProcessor],[BugFix],[Docs],[CI],[Optimization],[Feature],[Benchmark],[Others],[XPU],[HPU],[GCU],[DCU],[Iluvatar],[Metax]]
- … `pre-commit` before commit.
- … `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.