PaddlePaddle · juncaipeng · Mar 10, 2026 · Mar 11, 2026 · Mar 12, 2026 · Mar 12, 2026
diff --git a/README_CN.md b/README_CN.md
@@ -86,6 +86,7 @@ FastDeploy supports **NVIDIA GPU**, **Kunlunxin XPU
 - [Prefix Caching](./docs/zh/features/prefix_caching.md)
 - [Chunked Prefill](./docs/zh/features/chunked_prefill.md)
 - [Load-Balancing Scheduling Router](./docs/zh/online_serving/router.md)
+- [Global Cache Pooling](./docs/zh/features/global_cache_pooling.md)
 
 ## Acknowledgement
 

diff --git a/README_EN.md b/README_EN.md
@@ -84,6 +84,7 @@ Learn how to download models, enable using the torch format, and more:
 - [Prefix Caching](./docs/features/prefix_caching.md)
 - [Chunked Prefill](./docs/features/chunked_prefill.md)
 - [Load-Balancing Scheduling Router](./docs/online_serving/router.md)
+- [Global Cache Pooling](./docs/features/global_cache_pooling.md)
 
 ## Acknowledgement
 

diff --git a/docs/features/global_cache_pooling.md b/docs/features/global_cache_pooling.md
@@ -0,0 +1,368 @@
+[中文文档](../../zh/features/global_cache_pooling.md) | English
+
+# Global Cache Pooling
+
+This document describes how to use MooncakeStore as the KV Cache storage backend for FastDeploy, enabling **Global Cache Pooling** across multiple inference instances.
+
+## Overview
+
+### What is Global Cache Pooling?
+
+Global Cache Pooling allows multiple FastDeploy instances to share KV Cache through a distributed storage layer. This enables:
+
+- **Cross-instance cache reuse**: KV Cache computed by one instance can be reused by another
+- **PD Disaggregation optimization**: Prefill and Decode instances can share cache seamlessly
+- **Reduced computation**: Avoid redundant prefix computation across requests
+
+### Architecture
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│                     Mooncake Master Server                       │
+│              (Metadata & Coordination Service)                   │
+└────────────────────────────┬────────────────────────────────────┘
+                             │
+         ┌───────────────────┼───────────────────┐
+         │                   │                   │
+         ▼                   ▼                   ▼
+┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
+│  FastDeploy     │ │  FastDeploy     │ │  FastDeploy     │
+│  Instance P     │ │  Instance D     │ │  Instance X     │
+│  (Prefill)      │ │  (Decode)       │ │  (Standalone)   │
+└────────┬────────┘ └────────┬────────┘ └────────┬────────┘
+         │                   │                   │
+         └───────────────────┼───────────────────┘
+                             │
+                    ┌────────▼────────┐
+                    │  MooncakeStore  │
+                    │  (Shared KV     │
+                    │   Cache Pool)   │
+                    └─────────────────┘
+```
+
+## Example Scripts
+
+Ready-to-use example scripts are available in [examples/cache_storage/](../../../examples/cache_storage/).
+
+| Script | Scenario | Description |
+|--------|----------|-------------|
+| `run.sh` | Multi-Instance | Two standalone instances sharing cache |
+| `run_03b_pd_storage.sh` | PD Disaggregation | P+D instances with global cache pooling |
+
+## Prerequisites
+
+### Hardware Requirements
+
+- NVIDIA GPU with CUDA support
+- RDMA network (recommended for production) or TCP
+
+### Software Requirements
+
+- Python 3.8+
+- CUDA 11.8+
+- FastDeploy (see installation below)
+
+## Installation
+
+Refer to [NVIDIA CUDA GPU Installation](https://paddlepaddle.github.io/FastDeploy/get_started/installation/nvidia_gpu/) for FastDeploy installation.
+
+```bash
+# Option 1: Install from PyPI
+pip install fastdeploy-gpu
+
+# Option 2: Build from source
+bash build.sh
+pip install ./dist/fastdeploy*.whl
+```
+
+MooncakeStore is automatically installed when you install FastDeploy.
+
+## Configuration
+
+MooncakeStore can be configured in two ways: through a `mooncake_config.json` configuration file or through environment variables.
+
+### Mooncake Configuration File
+
+Create a `mooncake_config.json` file:
+
+```json
+{
+    "metadata_server": "http://0.0.0.0:15002/metadata",
+    "master_server_addr": "0.0.0.0:15001",
+    "global_segment_size": 1000000000,
+    "local_buffer_size": 134217728,
+    "protocol": "rdma",
+    "rdma_devices": ""
+}
+```
+
+Set the `MOONCAKE_CONFIG_PATH` environment variable to enable the configuration:
+
+```bash
+export MOONCAKE_CONFIG_PATH=path/to/mooncake_config.json
+```
+
+Configuration parameters:
+
+| Parameter | Description | Default |
+|-----------|-------------|---------|
+| `metadata_server` | HTTP metadata server URL | Required |
+| `master_server_addr` | Master server address | Required |
+| `global_segment_size` | Memory that each TP process contributes to the global shared pool (bytes) | 1GB |
+| `local_buffer_size` | Local buffer size for data transfer (bytes) | 128MB |
+| `protocol` | Transfer protocol: `rdma` or `tcp` | `rdma` |
+| `rdma_devices` | RDMA device names (comma-separated) | Auto-detect |
+
+### Environment Variables
+
+Mooncake can also be configured via environment variables:
+
+| Variable | Description |
+|----------|-------------|
+| `MOONCAKE_MASTER_SERVER_ADDR` | Master server address (e.g., `10.0.0.1:15001`) |
+| `MOONCAKE_METADATA_SERVER` | Metadata server URL |
+| `MOONCAKE_GLOBAL_SEGMENT_SIZE` | Memory that each TP process contributes to the global shared pool (bytes) |
+| `MOONCAKE_LOCAL_BUFFER_SIZE` | Local buffer size (bytes) |
+| `MOONCAKE_PROTOCOL` | Transfer protocol (`rdma` or `tcp`) |
+| `MOONCAKE_RDMA_DEVICES` | RDMA device names |
+
+## Usage Scenarios
+
+### Scenario 1: Multi-Instance Cache Sharing
+
+Run multiple FastDeploy instances sharing a global KV Cache pool.
+
+**Step 1: Start Mooncake Master**
+
+```bash
+mooncake_master \
+    --port=15001 \
+    --enable_http_metadata_server=true \
+    --http_metadata_server_host=0.0.0.0 \
+    --http_metadata_server_port=15002 \
+    --metrics_port=15003
+```
+
+**Step 2: Start FastDeploy Instances**
+
+Instance 0:
+```bash
+export MOONCAKE_CONFIG_PATH="./mooncake_config.json"
+export CUDA_VISIBLE_DEVICES=0
+
+python -m fastdeploy.entrypoints.openai.api_server \
+       --model ${MODEL_NAME} \
+       --port 52700 \
+       --max-model-len 32768 \
+       --max-num-seqs 32 \
+       --kvcache-storage-backend mooncake
+```
+
+Instance 1:
+```bash
+export MOONCAKE_CONFIG_PATH="./mooncake_config.json"
+export CUDA_VISIBLE_DEVICES=1
+
+python -m fastdeploy.entrypoints.openai.api_server \
+       --model ${MODEL_NAME} \
+       --port 52800 \
+       --max-model-len 32768 \
+       --max-num-seqs 32 \
+       --kvcache-storage-backend mooncake
+```
+
+**Step 3: Test Cache Reuse**
+
+Send the same prompt to both instances. The second instance should reuse the KV Cache computed by the first instance.
+
+```bash
+# Request to Instance 0
+curl -X POST "http://0.0.0.0:52700/v1/chat/completions" \
+  -H "Content-Type: application/json" \
+  -d '{"messages": [{"role": "user", "content": "Hello, world!"}], "max_tokens": 50}'
+
+# Request to Instance 1 (should hit cached KV)
+curl -X POST "http://0.0.0.0:52800/v1/chat/completions" \
+  -H "Content-Type: application/json" \
+  -d '{"messages": [{"role": "user", "content": "Hello, world!"}], "max_tokens": 50}'
+```
+
+### Scenario 2: PD Disaggregation with Global Cache
+
+This scenario combines **PD Disaggregation** with **Global Cache Pooling**, enabling:
+
+- Prefill instances to read Decode instances' output cache
+- Faster multi-turn conversations, because a later turn can reuse the cache written during earlier turns
+
+**Architecture:**
+
+```
+         ┌──────────────────────────────────────────┐
+         │              Router                       │
+         │         (Load Balancer)                   │
+         └─────────────────┬────────────────────────┘
+                           │
+           ┌───────────────┴───────────────┐
+           │                               │
+           ▼                               ▼
+    ┌─────────────┐                 ┌─────────────┐
+    │   Prefill   │                 │   Decode    │
+    │  Instance   │◄───────────────►│  Instance   │
+    │             │   KV Transfer   │             │
+    └──────┬──────┘                 └──────┬──────┘
+           │                               │
+           └───────────────┬───────────────┘
+                           │
+                  ┌────────▼────────┐
+                  │  MooncakeStore  │
+                  │  (Global Cache) │
+                  └─────────────────┘
+```
+
+**Step 1: Start Mooncake Master**
+
+```bash
+mooncake_master \
+    --port=15001 \
+    --enable_http_metadata_server=true \
+    --http_metadata_server_host=0.0.0.0 \
+    --http_metadata_server_port=15002
+```
+
+**Step 2: Start Router**
+
+```bash
+python -m fastdeploy.router.launch \
+    --port 52700 \
+    --splitwise
+```
+
+**Step 3: Start Prefill Instance**
+
+```bash
+export MOONCAKE_MASTER_SERVER_ADDR="127.0.0.1:15001"
+export MOONCAKE_METADATA_SERVER="http://127.0.0.1:15002/metadata"
+export MOONCAKE_PROTOCOL="rdma"
+export CUDA_VISIBLE_DEVICES=0
+
+python -m fastdeploy.entrypoints.openai.api_server \
+    --model ${MODEL_NAME} \
+    --port 52400 \
+    --max-model-len 32768 \
+    --max-num-seqs 32 \
+    --splitwise-role prefill \
+    --cache-transfer-protocol rdma \
+    --router "0.0.0.0:52700" \
+    --kvcache-storage-backend mooncake
+```
+
+**Step 4: Start Decode Instance**
+
+```bash
+export MOONCAKE_MASTER_SERVER_ADDR="127.0.0.1:15001"
+export MOONCAKE_METADATA_SERVER="http://127.0.0.1:15002/metadata"
+export MOONCAKE_PROTOCOL="rdma"
+export CUDA_VISIBLE_DEVICES=1
+
+python -m fastdeploy.entrypoints.openai.api_server \
+    --model ${MODEL_NAME} \
+    --port 52500 \
+    --max-model-len 32768 \
+    --max-num-seqs 32 \
+    --splitwise-role decode \
+    --cache-transfer-protocol rdma \
+    --router "0.0.0.0:52700" \
+    --enable-output-caching \
+    --kvcache-storage-backend mooncake
+```
+
+**Step 5: Send Requests via Router**
+
+```bash
+curl -X POST "http://0.0.0.0:52700/v1/chat/completions" \
+  -H "Content-Type: application/json" \
+  -d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 50}'
+```
+
+## FastDeploy Parameters for Mooncake
+
+| Parameter | Description |
+|-----------|-------------|
+| `--kvcache-storage-backend mooncake` | Enable Mooncake as KV Cache storage backend |
+| `--enable-output-caching` | Enable output token caching (recommended for Decode instances) |
+| `--cache-transfer-protocol rdma` | Use RDMA for KV transfer between P and D |
+| `--splitwise-role prefill/decode` | Set instance role in PD disaggregation |
+| `--router` | Router address for PD disaggregation |
+
+## Verification
+
+### Check Cache Hit
+
+To verify cache hits in the logs:
+
+```bash
+# For multi-instance scenario
+grep -E "storage_cache_token_num" log_*/api_server.log
+
+# For PD disaggregation scenario
+grep -E "storage_cache_token_num" log_prefill/api_server.log
+```
+
+If `storage_cache_token_num > 0`, the instance successfully read cached KV blocks from the global pool.
+
+### Monitor Mooncake Master
+
+```bash
+# Check master status
+curl http://localhost:15002/metadata
+
+# Check metrics (if metrics_port is configured)
+curl http://localhost:15003/metrics
+```
+
+## Troubleshooting
+
+### Common Issues
+
+**1. Port Already in Use**
+
+```bash
+# Check port usage
+ss -ltn | grep 15001
+
+# Kill existing process
+kill -9 $(lsof -t -i:15001)
+```
+
+**2. RDMA Connection Failed**
+
+- Verify RDMA devices: `ibv_devices`
+- Check RDMA network: `ibv_devinfo`
+- Fall back to TCP: set `MOONCAKE_PROTOCOL=tcp`
+
+**3. Cache Not Being Shared**
+
+- Verify that all instances connect to the same Mooncake master
+- Check that the metadata server URL is identical on every instance
+- Verify that `global_segment_size` is large enough
+
+**4. Permission Denied on /dev/shm**
+
+```bash
+# Clean up stale shared memory files (this deletes every file under /dev/shm, so stop other processes that use shared memory first)
+find /dev/shm -type f -print0 | xargs -0 rm -f
+```
+
+### Debug Mode
+
+Enable debug logging:
+
+```bash
+export FD_DEBUG=1
+```
+
+## More Resources
+
+- [Mooncake Official Documentation](https://github.com/kvcache-ai/Mooncake)
+- [Mooncake Troubleshooting Guide](https://github.com/kvcache-ai/Mooncake/blob/main/docs/source/troubleshooting/troubleshooting.md)
+- [FastDeploy Documentation](https://paddlepaddle.github.io/FastDeploy/)
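
A note on the Configuration section of `docs/features/global_cache_pooling.md` above: every instance must point at the same master and metadata server, so it may help to generate `mooncake_config.json` from one place rather than edit it by hand on each machine. The sketch below is only an illustration under stated assumptions. The helper names `build_mooncake_config` and `write_config` are hypothetical and not part of FastDeploy or Mooncake; the keys, ports, and sizes are copied from the example configuration in the diff, and the validation rules simply mirror the "Required" and `rdma`/`tcp` entries in the parameter table.

```python
import json

# Keys marked "Required" in the parameter table of the new document.
REQUIRED_KEYS = ("metadata_server", "master_server_addr")
# Transfer protocols listed in the parameter table.
ALLOWED_PROTOCOLS = ("rdma", "tcp")


def build_mooncake_config(master_host, master_port=15001, metadata_port=15002,
                          protocol="rdma", rdma_devices="",
                          global_segment_size=1_000_000_000,
                          local_buffer_size=128 * 1024 * 1024):
    """Hypothetical helper: return a dict with the documented config keys.

    Default ports and sizes are the values used in the example
    mooncake_config.json; they are not read from FastDeploy itself.
    """
    if protocol not in ALLOWED_PROTOCOLS:
        raise ValueError(f"protocol must be one of {ALLOWED_PROTOCOLS}, got {protocol!r}")
    return {
        "metadata_server": f"http://{master_host}:{metadata_port}/metadata",
        "master_server_addr": f"{master_host}:{master_port}",
        "global_segment_size": global_segment_size,
        "local_buffer_size": local_buffer_size,
        "protocol": protocol,
        "rdma_devices": rdma_devices,
    }


def write_config(config, path):
    """Hypothetical helper: check the required keys, then write the JSON file."""
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=4)


if __name__ == "__main__":
    # Print a TCP configuration for a master assumed to run on 10.0.0.1.
    print(json.dumps(build_mooncake_config("10.0.0.1", protocol="tcp"), indent=4))
```

Calling `write_config(build_mooncake_config("10.0.0.1"), "mooncake_config.json")` and then exporting `MOONCAKE_CONFIG_PATH` as described in the document would give each instance the same settings. Whether FastDeploy rejects unknown protocols or missing keys in the same way is not stated in the diff, so the checks here are a convenience rather than a description of its behavior.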