This GB200 NVL72 recipe for DeepSeek V3.2 demonstrates the performance difference between aggregated (round-robin) routing and disaggregated (KV-aware) routing + WideEP on a synthetic trace dataset adapted from the Mooncake FAST25 paper.
We compare two deployment modes on 32x GB200 GPUs across 8 nodes:
| Mode | Routing | Configuration |
|---|---|---|
| Aggregated | Round-robin | 4x DEP8 workers |
| Disaggregated | KV-aware | 2x prefill + 2x decode w/ WideEP (DEP8) |
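To make the routing difference concrete, here is a toy Python sketch of the two policies. It is not Dynamo's router: the class names, the prefix-overlap scoring, and the load tie-break are illustrative assumptions, and each request is represented only by its list of KV block hash IDs.

```python
from itertools import cycle


class RoundRobinRouter:
    """Hands requests to workers in a fixed rotation, ignoring cache contents."""

    def __init__(self, workers):
        self._rotation = cycle(workers)

    def route(self, hash_ids):
        return next(self._rotation)


class PrefixAwareRouter:
    """Toy KV-aware policy (hypothetical): prefer the worker that already
    holds the longest prefix of the request's blocks, then the least loaded."""

    def __init__(self, workers):
        self._cached = {w: set() for w in workers}
        self._load = {w: 0 for w in workers}

    def _prefix_overlap(self, worker, hash_ids):
        # Count leading blocks already cached on this worker.
        count = 0
        for block in hash_ids:
            if block not in self._cached[worker]:
                break
            count += 1
        return count

    def route(self, hash_ids):
        best = max(
            self._cached,
            key=lambda w: (self._prefix_overlap(w, hash_ids), -self._load[w]),
        )
        # Assume the chosen worker now caches every block of this request.
        self._cached[best].update(hash_ids)
        self._load[best] += 1
        return best
```

With a shared prefix, the prefix-aware policy sends the request back to the worker that already holds those blocks, which is the effect the TTFT comparison is meant to capture. Round-robin spreads the same requests across all workers regardless of what each has cached.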
The benchmark uses a trace which simulates coding workloads. We synthesize the trace by increasing the input sequence length and prefix reuse rate of the original Mooncake conversation trace.
To reproduce our benchmark, run Dynamo's prefix data generator tool on the Mooncake conversation_trace.jsonl:
datagen synthesize \
--input-file conversation_trace.jsonl \
--prefix-len-multiplier 16 \
--prompt-len-multiplier 10 \
--max-isl 110000 \
--num-requests 10000
# synthesizes `conversation_trace_synth_16.00x1+10.00_speedup1_maxisl110000.jsonl`
The ISL, OSL, and cache hit statistics of our trace are below.
Dataset statistics: Mooncake-based Synthetic Trace
============================================================
DATASET ANALYSIS: Mooncake-based Synthetic Trace
============================================================
OVERVIEW
----------------------------------------
Total Requests: 10,000
Unique Hash Blocks: 430,838
Total Hash Blocks: 770,934
INPUT SEQUENCE LENGTH (ISL)
----------------------------------------
Average: 39,186 tokens
Maximum: 109,459 tokens
Minimum: 12,801 tokens
OUTPUT SEQUENCE LENGTH (OSL)
----------------------------------------
Average: 344 tokens
Maximum: 2,000 tokens
Minimum: 1 tokens
KV CACHE / PREFIX REUSE
----------------------------------------
Block-level Hit Rate: 44.1%
Token-level Hit Rate: 44.0%
Avg Context (shared): 22,400 tokens/req
Avg Unique Prompt: 16,786 tokens/req
Shared Prefix Ratio: 57.2%
============================================================
Summary:
• ~44% KV cache hit rate (block/token level) based on hash_id overlap across requests
• ~57% of input tokens come from shared context prefixes
• Long-context workload: avg 39K input tokens, up to 109K max
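The reuse figures above can be checked with a little arithmetic. The sketch below uses one definition that is consistent with the reported numbers (a block counts as a hit if its hash ID has appeared earlier in the trace); whether the analysis tool computes it exactly this way is an assumption, and the function name is hypothetical.

```python
def block_hit_rate(requests):
    """Fraction of blocks whose hash ID was already seen earlier in the trace.

    `requests` is an iterable of hash-ID lists, one list per request.
    Equivalent to 1 - unique_blocks / total_blocks.
    """
    seen = set()
    total = 0
    hits = 0
    for hash_ids in requests:
        for block in hash_ids:
            total += 1
            if block in seen:
                hits += 1
            else:
                seen.add(block)
    return hits / total if total else 0.0


# Figures copied from the dataset statistics above.
unique_blocks, total_blocks = 430_838, 770_934
avg_shared, avg_unique = 22_400, 16_786

reported_block_hit = 1 - unique_blocks / total_blocks
reported_shared_ratio = avg_shared / (avg_shared + avg_unique)

print(f"block-level hit rate: {reported_block_hit:.1%}")
print(f"shared prefix ratio:  {reported_shared_ratio:.1%}")
```

The two averages also sum to the reported mean ISL (22,400 + 16,786 = 39,186 tokens).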
- Dynamo Platform installed - See Kubernetes Deployment Guide
- 32x GB200 GPUs across 8 nodes
- HuggingFace token configured:
export NAMESPACE=your-namespace
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN="your-token" \
-n ${NAMESPACE}
Note: Edit `model-cache/model-cache.yaml` first and update `storageClassName` to match your cluster (run `kubectl get storageclass` to find available options).
kubectl apply -f model-cache/model-cache.yaml -n ${NAMESPACE}
For multi-node Kubernetes deployments, your cluster may require a ComputeDomain in your namespace so that the DRA scheduler can co-locate worker pods on MNNVL-connected nodes. Without it, inter-node GPU peer memory access would fail.
kubectl apply -f model-cache/compute-domain.yaml -n ${NAMESPACE}
If you change any names in this file, apply the same changes to the deployment YAMLs under extraPodSpec.resourceClaims and mainContainer.resources.claims.
We use NVIDIA's official NVFP4-quantized checkpoint (Huggingface). Copy it into the PVC storage:
kubectl apply -f model-cache/model-download.yaml -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=600s
Similarly, copy the trace file for the benchmark into the PVC:
# conversation_trace_synth_16.00x1+10.00_speedup1_maxisl110000.jsonl in our case
kubectl cp <local_trace.jsonl> ${NAMESPACE}/<helper-pod>:/model-cache/traces/
Option A: Aggregated (Round-Robin Baseline)
# Deploy
kubectl apply -f trtllm/agg-round-robin/deploy.yaml -n ${NAMESPACE}
# Wait for ready
kubectl wait --for=condition=ready pod -l nvidia.com/dynamo-graph-deployment-name=agg-round-robin-dsv32-nvfp4 \
-n ${NAMESPACE} --timeout=1200s
# Run benchmark
kubectl apply -f trtllm/agg-round-robin/perf.yaml -n ${NAMESPACE}
Option B: Disaggregated (KV-Aware Routing)
# Deploy
kubectl apply -f trtllm/disagg-kv-router/deploy.yaml -n ${NAMESPACE}
# Wait for ready
kubectl wait --for=condition=ready pod -l nvidia.com/dynamo-graph-deployment-name=disagg-kv-dsv32-nvfp4 \
-n ${NAMESPACE} --timeout=1200s
# Run benchmark
kubectl apply -f trtllm/disagg-kv-router/perf.yaml -n ${NAMESPACE}
The benchmark runs inside a tmux session for easy monitoring:
# Find the benchmark pod
kubectl get pods -n ${NAMESPACE} | grep benchmark
# Attach to the tmux session to see intermediate results
kubectl exec -it -n ${NAMESPACE} <benchmark-pod-name> -- tmux a -t benchmark
# Detach from tmux: Ctrl+B, then D
Results are saved to the perf-cache PVC:
# Check artifact directory
kubectl exec -it -n ${NAMESPACE} <benchmark-pod-name> -- ls -la /perf-cache/artifacts/
# Copy results to local machine
kubectl cp ${NAMESPACE}/<benchmark-pod-name>:/perf-cache/artifacts ./benchmark-results
Since the benchmark uses --fixed-schedule (replaying requests at their original timestamps), throughput is fixed by the trace, so latency metrics are what we compare:
| Metric | Why It Matters |
|---|---|
| TTFT (Time to First Token) | KV-aware routing reduces prefill compute via prefix cache hits |
| ITL (Inter-Token Latency) | Disaggregated serving isolates decode from prefill interference |
| Total Request Latency | Combined benefit of both optimizations |
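Because a fixed schedule replays requests at their trace timestamps, the offered request rate depends only on the trace and is identical for both deployments. The sketch below computes that rate from a list of arrival timestamps. It assumes timestamps are in milliseconds, as in the Mooncake trace format; the function name is hypothetical.

```python
def offered_request_rate(timestamps_ms):
    """Mean arrival rate (requests/s) implied by a list of trace timestamps.

    Uses the number of inter-arrival gaps divided by the time between the
    first and last request. Assumes millisecond timestamps.
    """
    ordered = sorted(timestamps_ms)
    if len(ordered) < 2:
        raise ValueError("need at least two timestamps")
    span_s = (ordered[-1] - ordered[0]) / 1000.0
    if span_s <= 0:
        raise ValueError("timestamps must span a positive interval")
    return (len(ordered) - 1) / span_s


# Five requests arriving one second apart give a rate of 1 request/s.
print(offered_request_rate([0, 1000, 2000, 3000, 4000]))
```

Whatever the deployment mode, a server that keeps up with this schedule completes requests at the same rate, which is why the comparison focuses on latency.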
For production contexts, we can further evaluate the deployments with goodput, i.e. the rate of requests that satisfy a predetermined service level agreement (SLA). For our experiments, we set the SLA thresholds to TTFT=20s and ITL=50ms.
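As an illustration of the goodput definition, the following sketch derives TTFT and mean ITL from per-token arrival times and counts the requests that meet both thresholds. The `RequestTiming` record, the use of a per-request mean ITL, and the inclusive (less than or equal) comparison are assumptions made for this example, not the benchmark tool's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Hypothetical per-request record: send time and token arrival times (s)."""
    sent_s: float
    token_times_s: list


def ttft_s(req):
    # Time from sending the request to receiving the first output token.
    return req.token_times_s[0] - req.sent_s


def mean_itl_s(req):
    # Average gap between consecutive output tokens; 0 if only one token.
    times = req.token_times_s
    if len(times) < 2:
        return 0.0
    return (times[-1] - times[0]) / (len(times) - 1)


def goodput(requests, duration_s, ttft_sla_s=20.0, itl_sla_s=0.050):
    """Requests per second that satisfy both SLA thresholds."""
    good = sum(
        1
        for r in requests
        if r.token_times_s
        and ttft_s(r) <= ttft_sla_s
        and mean_itl_s(r) <= itl_sla_s
    )
    return good / duration_s


sample = [
    RequestTiming(0.0, [1.0, 1.04, 1.08]),   # meets both thresholds
    RequestTiming(0.0, [25.0, 25.01]),       # TTFT too high
    RequestTiming(0.0, [1.0, 1.1, 1.2]),     # ITL too high
]
print(goodput(sample, duration_s=10.0))
```

Only the first sample request counts toward goodput, so a deployment can post the same throughput as another and still score lower on this metric.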
# Delete benchmark pods
kubectl delete job agg-round-robin-dsv32-nvfp4-bench disagg-kv-dsv32-nvfp4-bench -n ${NAMESPACE}
# Delete deployments
kubectl delete dynamographdeployment agg-round-robin-dsv32-nvfp4 -n ${NAMESPACE}
kubectl delete dynamographdeployment disagg-kv-dsv32-nvfp4 -n ${NAMESPACE}
- Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving - FAST25 paper and trace data
- Optimizing DeepSeek-V3.2 on NVIDIA Blackwell GPUs - TRTLLM tech blog on available optimizations for DSV3.2 on GB200