DeepSeek V3.2 NVFP4: Aggregated Round Robin vs Disaggregated KV Routing with WideEP

This GB200 NVL72 recipe for DeepSeek V3.2 demonstrates the performance difference between aggregated (round-robin) routing and disaggregated (KV-aware) routing + WideEP on a synthetic trace dataset adapted from the Mooncake FAST25 paper.

Results

dsv32_agg_vs_disagg.mov

Experiment Overview

We compare two deployment modes on 32x GB200 GPUs across 8 nodes:

| Mode | Routing | Configuration |
|------|---------|---------------|
| Aggregated | Round-robin | 4x DEP8 workers |
| Disaggregated | KV-aware | 2x prefill + 2x decode w/ WideEP (DEP8) |

Dataset: Mooncake-based Synthetic Coding Trace

The benchmark uses a trace that simulates coding workloads. We synthesize it by increasing the input sequence length and prefix reuse rate of the original Mooncake conversation trace.

To reproduce our benchmark, run Dynamo's prefix data generator tool on the Mooncake conversation_trace.jsonl:

datagen synthesize \
    --input-file conversation_trace.jsonl \
    --prefix-len-multiplier 16 \
    --prompt-len-multiplier 10 \
    --max-isl 110000 \
    --num-requests 10000
# synthesizes `conversation_trace_synth_16.00x1+10.00_speedup1_maxisl110000.jsonl`

The ISL/OSL/cache-hit statistics of our trace are below.

============================================================
  DATASET ANALYSIS: Mooncake-based Synthetic Trace
  ============================================================
  OVERVIEW
  ----------------------------------------
    Total Requests:      10,000
    Unique Hash Blocks:  430,838
    Total Hash Blocks:   770,934
  INPUT SEQUENCE LENGTH (ISL)
  ----------------------------------------
    Average:             39,186 tokens
    Maximum:             109,459 tokens
    Minimum:             12,801 tokens
  OUTPUT SEQUENCE LENGTH (OSL)
  ----------------------------------------
    Average:             344 tokens
    Maximum:             2,000 tokens
    Minimum:             1 tokens
  KV CACHE / PREFIX REUSE
  ----------------------------------------
    Block-level Hit Rate: 44.1%
    Token-level Hit Rate: 44.0%
    Avg Context (shared): 22,400 tokens/req
    Avg Unique Prompt:    16,786 tokens/req
    Shared Prefix Ratio:  57.2%
  ============================================================

  Summary:
  • ~44% KV cache hit rate (block/token level) based on hash_id overlap across requests
  • ~57% of input tokens come from shared context prefixes
  • Long-context workload: avg 39K input tokens, up to 109K max

Prerequisites

  1. Dynamo Platform installed - See Kubernetes Deployment Guide
  2. 32x GB200 GPUs across 8 nodes
  3. HuggingFace token configured:
    export NAMESPACE=your-namespace
    kubectl create secret generic hf-token-secret \
      --from-literal=HF_TOKEN="your-token" \
      -n ${NAMESPACE}
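
A quick sanity check that the secret was created in the right namespace (hf-token-secret matches the command above):

kubectl get secret hf-token-secret -n ${NAMESPACE}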

Quick Start

1. Create Storage

Note: Edit model-cache/model-cache.yaml first and update storageClassName to match your cluster (run kubectl get storageclass to find available options).

kubectl apply -f model-cache/model-cache.yaml -n ${NAMESPACE}
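
Before moving on, it can help to confirm the claim bound successfully; if it stays Pending, re-check the storageClassName. (We list all PVCs here rather than assume the claim's name.)

# Expect STATUS "Bound" for the model cache claim
kubectl get pvc -n ${NAMESPACE}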

2. Configure Kubernetes Benchmarking Environment

For multinode Kubernetes deployments, your cluster may require a ComputeDomain in your namespace so that the DRA scheduler can co-locate worker pods on MNNVL-connected nodes. (Otherwise, internode GPU peer memory access would fail.)

kubectl apply -f model-cache/compute-domain.yaml -n ${NAMESPACE}

If you rename the ComputeDomain in this file, mirror that change in the deployment YAMLs under extraPodSpec.resourceClaims and mainContainer.resources.claims.
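
To verify the ComputeDomain object exists before deploying workers (a minimal check; the computedomains resource name comes from NVIDIA's DRA driver and may differ by driver version):

kubectl get computedomains -n ${NAMESPACE}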

3. Setup Model and Data

We use NVIDIA's official NVFP4-quantized checkpoint (Hugging Face). Copy it into the PVC storage:

kubectl apply -f model-cache/model-download.yaml -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=600s

Similarly, copy the trace file for the benchmark into the PVC:

# conversation_trace_synth_16.00x1+10.00_speedup1_maxisl110000.jsonl in our case
kubectl cp <local_trace.jsonl> ${NAMESPACE}/<helper-pod>:/model-cache/traces/
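
To confirm the trace landed where the benchmark expects it:

kubectl exec -n ${NAMESPACE} <helper-pod> -- ls -la /model-cache/traces/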

4. Deploy & Benchmark

Option A: Aggregated (Round-Robin Baseline)

# Deploy
kubectl apply -f trtllm/agg-round-robin/deploy.yaml -n ${NAMESPACE}

# Wait for ready
kubectl wait --for=condition=ready pod -l nvidia.com/dynamo-graph-deployment-name=agg-round-robin-dsv32-nvfp4 \
  -n ${NAMESPACE} --timeout=1200s

# Run benchmark
kubectl apply -f trtllm/agg-round-robin/perf.yaml -n ${NAMESPACE}
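
While the kubectl wait above blocks, you can watch worker pods come up in a second terminal; the same pattern applies to Option B with its own label value:

# Same label selector as the wait command above
kubectl get pods -l nvidia.com/dynamo-graph-deployment-name=agg-round-robin-dsv32-nvfp4 \
  -n ${NAMESPACE} -w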

Option B: Disaggregated (KV-Aware Routing)

# Deploy
kubectl apply -f trtllm/disagg-kv-router/deploy.yaml -n ${NAMESPACE}

# Wait for ready
kubectl wait --for=condition=ready pod -l nvidia.com/dynamo-graph-deployment-name=disagg-kv-dsv32-nvfp4 \
  -n ${NAMESPACE} --timeout=1200s

# Run benchmark
kubectl apply -f trtllm/disagg-kv-router/perf.yaml -n ${NAMESPACE}

5. Monitor Benchmark Progress

The benchmark runs inside a tmux session for easy monitoring:

# Find the benchmark pod
kubectl get pods -n ${NAMESPACE} | grep benchmark

# Attach to the tmux session to see intermediate results
kubectl exec -it -n ${NAMESPACE} <benchmark-pod-name> -- tmux a -t benchmark

# Detach from tmux: Ctrl+B, then D
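
If you prefer not to attach to tmux, you can also follow the pod's logs, though whether intermediate results appear there depends on how the benchmark writes its output:

kubectl logs -f <benchmark-pod-name> -n ${NAMESPACE}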

6. View Results

Results are saved to the perf-cache PVC:

# Check artifact directory
kubectl exec -it -n ${NAMESPACE} <benchmark-pod-name> -- ls -la /perf-cache/artifacts/

# Copy results to local machine
kubectl cp ${NAMESPACE}/<benchmark-pod-name>:/perf-cache/artifacts ./benchmark-results

Expected Results

Since the benchmark uses --fixed-schedule (replaying requests at their original timestamps), throughput is fixed by the trace; the latency metrics below are what we compare:

| Metric | Why It Matters |
|--------|----------------|
| TTFT (Time to First Token) | KV-aware routing reduces prefill compute via prefix cache hits |
| ITL (Inter-Token Latency) | Disaggregated serving isolates decode from prefill interference |
| Total Request Latency | Combined benefit of both optimizations |

For production contexts, we can further evaluate the deployments with goodput, i.e., the rate of requests that satisfy a predetermined service-level agreement (SLA). For our experiments, we set the SLA at TTFT = 20s and ITL = 50ms.
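
As a rough illustration, goodput can be computed from per-request metrics after the fact. The sketch below assumes the artifacts include a JSONL file with per-request ttft_ms and itl_ms fields; both the filename and field names are hypothetical, so adapt them to your artifact format:

# Fraction of requests meeting TTFT <= 20s and ITL <= 50ms
# (per_request_metrics.jsonl, ttft_ms, and itl_ms are assumed names)
jq -s '(map(select(.ttft_ms <= 20000 and .itl_ms <= 50)) | length) / length' \
  per_request_metrics.jsonl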

Cleanup

# Delete benchmark pods
kubectl delete job agg-round-robin-dsv32-nvfp4-bench disagg-kv-dsv32-nvfp4-bench -n ${NAMESPACE}

# Delete deployments
kubectl delete dynamographdeployment agg-round-robin-dsv32-nvfp4 -n ${NAMESPACE}
kubectl delete dynamographdeployment disagg-kv-dsv32-nvfp4 -n ${NAMESPACE}
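
If you also want to reclaim storage, delete the PVCs as well. perf-cache is named in the results step above; model-cache is assumed from the manifest and mount path, so confirm both names with kubectl get pvc first:

# Removes cached model weights and benchmark artifacts
kubectl delete pvc model-cache perf-cache -n ${NAMESPACE}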

References