DeepSeek V3.2 NVFP4: Aggregated Round Robin vs Disaggregated KV Routing with WideEP

This GB200 NVL72 recipe for DeepSeek V3.2 demonstrates the performance difference between aggregated (round-robin) routing and disaggregated (KV-aware) routing + WideEP on a synthetic trace dataset adapted from the Mooncake FAST25 paper.

Results

dsv32_agg_vs_disagg.mov

Experiment Overview

We compare two deployment modes on 32x GB200 GPUs across 8 nodes:

| Mode | Routing | Configuration |
|------|---------|---------------|
| Aggregated | Round-robin | 4x DEP8 workers |
| Disaggregated | KV-aware | 2x prefill + 2x decode w/ WideEP (DEP8) |

Dataset: Mooncake-based Synthetic Coding Trace

The benchmark uses a trace that simulates coding workloads. We synthesize it by increasing the input sequence length and prefix reuse rate of the original Mooncake conversation trace.

To reproduce our benchmark, run Dynamo's prefix data generator tool on the Mooncake conversation_trace.jsonl:

datagen synthesize \
    --input-file conversation_trace.jsonl \
    --prefix-len-multiplier 16 \
    --prompt-len-multiplier 10 \
    --max-isl 110000 \
    --num-requests 10000
# synthesizes `conversation_trace_synth_16.00x1+10.00_speedup1_maxisl110000.jsonl`

The ISL/OSL/cache-hit statistics of our trace are below.

============================================================
  DATASET ANALYSIS: Mooncake-based Synthetic Trace
  ============================================================
  OVERVIEW
  ----------------------------------------
    Total Requests:      10,000
    Unique Hash Blocks:  430,838
    Total Hash Blocks:   770,934
  INPUT SEQUENCE LENGTH (ISL)
  ----------------------------------------
    Average:             39,186 tokens
    Maximum:             109,459 tokens
    Minimum:             12,801 tokens
  OUTPUT SEQUENCE LENGTH (OSL)
  ----------------------------------------
    Average:             344 tokens
    Maximum:             2,000 tokens
    Minimum:             1 tokens
  KV CACHE / PREFIX REUSE
  ----------------------------------------
    Block-level Hit Rate: 44.1%
    Token-level Hit Rate: 44.0%
    Avg Context (shared): 22,400 tokens/req
    Avg Unique Prompt:    16,786 tokens/req
    Shared Prefix Ratio:  57.2%
  ============================================================

  Summary:
  • ~44% KV cache hit rate (block/token level) based on hash_id overlap across requests
  • ~57% of input tokens come from shared context prefixes
  • Long-context workload: avg 39K input tokens, up to 109K max

Prerequisites

  1. Dynamo Platform installed - See Kubernetes Deployment Guide
  2. 32x GB200 GPUs across 8 nodes
  3. HuggingFace token configured:
    export NAMESPACE=your-namespace
    kubectl create secret generic hf-token-secret \
      --from-literal=HF_TOKEN="your-token" \
      -n ${NAMESPACE}
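
A quick sanity check that the secret was created in the right namespace (hf-token-secret matches the command above):

kubectl get secret hf-token-secret -n ${NAMESPACE}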

Quick Start

1. Create Storage

Note: Edit model-cache/model-cache.yaml first and update storageClassName to match your cluster (run kubectl get storageclass to find available options).

kubectl apply -f model-cache/model-cache.yaml -n ${NAMESPACE}
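
Before moving on, it can help to confirm the claim bound successfully; if it stays Pending, re-check the storageClassName. (We list all PVCs here rather than assume the claim's name.)

# Expect STATUS "Bound" for the model cache claim
kubectl get pvc -n ${NAMESPACE}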

2. Configure Kubernetes Benchmarking Environment

For multinode Kubernetes deployments, your cluster may require a ComputeDomain in your namespace so that the DRA scheduler can co-locate worker pods on MNNVL-connected nodes. (Otherwise, internode GPU peer memory access would fail.)

kubectl apply -f model-cache/compute-domain.yaml -n ${NAMESPACE}

If you rename the ComputeDomain in this file, mirror that change in the deployment YAMLs under extraPodSpec.resourceClaims and mainContainer.resources.claims.
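
To verify the ComputeDomain object exists before deploying workers (a minimal check; the computedomains resource name comes from NVIDIA's DRA driver and may differ by driver version):

kubectl get computedomains -n ${NAMESPACE}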

3. Setup Model and Data

We use NVIDIA's official NVFP4-quantized checkpoint (Hugging Face). Copy it into the PVC storage:

kubectl apply -f model-cache/model-download.yaml -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=600s

Similarly, copy the trace file for the benchmark into the PVC:

# conversation_trace_synth_16.00x1+10.00_speedup1_maxisl110000.jsonl in our case
kubectl cp <local_trace.jsonl> ${NAMESPACE}/<helper-pod>:/model-cache/traces/
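
To confirm the trace landed where the benchmark expects it:

kubectl exec -n ${NAMESPACE} <helper-pod> -- ls -la /model-cache/traces/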

4. Deploy & Benchmark

Option A: Aggregated (Round-Robin Baseline)

# Deploy
kubectl apply -f trtllm/agg-round-robin/deploy.yaml -n ${NAMESPACE}

# Wait for ready
kubectl wait --for=condition=ready pod -l nvidia.com/dynamo-graph-deployment-name=agg-round-robin-dsv32-nvfp4 \
  -n ${NAMESPACE} --timeout=1200s

# Run benchmark
kubectl apply -f trtllm/agg-round-robin/perf.yaml -n ${NAMESPACE}
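
While the kubectl wait above blocks, you can watch worker pods come up in a second terminal; the same pattern applies to Option B with its own label value:

# Same label selector as the wait command above
kubectl get pods -l nvidia.com/dynamo-graph-deployment-name=agg-round-robin-dsv32-nvfp4 \
  -n ${NAMESPACE} -w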

Option B: Disaggregated (KV-Aware Routing)

# Deploy
kubectl apply -f trtllm/disagg-kv-router/deploy.yaml -n ${NAMESPACE}

# Wait for ready
kubectl wait --for=condition=ready pod -l nvidia.com/dynamo-graph-deployment-name=disagg-kv-dsv32-nvfp4 \
  -n ${NAMESPACE} --timeout=1200s

# Run benchmark
kubectl apply -f trtllm/disagg-kv-router/perf.yaml -n ${NAMESPACE}

5. Monitor Benchmark Progress

The benchmark runs inside a tmux session for easy monitoring:

# Find the benchmark pod
kubectl get pods -n ${NAMESPACE} | grep benchmark

# Attach to the tmux session to see intermediate results
kubectl exec -it -n ${NAMESPACE} <benchmark-pod-name> -- tmux a -t benchmark

# Detach from tmux: Ctrl+B, then D
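
If you prefer not to attach to tmux, you can also follow the pod's logs, though whether intermediate results appear there depends on how the benchmark writes its output:

kubectl logs -f <benchmark-pod-name> -n ${NAMESPACE}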

6. View Results

Results are saved to the perf-cache PVC:

# Check artifact directory
kubectl exec -it -n ${NAMESPACE} <benchmark-pod-name> -- ls -la /perf-cache/artifacts/

# Copy results to local machine
kubectl cp ${NAMESPACE}/<benchmark-pod-name>:/perf-cache/artifacts ./benchmark-results

Expected Results

Since the benchmark uses --fixed-schedule (replaying requests at their original timestamps), throughput is fixed by the trace; the latency metrics below are what we compare:

| Metric | Why It Matters |
|--------|----------------|
| TTFT (Time to First Token) | KV-aware routing reduces prefill compute via prefix cache hits |
| ITL (Inter-Token Latency) | Disaggregated serving isolates decode from prefill interference |
| Total Request Latency | Combined benefit of both optimizations |

For production contexts, we can further evaluate the deployments with goodput, i.e., the rate of requests that satisfy a predetermined service-level agreement (SLA). For our experiments, we set the SLA at TTFT = 20s and ITL = 50ms.
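
As a rough illustration, goodput can be computed from per-request metrics after the fact. The sketch below assumes the artifacts include a JSONL file with per-request ttft_ms and itl_ms fields; both the filename and field names are hypothetical, so adapt them to your artifact format:

# Fraction of requests meeting TTFT <= 20s and ITL <= 50ms
# (per_request_metrics.jsonl, ttft_ms, and itl_ms are assumed names)
jq -s '(map(select(.ttft_ms <= 20000 and .itl_ms <= 50)) | length) / length' \
  per_request_metrics.jsonl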

Cleanup

# Delete benchmark pods
kubectl delete job agg-round-robin-dsv32-nvfp4-bench disagg-kv-dsv32-nvfp4-bench -n ${NAMESPACE}

# Delete deployments
kubectl delete dynamographdeployment agg-round-robin-dsv32-nvfp4 -n ${NAMESPACE}
kubectl delete dynamographdeployment disagg-kv-dsv32-nvfp4 -n ${NAMESPACE}
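
If you also want to reclaim storage, delete the PVCs as well. perf-cache is named in the results step above; model-cache is assumed from the manifest and mount path, so confirm both names with kubectl get pvc first:

# Removes cached model weights and benchmark artifacts
kubectl delete pvc model-cache perf-cache -n ${NAMESPACE}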

References