Aggregated-serving recipes for DeepSeek-V4-Flash on Dynamo. Two backends (vLLM and SGLang) and two hardware targets (B200 and GB200) are documented side by side. All four variants are single-replica decode-only deployments using 4 GPUs.

| Variant | Backend | Hardware | Manifest | Topology | Container |
|---|---|---|---|---|---|
| vllm-agg-b200 | vLLM | 4x B200 | vllm/agg_b200/deploy.yaml | DP=4 + Expert Parallel, TP=1 | Prebuilt NGC image (...1.2.0-deepseek-v4-cuda13-dev.2, multi-arch) |
| vllm-agg-gb200 | vLLM | 4x GB200 | vllm/agg_gb200/deploy.yaml | TP=4 + Expert Parallel, deep_gemm_mega_moe | Prebuilt NGC image (...1.2.0-deepseek-v4-cuda13-dev.2, multi-arch) |
| sglang-agg | SGLang | 4x B200 | sglang/agg/deploy.yaml | TP=4, MXFP4 MoE via FlashInfer, EAGLE MTP 3/4 | Prebuilt NGC image (...1.2.0-deepseek-v4-cuda12-dev.2); optional custom build |
| sglang-agg-gb200 | SGLang | 4x GB200 | sglang/agg-gb200/deploy.yaml | TP=4, MXFP4 MoE via FlashInfer, EAGLE MTP 3/4 | Prebuilt NGC image (...1.2.0-deepseek-v4-cuda13-dev.2, arm64) |

The B200 variants fill 4 of 8 GPUs on a B200 node; the GB200 variants fill all 4 GPUs of a single GB200 NVL4 tray.
Status: Experimental (Day-0). Modality: text only.
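
The variant table above can be captured as a small lookup so that later commands take a variant name instead of a hand-typed path. This is a minimal sketch: the function names `manifest_for` and `graph_name_for` are hypothetical helpers, not part of the recipe, and the deployment names are the ones used by the `kubectl wait` label selectors in this README.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a variant name from the table to its manifest path.
manifest_for() {
  case "$1" in
    vllm-agg-b200)    echo "vllm/agg_b200/deploy.yaml" ;;
    vllm-agg-gb200)   echo "vllm/agg_gb200/deploy.yaml" ;;
    sglang-agg)       echo "sglang/agg/deploy.yaml" ;;
    sglang-agg-gb200) echo "sglang/agg-gb200/deploy.yaml" ;;
    *) echo "unknown variant: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: map a variant name to the graph deployment name
# used in the nvidia.com/dynamo-graph-deployment-name label selector.
graph_name_for() {
  case "$1" in
    vllm-agg-b200|vllm-agg-gb200) echo "dsv4-flash-agg" ;;
    sglang-agg|sglang-agg-gb200)  echo "sglang-dsv4-flash" ;;
    *) echo "unknown variant: $1" >&2; return 1 ;;
  esac
}

# Example use (not executed here):
#   kubectl apply -f "$(manifest_for sglang-agg)" -n "${NAMESPACE}"
```

Keeping the mapping in one place avoids mixing up the underscore (`agg_gb200`) and hyphen (`agg-gb200`) directory names used by the two backends.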
- Dynamo Platform installed — see the Kubernetes Deployment Guide.
- GPU cluster. At least 4 GPUs of the matching architecture available on one node:
  - B200 variants: 4 B200 GPUs (x86_64).
  - GB200 variants: 4 GB200 GPUs (single NVL4 tray, arm64). Nodes must be labeled nvidia.com/gpu.product=NVIDIA-GB200 and tainted kubernetes.io/arch=arm64:NoSchedule (the manifests carry the matching nodeSelector + toleration).
- HuggingFace token with access to deepseek-ai/DeepSeek-V4-Flash.
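
Before applying a GB200 manifest, it can help to confirm that some node actually carries the label and taint listed above. The following is a sketch only: the function names are hypothetical, and it assumes `kubectl` is already pointed at the target cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list nodes carrying the GB200 product label
# that the GB200 manifests select on.
gb200_nodes() {
  kubectl get nodes -l nvidia.com/gpu.product=NVIDIA-GB200 -o name
}

# Hypothetical helper: print each labeled node followed by its taints as
# key=value:effect, so kubernetes.io/arch=arm64:NoSchedule can be confirmed.
gb200_node_taints() {
  kubectl get nodes -l nvidia.com/gpu.product=NVIDIA-GB200 \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .spec.taints[*]}{.key}={.value}:{.effect} {end}{"\n"}{end}'
}

# Example use (not executed here):
#   gb200_nodes
#   gb200_node_taints
```

An empty result from `gb200_nodes` means the GB200 worker pod would stay Pending, since no node matches the nodeSelector.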
Common setup (run once — applies to all variants):
export NAMESPACE=dynamo-demo
kubectl create namespace ${NAMESPACE}
# HuggingFace token secret (consumed by the download Job and, as a convenience, by the worker)
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="your-token-here" \
  -n ${NAMESPACE}
# Download model into the model-cache PVC.
# Edit model-cache/model-cache.yaml and set storageClassName to a RWX class in your cluster.
# The PVC requests 400Gi; DeepSeek-V4-Flash is ~160GB on disk (46 safetensors shards,
# FP4+FP8 mixed) and typically takes 30-60 min to download on first apply.
kubectl apply -f model-cache/model-cache.yaml -n ${NAMESPACE}
kubectl apply -f model-cache/model-download.yaml -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=7200s

# vLLM on B200 (vllm-agg-b200)
kubectl apply -f vllm/agg_b200/deploy.yaml -n ${NAMESPACE}
# First launch of the decode worker takes up to ~60 minutes (weight load +
# FlashInfer autotune + cudagraph warmup). The startup probe is sized for this.
kubectl wait --for=condition=Ready pod \
  -l nvidia.com/dynamo-graph-deployment-name=dsv4-flash-agg \
  -n ${NAMESPACE} --timeout=3600s

# vLLM on GB200 (vllm-agg-gb200)
kubectl apply -f vllm/agg_gb200/deploy.yaml -n ${NAMESPACE}
# First launch ~60 minutes; the manifest's startup probe allows for it.
kubectl wait --for=condition=Ready pod \
  -l nvidia.com/dynamo-graph-deployment-name=dsv4-flash-agg \
  -n ${NAMESPACE} --timeout=3600s

# SGLang on B200 (sglang-agg)
kubectl apply -f sglang/agg/deploy.yaml -n ${NAMESPACE}
# First launch of the decode worker takes up to ~60 minutes (weight load +
# DeepGEMM warmup + cudagraph warmup). The startup probe is sized for this.
kubectl wait --for=condition=Ready pod \
  -l nvidia.com/dynamo-graph-deployment-name=sglang-dsv4-flash \
  -n ${NAMESPACE} --timeout=3600s

# SGLang on GB200 (sglang-agg-gb200)
kubectl apply -f sglang/agg-gb200/deploy.yaml -n ${NAMESPACE}
kubectl wait --for=condition=Ready pod \
  -l nvidia.com/dynamo-graph-deployment-name=sglang-dsv4-flash \
  -n ${NAMESPACE} --timeout=3600s

Port-forward the variant you deployed:
# vLLM
kubectl port-forward svc/dsv4-flash-agg-frontend 8000:8000 -n ${NAMESPACE}
# SGLang
kubectl port-forward svc/sglang-dsv4-flash-frontend 8000:8000 -n ${NAMESPACE}

Either way, the request shape is the same (same model name, same OpenAI-compatible endpoints):
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-V4-Flash",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 100
}'

Key flags for vLLM on B200 (vllm-agg-b200):

| Flag | Purpose |
|---|---|
| --tokenizer-mode deepseek_v4 | Selects the DeepSeek-V4 tokenizer |
| --dyn-reasoning-parser deepseek_v4 | Extracts chain-of-thought into message.reasoning_content |
| --dyn-tool-call-parser deepseek_v4 | Emits OpenAI-compatible structured tool_calls |
| --attention-config '{"use_fp4_indexer_cache":true}' | Blackwell FP4 indexer cache for CSA+HCA attention |
| --kv-cache-dtype fp8 + --block-size 256 | FP8 KV cache; block size matches the upstream recipe |
| --tensor-parallel-size 1 --data-parallel-size 4 --enable-expert-parallel | DP=4 + EP across the 4 GPUs (TP=1) |
| --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' | Single-node DEP compilation config from the upstream recipe |
| --no-enable-flashinfer-autotune | Skip per-shape FlashInfer autotuning at startup; required on dsv4 for correct accuracy |
| --max-num-seqs 256 | Concurrency cap |

Same OpenAI-renderer wiring as the B200 variant; differences below come from the upstream vLLM GB200 recipe for V4-Flash.

| Flag / env | Purpose |
|---|---|
| --tensor-parallel-size 4 --enable-expert-parallel | TP=4 + EP across the 4 GPUs of the NVL4 tray (DP is dropped: the GB200 tray's intra-tray NVLink makes TP attractive for this size class) |
| --moe-backend deep_gemm_mega_moe | DeepGEMM "mega MoE" kernel, the optimized FP8 MoE path for V4 expert routing on Blackwell |
| --no-enable-flashinfer-autotune | Skip per-shape FlashInfer autotuning at startup; required on dsv4 for correct accuracy |
| NCCL_NVLS_ENABLE=1, NCCL_P2P_LEVEL=NVL, VLLM_USE_NCCL_SYMM_MEM=1 | Enable NVLink SHARP (NVLS) multicast for one-shot all-reduce on the tray |

Key flags for SGLang:

| Flag | Purpose |
|---|---|
| --dyn-reasoning-parser deepseek_v4 | Extracts chain-of-thought into message.reasoning_content |
| --dyn-tool-call-parser deepseek_v4 | Emits OpenAI-compatible structured tool_calls |
| --trust-remote-code | Required for the V4 architecture's custom modeling code |
| --tp 4 | Tensor-parallel across the 4 GPUs of one node |
| --moe-runner-backend flashinfer_mxfp4 | MXFP4 MoE kernel via FlashInfer for the V4 expert weights |
| --speculative-algo EAGLE + --speculative-num-steps 3 + --speculative-eagle-topk 1 + --speculative-num-draft-tokens 4 | EAGLE MTP speculative decoding (3 draft steps, top-1 over the EAGLE head, 4 draft tokens per step) |
| --chunked-prefill-size 4096 | Chunk long prompts at 4k tokens for steady-state decode interleaving |
| --disable-flashinfer-autotune | Skip per-shape autotuning at startup; the dsv4 base ships pre-tuned defaults |

| Property | Value |
|---|---|
| Model | deepseek-ai/DeepSeek-V4-Flash (MoE, 284B total / 13B active) |
| Checkpoint | Mixed FP4 (expert weights) + FP8 (attention, norm, router) |
| Attention | Hybrid CSA + HCA with Blackwell FP4 indexer cache |

Recipe-level (per-variant) settings:

| Setting | vLLM B200 (vllm-agg-b200) | vLLM GB200 (vllm-agg-gb200) | SGLang B200 (sglang-agg) | SGLang GB200 (sglang-agg-gb200) |
|---|---|---|---|---|
| Backend image | Prebuilt nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0-deepseek-v4-cuda13-dev.2 (multi-arch) | Prebuilt nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0-deepseek-v4-cuda13-dev.2 (multi-arch) | Prebuilt nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.2.0-deepseek-v4-cuda12-dev.2 | Prebuilt nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.2.0-deepseek-v4-cuda13-dev.2 |
| Parallelism | DP=4 + Expert Parallel, TP=1 | TP=4 + Expert Parallel | TP=4 | TP=4 |
| MoE backend | vLLM's V4 expert kernel (FP4) | DeepGEMM mega MoE | FlashInfer MXFP4 | FlashInfer MXFP4 |
| KV cache | FP8, block size 256 | FP8, block size 256 | engine default | engine default |
| Speculative decoding | none | none | EAGLE MTP (3 steps / 4 draft tokens) | EAGLE MTP (3 steps / 4 draft tokens) |

Same flow on both backends (same model, same --dyn-reasoning-parser deepseek_v4):
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-V4-Flash",
"messages": [{"role": "user", "content": "What is 2+2? Answer briefly."}],
"max_tokens": 200
}' | python3 -m json.tool

Expected:
- choices[0].message.reasoning_content contains the model's chain-of-thought.
- choices[0].message.content contains only the final answer.
- No raw </think> tags in either field.

If reasoning_content is null and </think> appears in content, the reasoning parser isn't wired up; confirm --dyn-reasoning-parser deepseek_v4 is on the worker command.
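
The expectations above can also be checked mechanically rather than by eye. The sketch below is one possible way to do that: `check_reasoning` is a hypothetical helper name, and it assumes `python3` is on the PATH (as the `json.tool` pipe already does). It reads a chat-completions response on stdin and fails if `reasoning_content` is empty or a raw `</think>` tag leaked into either field.

```shell
#!/usr/bin/env bash
# Hypothetical helper: validate a chat-completions response read from stdin.
# Prints OK and returns 0 when reasoning was split out cleanly; otherwise
# prints a FAIL line and returns non-zero.
check_reasoning() {
  python3 -c '
import json, sys
msg = json.load(sys.stdin)["choices"][0]["message"]
reasoning = msg.get("reasoning_content") or ""
content = msg.get("content") or ""
problems = []
if not reasoning:
    problems.append("reasoning_content is empty or null")
if "</think>" in reasoning or "</think>" in content:
    problems.append("raw </think> tag leaked into the response")
if problems:
    print("FAIL: " + "; ".join(problems))
    sys.exit(1)
print("OK")
'
}

# Example use (not executed here): replace the json.tool pipe with
#   curl -s http://localhost:8000/v1/chat/completions ... | check_reasoning
```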
Same flow on both backends:
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-V4-Flash",
"messages": [{"role": "user", "content": "What is the weather in San Francisco?"}],
"tools": [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string", "description": "City name"}
},
"required": ["location"]
}
}
}],
"max_tokens": 300
}' | python3 -m json.tool

Expected:
- choices[0].message.tool_calls is a structured array with function.name, function.arguments, and id.
- choices[0].finish_reason is "tool_calls".
- choices[0].message.reasoning_content may contain the model's reasoning about tool selection.

If tool_calls is missing and raw tool-call markers appear in content, confirm --dyn-tool-call-parser deepseek_v4 is on the worker command.
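
A scripted version of the tool-call check might look like the following. It is a sketch under stated assumptions: `check_tool_calls` is a hypothetical helper name, `python3` is assumed to be on the PATH, and `function.arguments` is assumed to arrive as a JSON-encoded string, as in the OpenAI chat-completions format.

```shell
#!/usr/bin/env bash
# Hypothetical helper: validate the tool-call shape of a chat-completions
# response read from stdin. Prints "<name> <arguments>" per call on success.
check_tool_calls() {
  python3 -c '
import json, sys
choice = json.load(sys.stdin)["choices"][0]
calls = choice["message"].get("tool_calls") or []
if not calls:
    sys.exit("FAIL: message.tool_calls is missing or empty")
if choice.get("finish_reason") != "tool_calls":
    sys.exit("FAIL: finish_reason is not tool_calls")
for call in calls:
    fn = call.get("function", {})
    if not (call.get("id") and fn.get("name") and "arguments" in fn):
        sys.exit("FAIL: tool call lacks id, function.name, or function.arguments")
    json.loads(fn["arguments"])  # assumed to be a JSON-encoded string
    print(fn["name"], fn["arguments"])
'
}

# Example use (not executed here): replace the json.tool pipe with
#   curl -s http://localhost:8000/v1/chat/completions ... | check_tool_calls
```

Parsing `arguments` a second time catches the case where the parser emits a structured call whose argument payload is not valid JSON.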
- Storage class. Update storageClassName in model-cache/model-cache.yaml to a RWX class that can serve the PVC to Frontend and worker pods.
- Model size. deepseek-ai/DeepSeek-V4-Flash is ~160 GB on disk (46 safetensors shards in FP4+FP8 mixed form). The 400Gi PVC leaves headroom for HF cache metadata and one alternate revision.
- Parser flags. Use the Dynamo variants on the worker (--dyn-reasoning-parser, --dyn-tool-call-parser). Each engine's native --reasoning-parser / --tool-call-parser are engine-side and do not feed the Dynamo OpenAI renderer.
- Offline model cache. Both workers run with HF_HUB_OFFLINE=1 so the engine reads cached weights from the PVC and never contacts the HF Hub at startup. The HF token secret is mounted defensively; it isn't required at runtime once the download Job has completed.
- First launch is slow. Decode workers load weights and warm CUDA graphs / DeepGEMM kernels on first launch; the manifests' startup probes allow up to ~60 min (failureThreshold: 360 at periodSeconds: 10).
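
The ~60 min figure follows from the probe settings: a Kubernetes startup probe tolerates roughly failureThreshold × periodSeconds of failed checks before the container is restarted. The arithmetic is sketched below; `probe_budget_seconds` is a hypothetical helper name, and any initial delay configured on the probe is ignored.

```shell
#!/usr/bin/env bash
# Hypothetical helper: startup-probe budget = failureThreshold x periodSeconds.
probe_budget_seconds() {
  local failure_threshold="$1" period_seconds="$2"
  echo $(( failure_threshold * period_seconds ))
}

budget="$(probe_budget_seconds 360 10)"
echo "startup budget: ${budget}s ($(( budget / 60 )) min)"
# prints: startup budget: 3600s (60 min)
```

The result lines up with the `--timeout=3600s` used by the `kubectl wait` commands, so the wait and the probe give up at about the same time.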

vLLM-specific notes:

- Prebuilt images. Both vllm/agg_b200/deploy.yaml and vllm/agg_gb200/deploy.yaml reference nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0-deepseek-v4-cuda13-dev.2 (multi-arch). To rebuild from source (custom Dynamo branch, different vLLM base, etc.), see <repo_root>/container/README.md.
- Engine-ready timeout. VLLM_ENGINE_READY_TIMEOUT_S=3600 matches the startup probe budget on both variants.
- DP stability (B200 only). VLLM_RANDOMIZE_DP_DUMMY_INPUTS=1 and VLLM_SKIP_P2P_CHECK=1 mirror the DeepSeek-R1 vLLM recipe and stabilize DP dummy inputs. The GB200 variant uses TP (no DP), so VLLM_RANDOMIZE_DP_DUMMY_INPUTS is not set.
- FlashInfer autotune. --no-enable-flashinfer-autotune skips per-shape FlashInfer autotuning at startup and is set on both vLLM variants. Required on dsv4: the autotuner currently produces tunings that regress GSM8k accuracy. Skipping it also shortens first-launch warmup.
- FlashInfer TRT-LLM allreduce on GB200. You may see a non-fatal startup warning: Failed to initialize FlashInfer Allreduce norm fusion workspace ... Flashinfer allreduce-norm fusion will be disabled. vLLM falls back to a non-fused allreduce + RMSNorm; correctness is unaffected. To enable the fused kernel, set the compilation pass: --compilation-config '{"mode":3,"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"],"pass_config":{"fuse_allreduce_rms":true}}'.

SGLang-specific notes:

- Prebuilt images. sglang/agg/deploy.yaml references nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.2.0-deepseek-v4-cuda12-dev.2 and sglang/agg-gb200/deploy.yaml references the arm64 sibling nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.2.0-deepseek-v4-cuda13-dev.2. To rebuild either (custom Dynamo branch, different SGLang base, etc.), see recipes/deepseek-v4/container/README.md.
- DeepGEMM / FlashInfer warmup. SGLANG_JIT_DEEPGEMM_PRECOMPILE=0 + SGLANG_JIT_DEEPGEMM_FAST_WARMUP=1 skip the slow precompile and use the fast warmup path. --disable-flashinfer-autotune skips per-shape FlashInfer autotuning at startup; the dsv4 base ships pre-tuned defaults.
- NCCL / Gloo. NCCL_CUMEM_ENABLE=1 is set for V4 NCCL collectives on Blackwell. GLOO_SOCKET_IFNAME=eth0 pins Gloo to the standard pod interface.

DeepSeek-V4-Pro is the larger sibling (1.6T / 49B active, 1M context, 8x B200) and shares the same dsv4 vLLM and SGLang container images.