DeepSeek-V4-Flash Recipe

Aggregated-serving recipes for DeepSeek-V4-Flash on Dynamo. Two backends (vLLM and SGLang) and two hardware targets (B200 and GB200) are documented side by side. All four variants are single-replica decode-only deployments using 4 GPUs.

| Variant | Backend | Hardware | Manifest | Topology | Container |
| --- | --- | --- | --- | --- | --- |
| vllm-agg-b200 | vLLM | 4x B200 | `vllm/agg_b200/deploy.yaml` | DP=4 + Expert Parallel, TP=1 | Prebuilt NGC image (`...1.2.0-deepseek-v4-cuda13-dev.2`, multi-arch) |
| vllm-agg-gb200 | vLLM | 4x GB200 | `vllm/agg_gb200/deploy.yaml` | TP=4 + Expert Parallel, `deep_gemm_mega_moe` | Prebuilt NGC image (`...1.2.0-deepseek-v4-cuda13-dev.2`, multi-arch) |
| sglang-agg | SGLang | 4x B200 | `sglang/agg/deploy.yaml` | TP=4, MXFP4 MoE via FlashInfer, EAGLE MTP 3/4 | Prebuilt NGC image (`...1.2.0-deepseek-v4-cuda12-dev.2`); optional custom build |
| sglang-agg-gb200 | SGLang | 4x GB200 | `sglang/agg-gb200/deploy.yaml` | TP=4, MXFP4 MoE via FlashInfer, EAGLE MTP 3/4 | Prebuilt NGC image (`...1.2.0-deepseek-v4-cuda13-dev.2`, arm64) |

The B200 variants fill 4 of 8 GPUs on a B200 node; the GB200 variants fill all 4 GPUs of a single GB200 NVL4 tray.

Status: Experimental (Day-0). Modality: text only.

Prerequisites

- Dynamo Platform installed; see the Kubernetes Deployment Guide.
- GPU cluster with at least 4 GPUs of the matching architecture available on one node:
  - B200 variants: 4 B200 GPUs (x86_64).
  - GB200 variants: 4 GB200 GPUs (single NVL4 tray, arm64). Nodes must be labeled `nvidia.com/gpu.product=NVIDIA-GB200` and tainted `kubernetes.io/arch=arm64:NoSchedule` (the manifests carry the matching nodeSelector and toleration).
- HuggingFace token with access to `deepseek-ai/DeepSeek-V4-Flash`.

Quick Start

Common setup (run once — applies to all variants):

export NAMESPACE=dynamo-demo
kubectl create namespace ${NAMESPACE}

# HuggingFace token secret (consumed by the download Job and, as a convenience, by the worker)
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="your-token-here" \
  -n ${NAMESPACE}

# Download model into the model-cache PVC.
# Edit model-cache/model-cache.yaml and set storageClassName to a RWX class in your cluster.
# The PVC requests 400Gi; DeepSeek-V4-Flash is ~160GB on disk (46 safetensors shards,
# FP4+FP8 mixed) and typically takes 30-60 min to download on first apply.
kubectl apply -f model-cache/model-cache.yaml -n ${NAMESPACE}
kubectl apply -f model-cache/model-download.yaml -n ${NAMESPACE}
kubectl wait --for=condition=Complete job/model-download -n ${NAMESPACE} --timeout=7200s

Deploy — vLLM B200 (`vllm-agg-b200`)

kubectl apply -f vllm/agg_b200/deploy.yaml -n ${NAMESPACE}

# First launch of the decode worker takes up to ~60 minutes (weight load +
# FlashInfer autotune + cudagraph warmup). The startup probe is sized for this.
kubectl wait --for=condition=Ready pod \
  -l nvidia.com/dynamo-graph-deployment-name=dsv4-flash-agg \
  -n ${NAMESPACE} --timeout=3600s

Deploy — vLLM GB200 (`vllm-agg-gb200`)

kubectl apply -f vllm/agg_gb200/deploy.yaml -n ${NAMESPACE}

# First launch ~60 minutes; the manifest's startup probe allows for it.
kubectl wait --for=condition=Ready pod \
  -l nvidia.com/dynamo-graph-deployment-name=dsv4-flash-agg \
  -n ${NAMESPACE} --timeout=3600s

Deploy — SGLang B200 (`sglang-agg`)

kubectl apply -f sglang/agg/deploy.yaml -n ${NAMESPACE}

# First launch of the decode worker takes up to ~60 minutes (weight load +
# DeepGEMM warmup + cudagraph warmup). The startup probe is sized for this.
kubectl wait --for=condition=Ready pod \
  -l nvidia.com/dynamo-graph-deployment-name=sglang-dsv4-flash \
  -n ${NAMESPACE} --timeout=3600s

Deploy — SGLang GB200 (`sglang-agg-gb200`)

kubectl apply -f sglang/agg-gb200/deploy.yaml -n ${NAMESPACE}

kubectl wait --for=condition=Ready pod \
  -l nvidia.com/dynamo-graph-deployment-name=sglang-dsv4-flash \
  -n ${NAMESPACE} --timeout=3600s

Test the Deployment

Port-forward the variant you deployed:

# vLLM
kubectl port-forward svc/dsv4-flash-agg-frontend 8000:8000 -n ${NAMESPACE}

# SGLang
kubectl port-forward svc/sglang-dsv4-flash-frontend 8000:8000 -n ${NAMESPACE}

Either way the request shape is the same — same model name, same OpenAI-compatible endpoints:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V4-Flash",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
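
For scripted checks, the same request can be sent from Python. This is a minimal sketch using only the standard library; it assumes the port-forward above is active on `localhost:8000`, and the helper names `build_chat_request` and `chat` are illustrative, not part of Dynamo.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: the port-forward above is running
MODEL = "deepseek-ai/DeepSeek-V4-Flash"


def build_chat_request(prompt, max_tokens=100):
    """Build the same POST /v1/chat/completions request as the curl example."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, max_tokens=100, timeout=120.0):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_chat_request(prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `chat("Hello!")` should return a dict whose `choices[0].message` holds the reply, provided the frontend is reachable.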

Recipe Details

vLLM B200 (`vllm/agg_b200/deploy.yaml`)

| Flag | Purpose |
| --- | --- |
| `--tokenizer-mode deepseek_v4` | Selects the DeepSeek-V4 tokenizer |
| `--dyn-reasoning-parser deepseek_v4` | Extracts chain-of-thought into `message.reasoning_content` |
| `--dyn-tool-call-parser deepseek_v4` | Emits OpenAI-compatible structured `tool_calls` |
| `--attention-config '{"use_fp4_indexer_cache":true}'` | Blackwell FP4 indexer cache for CSA+HCA attention |
| `--kv-cache-dtype fp8` + `--block-size 256` | FP8 KV cache; block size matches the upstream recipe |
| `--tensor-parallel-size 1 --data-parallel-size 4 --enable-expert-parallel` | DP=4 + EP across the 4 GPUs (TP=1) |
| `--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}'` | Single-node DEP compilation config from the upstream recipe |
| `--no-enable-flashinfer-autotune` | Skip per-shape FlashInfer autotuning at startup; required on dsv4 for correct accuracy |
| `--max-num-seqs 256` | Concurrency cap |

vLLM GB200 (`vllm/agg_gb200/deploy.yaml`)

Same OpenAI-renderer wiring as the B200 variant; differences below come from the upstream vLLM GB200 recipe for V4-Flash.

| Flag / env | Purpose |
| --- | --- |
| `--tensor-parallel-size 4 --enable-expert-parallel` | TP=4 + EP across the 4 GPUs of the NVL4 tray (DP is dropped: the GB200 tray's intra-tray NVLink makes TP attractive for this size class) |
| `--moe-backend deep_gemm_mega_moe` | DeepGEMM "mega MoE" kernel, the optimized FP8 MoE path for V4 expert routing on Blackwell |
| `--no-enable-flashinfer-autotune` | Skip per-shape FlashInfer autotuning at startup; required on dsv4 for correct accuracy |
| `NCCL_NVLS_ENABLE=1`, `NCCL_P2P_LEVEL=NVL`, `VLLM_USE_NCCL_SYMM_MEM=1` | Enable NVLink SHARP (NVLS) multicast for one-shot all-reduce on the tray |

SGLang B200 (`sglang/agg/deploy.yaml`)

| Flag | Purpose |
| --- | --- |
| `--dyn-reasoning-parser deepseek_v4` | Extracts chain-of-thought into `message.reasoning_content` |
| `--dyn-tool-call-parser deepseek_v4` | Emits OpenAI-compatible structured `tool_calls` |
| `--trust-remote-code` | Required for the V4 architecture's custom modeling code |
| `--tp 4` | Tensor-parallel across the 4 GPUs of one node |
| `--moe-runner-backend flashinfer_mxfp4` | MXFP4 MoE kernel via FlashInfer for the V4 expert weights |
| `--speculative-algo EAGLE` + `--speculative-num-steps 3` + `--speculative-eagle-topk 1` + `--speculative-num-draft-tokens 4` | EAGLE MTP speculative decoding (3 draft steps, top-1 over the EAGLE head, 4 draft tokens per verification pass) |
| `--chunked-prefill-size 4096` | Chunk long prompts at 4k tokens for steady-state decode interleaving |
| `--disable-flashinfer-autotune` | Skip per-shape autotuning at startup; the dsv4 base ships pre-tuned defaults |

Model Details


| Property | Value |
| --- | --- |
| Model | `deepseek-ai/DeepSeek-V4-Flash` (MoE, 284B total / 13B active) |
| Checkpoint | Mixed FP4 (expert weights) + FP8 (attention, norm, router) |
| Attention | Hybrid CSA + HCA with Blackwell FP4 indexer cache |

Recipe-level (per-variant) settings:

| Setting | vLLM B200 (`vllm-agg-b200`) | vLLM GB200 (`vllm-agg-gb200`) | SGLang B200 (`sglang-agg`) | SGLang GB200 (`sglang-agg-gb200`) |
| --- | --- | --- | --- | --- |
| Backend image | Prebuilt `nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0-deepseek-v4-cuda13-dev.2` (multi-arch) | Prebuilt `nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0-deepseek-v4-cuda13-dev.2` (multi-arch) | Prebuilt `nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.2.0-deepseek-v4-cuda12-dev.2` | Prebuilt `nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.2.0-deepseek-v4-cuda13-dev.2` |
| Parallelism | DP=4 + Expert Parallel, TP=1 | TP=4 + Expert Parallel | TP=4 | TP=4 |
| MoE backend | vLLM's V4 expert kernel (FP4) | DeepGEMM mega MoE | FlashInfer MXFP4 | FlashInfer MXFP4 |
| KV cache | FP8, block size 256 | FP8, block size 256 | engine default | engine default |
| Speculative decoding | none | none | EAGLE MTP (3 steps / 4 draft tokens) | EAGLE MTP (3 steps / 4 draft tokens) |

Verifying Reasoning

The flow is the same on both backends (same model, same `--dyn-reasoning-parser deepseek_v4`):

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V4-Flash",
    "messages": [{"role": "user", "content": "What is 2+2? Answer briefly."}],
    "max_tokens": 200
  }' | python3 -m json.tool

Expected:

- `choices[0].message.reasoning_content` contains the model's chain-of-thought.
- `choices[0].message.content` contains only the final answer.
- No raw `</think>` tags in either field.

If `reasoning_content` is null and `</think>` appears in `content`, the reasoning parser isn't wired up; confirm `--dyn-reasoning-parser deepseek_v4` is on the worker command.
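
The expectations above can also be checked programmatically. The sketch below assumes the response has already been decoded into a Python dict (for example with `json.loads`); `check_reasoning` is a hypothetical helper name, not a Dynamo API.

```python
def check_reasoning(response):
    """Return a list of problems; an empty list means the response looks as expected."""
    message = response["choices"][0]["message"]
    reasoning = message.get("reasoning_content") or ""
    content = message.get("content") or ""

    problems = []
    if not reasoning:
        problems.append("reasoning_content is missing or empty")
    # A raw closing tag in either field suggests the reasoning parser is not active.
    for field, text in (("reasoning_content", reasoning), ("content", content)):
        if "</think>" in text:
            problems.append(f"raw </think> tag found in {field}")
    return problems
```

An empty result matches the three expectations listed above; any returned message points at the field to inspect.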

Verifying Tool Calling

The flow is the same on both backends:

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V4-Flash",
    "messages": [{"role": "user", "content": "What is the weather in San Francisco?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {"type": "string", "description": "City name"}
          },
          "required": ["location"]
        }
      }
    }],
    "max_tokens": 300
  }' | python3 -m json.tool

Expected:

- `choices[0].message.tool_calls` is a structured array with `function.name`, `function.arguments`, and `id`.
- `choices[0].finish_reason` is `"tool_calls"`.
- `choices[0].message.reasoning_content` may contain the model's reasoning about tool selection.

If `tool_calls` is missing and raw tool-call markers appear in `content`, confirm `--dyn-tool-call-parser deepseek_v4` is on the worker command.
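
A similar check works for tool calls. This sketch assumes the decoded response dict follows the OpenAI chat-completions shape, where `function.arguments` is a JSON-encoded string; `check_tool_call` is a hypothetical helper name.

```python
import json


def check_tool_call(response, expected_name="get_weather"):
    """Return a list of problems; an empty list means the tool call looks as expected."""
    choice = response["choices"][0]
    calls = choice["message"].get("tool_calls") or []

    problems = []
    if not calls:
        problems.append("tool_calls is missing or empty")
    if choice.get("finish_reason") != "tool_calls":
        problems.append(f"finish_reason is {choice.get('finish_reason')!r}, expected 'tool_calls'")
    for call in calls:
        function = call.get("function") or {}
        if not call.get("id"):
            problems.append("tool call has no id")
        if function.get("name") != expected_name:
            problems.append(f"unexpected function name {function.get('name')!r}")
        try:
            json.loads(function.get("arguments"))
        except (TypeError, ValueError):
            problems.append("function.arguments is not a valid JSON string")
    return problems
```

The `reasoning_content` field is deliberately not checked here, since the expectations above only say it may be present.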

Notes

Common

- Storage class. Update `storageClassName` in `model-cache/model-cache.yaml` to a RWX class that can serve the PVC to Frontend and worker pods.
- Model size. `deepseek-ai/DeepSeek-V4-Flash` is ~160 GB on disk (46 safetensors shards in FP4+FP8 mixed form). The 400Gi PVC leaves headroom for HF cache metadata and one alternate revision.
- Parser flags. Use the Dynamo variants on the worker (`--dyn-reasoning-parser`, `--dyn-tool-call-parser`). Each engine's native `--reasoning-parser` / `--tool-call-parser` are engine-side and do not feed the Dynamo OpenAI renderer.
- Offline model cache. Workers on both backends run with `HF_HUB_OFFLINE=1` so the engine reads cached weights from the PVC and never contacts the HF Hub at startup. The HF token secret is mounted defensively; it isn't required at runtime once the download Job has completed.
- First launch is slow. Decode workers load weights and warm CUDA graphs / DeepGEMM kernels on first launch; the manifests' startup probes allow up to ~60 min (`failureThreshold: 360` at `periodSeconds: 10`).
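
The ~60 min figure follows from how Kubernetes startup probes work: the container is restarted after `failureThreshold` consecutive failed probes spaced `periodSeconds` apart. A small sketch of the arithmetic, assuming `initialDelaySeconds` is left at the Kubernetes default of 0 (the manifests' actual value is not shown here):

```python
def startup_probe_budget_seconds(failure_threshold, period_seconds, initial_delay_seconds=0):
    """Worst-case time a startup probe tolerates before the container is restarted."""
    return initial_delay_seconds + failure_threshold * period_seconds


budget = startup_probe_budget_seconds(failure_threshold=360, period_seconds=10)
print(f"{budget} s = {budget // 60} min")  # prints "3600 s = 60 min"
```

This is the same 3600 s used for `kubectl wait --timeout=3600s` in the deploy steps, so the wait and the probe give up at roughly the same time.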

vLLM-specific

- Prebuilt images. Both `vllm/agg_b200/deploy.yaml` and `vllm/agg_gb200/deploy.yaml` reference `nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.2.0-deepseek-v4-cuda13-dev.2` (multi-arch). To rebuild from source (custom Dynamo branch, different vLLM base, etc.), see `<repo_root>/container/README.md`.
- Engine-ready timeout. `VLLM_ENGINE_READY_TIMEOUT_S=3600` matches the startup probe budget on both variants.
- DP stability (B200 only). `VLLM_RANDOMIZE_DP_DUMMY_INPUTS=1` and `VLLM_SKIP_P2P_CHECK=1` mirror the DeepSeek-R1 vLLM recipe and stabilize DP dummy inputs. The GB200 variant uses TP (no DP), so `VLLM_RANDOMIZE_DP_DUMMY_INPUTS` is not set.
- FlashInfer autotune. `--no-enable-flashinfer-autotune` skips per-shape FlashInfer autotuning at startup and is set on both vLLM variants. Required on dsv4: the autotuner currently produces tunings that regress GSM8k accuracy. Skipping it also shortens first-launch warmup.
- FlashInfer TRT-LLM allreduce on GB200. You may see a non-fatal startup warning: `Failed to initialize FlashInfer Allreduce norm fusion workspace ... Flashinfer allreduce-norm fusion will be disabled.` vLLM falls back to a non-fused allreduce + RMSNorm; correctness is unaffected. To enable the fused kernel, set the compilation pass: `--compilation-config '{"mode":3,"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"],"pass_config":{"fuse_allreduce_rms":true}}'`.

SGLang-specific

- Prebuilt images. `sglang/agg/deploy.yaml` references `nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.2.0-deepseek-v4-cuda12-dev.2` and `sglang/agg-gb200/deploy.yaml` references the arm64 sibling `nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.2.0-deepseek-v4-cuda13-dev.2`. To rebuild either (custom Dynamo branch, different SGLang base, etc.), see `recipes/deepseek-v4/container/README.md`.
- DeepGEMM / FlashInfer warmup. `SGLANG_JIT_DEEPGEMM_PRECOMPILE=0` + `SGLANG_JIT_DEEPGEMM_FAST_WARMUP=1` skip the slow precompile and use the fast warmup path. `--disable-flashinfer-autotune` skips per-shape FlashInfer autotuning at startup; the dsv4 base ships pre-tuned defaults.
- NCCL / Gloo. `NCCL_CUMEM_ENABLE=1` is set for V4 NCCL collectives on Blackwell. `GLOO_SOCKET_IFNAME=eth0` pins Gloo to the standard pod interface.

Sibling Recipe

DeepSeek-V4-Pro is the larger sibling (1.6T / 49B active, 1M context, 8x B200) and shares the same dsv4 vLLM and SGLang container images.
