Production-ready deployments for DeepSeek-R1 (671B MoE) across multiple backends and hardware configurations.
| Configuration | GPUs | Backend | Mode | Description |
|---|---|---|---|---|
| sglang/disagg-8gpu | 16x H200 | SGLang | Disaggregated WideEP | TP=8 per worker, single-node |
| sglang/disagg-16gpu | 32x H200 | SGLang | Disaggregated WideEP | TP=16 per worker, multi-node |
| trtllm/disagg/wide_ep/gb200 | 36x GB200 | TensorRT-LLM | Disaggregated WideEP | 8 decode + 1 prefill nodes |
| vllm/disagg | 32x H200 | vLLM | Disaggregated DEP16 | Multi-node, data-expert parallel |
- Dynamo Platform installed — See Kubernetes Deployment Guide
- GPU cluster with H200 or GB200 GPUs matching the configuration requirements
- HuggingFace token with access to DeepSeek models
- High-bandwidth networking — InfiniBand or RoCE recommended for multi-node deployments
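Before deploying, it can help to confirm the cluster actually exposes the GPUs you expect. A minimal sanity check, assuming the NVIDIA GPU Operator (or device plugin) is installed and advertises the `nvidia.com/gpu` resource:

```bash
# List nodes with their allocatable GPU count
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
```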
```bash
# Set namespace
export NAMESPACE=dynamo-demo
kubectl create namespace ${NAMESPACE}

# Create HuggingFace token secret
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="your-token-here" \
  -n ${NAMESPACE}

# Download model (update storageClassName in model-cache.yaml first!)
# For SGLang deployments:
kubectl apply -f model-cache/model-cache.yaml -n ${NAMESPACE}
kubectl apply -f model-cache/model-download-sglang.yaml -n ${NAMESPACE}

# For vLLM/TRT-LLM deployments:
kubectl apply -f model-cache/model-cache.yaml -n ${NAMESPACE}
kubectl apply -f model-cache/model-download.yaml -n ${NAMESPACE}

# Wait for download (this is a large model - may take 1+ hours)
# For SGLang: kubectl wait --for=condition=Complete job/model-download-sglang ...
# For vLLM/TRT-LLM: kubectl wait --for=condition=Complete job/model-download ...
kubectl wait --for=condition=Complete job/model-download-sglang -n ${NAMESPACE} --timeout=7200s
```
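The wait command blocks silently, so in another terminal you may want to follow the download job itself (shown for the SGLang job; substitute `job/model-download` for vLLM/TRT-LLM):

```bash
# Stream logs from the running download job to watch progress
kubectl logs -f job/model-download-sglang -n ${NAMESPACE}
```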
```bash
# Deploy (choose one configuration)
kubectl apply -f sglang/disagg-8gpu/deploy.yaml -n ${NAMESPACE}
```
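The manifest creates frontend and worker pods that can take several minutes to schedule, pull images, and load weights. A simple way to watch progress:

```bash
# Watch pods until the frontend and all workers are Running/Ready
kubectl get pods -n ${NAMESPACE} -w
```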
```bash
# Port-forward the frontend (service name varies by deployment)
kubectl port-forward svc/sgl-dsr1-8gpu-frontend 8000:8000 -n ${NAMESPACE}

# Send a test request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
```
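If the basic request succeeds, a streaming variant is handy for eyeballing token latency. This sketch assumes the frontend honors the OpenAI-compatible `stream` flag (as the `/v1/chat/completions` route suggests):

```bash
# -N disables curl's output buffering so streamed chunks appear as they arrive
curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1",
    "messages": [{"role": "user", "content": "Count to five."}],
    "max_tokens": 100,
    "stream": true
  }'
```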
- Model: deepseek-ai/DeepSeek-R1
- Architecture: 671B-parameter Mixture-of-Experts (MoE)
- Active parameters: ~37B per token
- Recommended: FP8 quantization for production deployments
DeepSeek-R1 is a very large model requiring significant GPU memory:
| Configuration | Min GPU Memory | Recommended |
|---|---|---|
| 16x H200 (SGLang TP=8) | 1.1TB total | H200 SXM (141GB each) |
| 32x H200 (SGLang TP=16, vLLM) | 2.2TB total | H200 SXM (141GB each) |
| 36x GB200 (TRT-LLM) | ~2.5TB total | GB200 NVL72 |
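As a back-of-envelope check on these figures (an estimate, assuming FP8 weights at roughly one byte per parameter): 671B parameters come to about 671 GB of weights alone, and a TP=8 worker on H200s has 8 × 141 GB ≈ 1.1 TB of HBM, so weights occupy roughly 60% of it before any KV cache, activations, or CUDA overhead. This is why the minimums above leave limited headroom for long-context workloads.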
- Model download time: DeepSeek-R1 is ~1.3TB; expect 1-2 hours for download
- NCCL errors: Usually indicate OOM. Reduce `--mem-fraction-static` in worker args
- Multi-node: Requires InfiniBand/IBGDA enabled. See vLLM EP docs
- Storage class: Update `storageClassName` in `model-cache/model-cache.yaml` before deploying (see the sketch below)
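One way to handle the storage class update before deploying (a sketch; `your-storage-class` is a placeholder for a class that exists in your cluster):

```bash
# See which storage classes the cluster offers
kubectl get storageclass

# Point the model cache PVC at yours (placeholder name below)
sed -i 's/storageClassName: .*/storageClassName: your-storage-class/' model-cache/model-cache.yaml
```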
- SGLang: Uses WideEP (Wide Expert Parallelism) for efficient MoE inference; see sglang/README.md for SGLang-specific configuration
- TensorRT-LLM (GB200): Requires an FP4-quantized checkpoint and includes GB200-specific optimizations
- vLLM: Uses DEP (Data-Expert Parallelism) with hybrid load balancing; see vllm/disagg/README.md for detailed setup