DeepSeek-R1 Recipes

Production-ready deployments for DeepSeek-R1 (671B MoE) across multiple backends and hardware configurations.

Available Configurations

| Configuration | GPUs | Backend | Mode | Description |
|---|---|---|---|---|
| sglang/disagg-8gpu | 16x H200 | SGLang | Disaggregated WideEP | TP=8 per worker, single-node |
| sglang/disagg-16gpu | 32x H200 | SGLang | Disaggregated WideEP | TP=16 per worker, multi-node |
| trtllm/disagg/wide_ep/gb200 | 36x GB200 | TensorRT-LLM | Disaggregated WideEP | 8 decode + 1 prefill nodes |
| vllm/disagg | 32x H200 | vLLM | Disaggregated DEP16 | Multi-node, data-expert parallel |

Prerequisites

  1. Dynamo Platform installed — See Kubernetes Deployment Guide
  2. GPU cluster with H200 or GB200 GPUs matching the configuration requirements
  3. HuggingFace token with access to DeepSeek models
  4. High-bandwidth networking — InfiniBand or RoCE recommended for multi-node deployments
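
To sanity-check these prerequisites before deploying, the commands below may help; the GPU node label assumes the NVIDIA GPU Operator is installed, and `huggingface-cli` assumes the huggingface_hub package is available locally:

# Verify GPU nodes are visible and labeled (label assumes the NVIDIA GPU Operator)
kubectl get nodes -L nvidia.com/gpu.product

# Verify your HuggingFace token locally before creating the secret
huggingface-cli whoami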

Quick Start

# Set namespace
export NAMESPACE=dynamo-demo
kubectl create namespace ${NAMESPACE}

# Create HuggingFace token secret
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN="your-token-here" \
  -n ${NAMESPACE}

# Download model (update storageClassName in model-cache/model-cache.yaml first!)
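# Example: list storage classes, then substitute one for the placeholder below
kubectl get storageclass
sed -i 's/storageClassName: .*/storageClassName: your-storage-class/' model-cache/model-cache.yaml
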
# For SGLang deployments:
kubectl apply -f model-cache/model-cache.yaml -n ${NAMESPACE}
kubectl apply -f model-cache/model-download-sglang.yaml -n ${NAMESPACE}

# For vLLM/TRT-LLM deployments:
kubectl apply -f model-cache/model-cache.yaml -n ${NAMESPACE}
kubectl apply -f model-cache/model-download.yaml -n ${NAMESPACE}

# Wait for the download (DeepSeek-R1 is ~1.3TB; expect 1-2 hours)
# For SGLang: kubectl wait --for=condition=Complete job/model-download-sglang ...
# For vLLM/TRT-LLM: kubectl wait --for=condition=Complete job/model-download ...
kubectl wait --for=condition=Complete job/model-download-sglang -n ${NAMESPACE} --timeout=7200s

# Deploy (choose one configuration)
kubectl apply -f sglang/disagg-8gpu/deploy.yaml -n ${NAMESPACE}
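
To watch the rollout, the standard kubectl commands below work; pod and deployment names vary by configuration:

# Watch pods come up
kubectl get pods -n ${NAMESPACE} -w

# Debug a pod stuck in Pending or CrashLoopBackOff
kubectl describe pod <pod-name> -n ${NAMESPACE}
kubectl logs <pod-name> -n ${NAMESPACE}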

Test the Deployment

# Port-forward the frontend (service name varies by deployment)
kubectl port-forward svc/sgl-dsr1-8gpu-frontend 8000:8000 -n ${NAMESPACE}

# Send a test request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
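
The frontend is OpenAI-compatible (note the /v1/chat/completions path above), so streaming should also work; `stream` is the standard OpenAI request field, assumed supported here:

# Stream tokens as they are generated (-N disables curl output buffering)
curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100,
    "stream": true
  }'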

Model Details

  • Model: deepseek-ai/DeepSeek-R1
  • Architecture: 671B parameter Mixture-of-Experts (MoE)
  • Active parameters: ~37B per token
  • Recommended: FP8 quantization for production deployments

Hardware Requirements

DeepSeek-R1 is a very large model requiring significant GPU memory:

| Configuration | Min GPU Memory | Recommended |
|---|---|---|
| 16x H200 (SGLang TP=8) | 1.1TB total | H200 SXM (141GB each) |
| 32x H200 (SGLang TP=16, vLLM) | 2.2TB total | H200 SXM (141GB each) |
| 36x GB200 (TRT-LLM) | ~2.5TB total | GB200 NVL72 |
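
These totals follow from simple parameter arithmetic; a back-of-envelope sketch (the byte-per-parameter figures are standard for the dtypes, not taken from this repo):

# 671e9 params x 1 byte (FP8)   ~= 671 GB of weights
# 671e9 params x 2 bytes (BF16) ~= 1.34 TB of weights
# The minimums above add headroom for KV cache, activations, and CUDA overhead.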

Notes

  • Model download time: DeepSeek-R1 is ~1.3TB; expect 1-2 hours for download
  • NCCL errors: These usually indicate OOM; reduce --mem-fraction-static in the worker args (see the sketch after this list)
  • Multi-node: Requires InfiniBand/IBGDA enabled. See vLLM EP docs
  • Storage class: Update storageClassName in model-cache/model-cache.yaml before deploying
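
A minimal sketch of the OOM mitigation above, assuming the SGLang worker flags live in the args of your chosen deploy.yaml (the 0.85 -> 0.80 values are illustrative):

# Locate the flag in the worker args
grep -n "mem-fraction-static" sglang/disagg-8gpu/deploy.yaml

# Lower it slightly (e.g. 0.85 -> 0.80), then re-apply
kubectl apply -f sglang/disagg-8gpu/deploy.yaml -n ${NAMESPACE}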

Backend-Specific Notes

SGLang

  • Uses WideEP (Wide Expert Parallel) for efficient MoE inference
  • See sglang/README.md for SGLang-specific configuration

TensorRT-LLM

  • Requires an FP4-quantized checkpoint
  • Includes GB200-specific optimizations

vLLM

  • Uses DEP (Data-Expert Parallel) with hybrid load balancing
  • See vllm/disagg/README.md for detailed setup