This directory contains Kubernetes Custom Resource Definition (CRD) templates for deploying SGLang inference graphs using the DynamoGraphDeployment resource.
`agg.yaml` — Basic aggregated deployment with a frontend and a single decode worker.

Architecture:
- Frontend: OpenAI-compatible API server
- SGLangDecodeWorker: Single worker handling both prefill and decode
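Combining the pieces described later in this README, a minimal aggregated deployment might be sketched as follows. This is a hedged outline, not the shipped template: the deployment name, image tag, and resource values are placeholders, and field names should be checked against the actual `agg.yaml`.

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: sglang-agg          # illustrative name
spec:
  services:
    Frontend:
      # OpenAI-compatible API server
    SGLangDecodeWorker:
      # One worker handles both prefill and decode
      resources:
        requests:
          gpu: "1"
      extraPodSpec:
        mainContainer:
          image: my-registry/sglang-runtime:my-tag
          args:
            - "python3"
            - "-m"
            - "dynamo.sglang.worker"
```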
`agg_router.yaml` — Aggregated deployment with KV cache routing.

Architecture:
- Frontend: OpenAI-compatible API server with router mode enabled (`--router-mode kv`)
- SGLangDecodeWorker: Single worker handling both prefill and decode
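The routed variant differs mainly in the frontend arguments. A sketch of the relevant fragment is below; the frontend entrypoint module name is an assumption (only the worker module appears elsewhere in this README), so check the shipped template for the exact command:

```yaml
services:
  Frontend:
    extraPodSpec:
      mainContainer:
        args:
          - "python3"
          - "-m"
          - "dynamo.frontend"   # entrypoint name assumed
          - "--router-mode"
          - "kv"
```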
`disagg.yaml` — High-performance deployment with separated prefill and decode workers.

Architecture:
- Frontend: HTTP API server coordinating between workers
- SGLangDecodeWorker: Specialized decode-only worker (`--disaggregation-mode decode`)
- SGLangPrefillWorker: Specialized prefill-only worker (`--disaggregation-mode prefill`)
- Communication via the NIXL transfer backend (`--disaggregation-transfer-backend nixl`)
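The flags above land on the two worker services. A sketch of the relevant fragment follows; the worker module is taken from the container configuration shown later in this README, and everything else is illustrative:

```yaml
services:
  SGLangDecodeWorker:
    extraPodSpec:
      mainContainer:
        args:
          - "python3"
          - "-m"
          - "dynamo.sglang.worker"
          - "--disaggregation-mode"
          - "decode"
          - "--disaggregation-transfer-backend"
          - "nixl"
  SGLangPrefillWorker:
    extraPodSpec:
      mainContainer:
        args:
          - "python3"
          - "-m"
          - "dynamo.sglang.worker"
          - "--disaggregation-mode"
          - "prefill"
          - "--disaggregation-transfer-backend"
          - "nixl"
```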
All templates use the DynamoGraphDeployment CRD:

```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: <deployment-name>
spec:
  services:
    <ServiceName>:
      # Service configuration
```

Resource Management:

```yaml
resources:
  requests:
    cpu: "10"
    memory: "20Gi"
    gpu: "1"
  limits:
    cpu: "10"
    memory: "20Gi"
    gpu: "1"
```

Container Configuration:
```yaml
extraPodSpec:
  mainContainer:
    image: my-registry/sglang-runtime:my-tag
    workingDir: /workspace/components/backends/sglang
    args:
      - "python3"
      - "-m"
      - "dynamo.sglang.worker"
      # Model-specific arguments
```

Before using these templates, ensure you have:
- Dynamo Cloud Platform installed - See Installing Dynamo Cloud
- Kubernetes cluster with GPU support
- Container registry access for SGLang runtime images
- HuggingFace token secret (referenced as `envFromSecret: hf-token-secret`)
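The referenced secret can be created from a manifest along these lines. The key name `HF_TOKEN` is an assumption — use whatever environment variable the runtime image expects:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
type: Opaque
stringData:
  HF_TOKEN: <your-huggingface-token>   # key name assumed
```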
Select the deployment pattern that matches your requirements:
- Use `agg.yaml` for development/testing
- Use `agg_router.yaml` for production with load balancing
- Use `disagg.yaml` for maximum performance
Edit the template to match your environment:

```yaml
# Update image registry and tag
image: your-registry/sglang-runtime:your-tag

# Configure your model
args:
  - "--model-path"
  - "your-org/your-model"
  - "--served-model-name"
  - "your-org/your-model"
```

Then deploy:

```bash
kubectl apply -f <your-template>.yaml
```

All templates use DeepSeek-R1-Distill-Llama-8B as the default model, but any SGLang argument and configuration can be used. Key parameters:
- Frontend health endpoint: `http://<frontend-service>:8000/health`
- Liveness probes: check process health every 60s
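Put together, the probe settings above would correspond to a stanza along these lines (field values illustrative; tune `initialDelaySeconds` to your model's load time):

```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 60          # check every 60s, per above
  initialDelaySeconds: 60    # raise if model loading is slow
```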
- Deployment Guide: Creating Kubernetes Deployments
- Quickstart: Deployment Quickstart
- Platform Setup: Dynamo Cloud Installation
- Examples: Deployment Examples
- Kubernetes CRDs: Custom Resources Documentation
Common issues and solutions:
- Pod fails to start: Check image registry access and HuggingFace token secret
- GPU not allocated: Verify cluster has GPU nodes and proper resource limits
- Health check failures: Review model loading logs and increase `initialDelaySeconds`
- Out of memory: Increase memory limits or reduce model batch size
For additional support, refer to the deployment troubleshooting guide.