Deploy and serve LLMs with NVIDIA Dynamo on Kubernetes (on-premises) and Amazon EKS.
Implemented: GPU Operator, Monitoring (Prometheus + Grafana + Dynamo/DCGM/KVBM/Benchmark Pareto dashboards), Dynamo Platform (Operator, etcd, NATS), Dynamo vLLM (aggregated/disaggregated, KV Router, KVBM), AIPerf Benchmark (concurrency sweep, multi-turn, sequence distribution, prefix cache → Pushgateway + Pareto dashboard), AIConfigurator (Quick Estimate + SLA-driven deploy via DGDR).
```text
                +-----------------------+
                |          CLI          |
                | ./cli nvidia-platform |
                +----------+------------+
                           |
  +--------+--------+------+-----+-----------+-----------+
  |        |        |            |           |           |
GPU Op  Monitor  Dynamo       Dynamo     Benchmark   AIConfig
DCGM    Prom,    Platform     vLLM       AIPerf,     Quick Est,
        Grafana  etcd,        agg/       Pushgateway SLA Deploy
                 Operator     disagg     Pareto
```
| Component | Description | CLI Command |
|---|---|---|
| GPU Operator | NVIDIA GPU resource management | ./cli nvidia-platform gpu-operator install |
| Monitoring | Prometheus + Grafana + Dynamo/DCGM/KVBM/Benchmark dashboards | ./cli nvidia-platform monitoring install |
| Dynamo Platform | CRDs, Operator, etcd, NATS, Grove, KAI Scheduler | ./cli nvidia-platform dynamo-platform install |
| Dynamo vLLM Serving | vLLM model deployment (agg/disagg, KV Router, KVBM) | ./cli nvidia-platform dynamo-vllm install |
| AIPerf Benchmark | Concurrency sweep, multi-turn, seq distribution, prefix cache → Pushgateway + Pareto dashboard | ./cli nvidia-platform benchmark install |
| AIConfigurator | TP/PP recommendation (Quick Estimate) + SLA-driven profile + plan + deploy (DGDR) | ./cli nvidia-platform aiconfigurator install |
```bash
# === Standard Path ===

# 0. (K8s only) Install Ingress controller if not present
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx --force-update
helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx --create-namespace --set controller.service.type=NodePort --wait
# EKS: AWS Load Balancer Controller should be pre-installed (EKS addon or Terraform)

# 1. Monitoring (Prometheus + Grafana) — install first so GPU Operator
#    auto-detects ServiceMonitor CRDs and enables DCGM metrics scraping.
#    Also auto-detects the Ingress controller to expose Grafana/Prometheus.
./cli nvidia-platform monitoring install

# 2. GPU Operator (detects Prometheus → creates DCGM ServiceMonitor)
./cli nvidia-platform gpu-operator install

# 3. Dynamo Platform (auto-detects Prometheus and sets prometheusEndpoint)
./cli nvidia-platform dynamo-platform install

# 4. Deploy a model
./cli nvidia-platform dynamo-vllm install
```
Environment: `PLATFORM=k8s` in `.env`
| Requirement | Details |
|---|---|
| Kubernetes v1.33+ | kubeadm, kubelet, kubectl |
| NVIDIA Driver 580+ | Pre-installed on worker nodes |
| NVIDIA Container Toolkit | CDI configured (nvidia-ctk cdi generate) |
| Fabric Manager | Required for H100 SXM / NVSwitch GPUs |
| StorageClass (RWX) | For shared model cache (e.g., NFS, CephFS) |
The following are automatically installed/detected during dynamo-platform install:
| Component | Condition | Action |
|---|---|---|
| local-path StorageClass | Not found | Auto-install local-path-provisioner |
| ingress-nginx | Not found (K8s mode only) | Prompt to install |
| Prometheus (prometheusEndpoint) | Detected if monitoring installed | Auto-configure |
| GPU Operator | Detected | Status check only (install via gpu-operator install) |
The CLI uses two different StorageClasses depending on purpose:
| StorageClass | Purpose | Access Mode | Used By |
|---|---|---|---|
| nfs (config: platform.k8s.storageClass) | Model cache PVC | ReadWriteMany | dynamo-vllm (model download) |
| local-path (default for etcd/NATS) | etcd, NATS persistence | ReadWriteOnce | dynamo-platform |
Note: `config.json`'s `platform.k8s.storageClass` is used for the model cache PVC. The etcd/NATS StorageClass defaults to `local-path` in the Dynamo Platform Helm values.
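For illustration, a model-cache PVC bound to the RWX StorageClass might look like the sketch below. The PVC name `dynamo-model-cache` matches the default used elsewhere in this document; the `nfs` class name and 200Gi size are example values — substitute your `platform.k8s.storageClass` setting.

```yaml
# Sketch of a model-cache PVC (RWX); class name and size are example values
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dynamo-model-cache
  namespace: dynamo-system
spec:
  accessModes:
    - ReadWriteMany          # shared across vLLM worker pods
  storageClassName: nfs      # from config.json → platform.k8s.storageClass
  resources:
    requests:
      storage: 200Gi
```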
Monitoring (Ingress) — Grafana and Prometheus are exposed via Ingress if an Ingress controller is installed:

```bash
kubectl get svc -n ingress-nginx           # Find NodePort
http://<node-ip>:<node-port>/grafana       # Grafana
http://<node-ip>:<node-port>/prometheus    # Prometheus
```

vLLM API (port-forward) — each model is accessed via its ClusterIP service:

```bash
# In-cluster URL (for other services / LiteLLM):
http://<deployment>-frontend.dynamo-system:8000/v1

# Quick test via port-forward:
kubectl port-forward svc/<deployment>-frontend 8000:8000 -n dynamo-system --address 0.0.0.0 &
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<served-model-name>", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 50, "stream": false}'
```

Remote access (SSH tunnel):

```bash
ssh -N -L <local-port>:<node-ip>:<node-port> <user>@<remote-host>
# Grafana:    http://localhost:<local-port>/grafana
# Prometheus: http://localhost:<local-port>/prometheus
```

Environment: `PLATFORM=eks` in `.env`
| Requirement | Details |
|---|---|
| EKS Cluster v1.33+ | GPU node groups (g6e, p5, p4d, etc.) |
| NVIDIA Driver 580+ | Pre-installed by EKS GPU AMI |
| HuggingFace Token | For gated model access (K8s Secret) |
| Component | Condition | Action |
|---|---|---|
| GPU Operator | gpu-operator install | Helm install (GFD + DCGM + GDS) |
| EFS CSI Driver | Not found | Auto-install via EKS addon |
| EFS StorageClass | Not found | Prompt to select/create EFS + StorageClass |
| Prometheus (prometheusEndpoint) | Detected if monitoring installed | Auto-configure |
The Dynamo Platform installer auto-detects and configures EFS. Manual setup:

```bash
# Create EFS
aws efs create-file-system --region $REGION --tags Key=Name,Value=dynamo-efs

# Install the EFS CSI Driver
aws eks create-addon --cluster-name $EKS_CLUSTER_NAME --addon-name aws-efs-csi-driver
```

| StorageClass | Purpose | Access Mode | Backend |
|---|---|---|---|
| efs | Model cache PVC | ReadWriteMany | Amazon EFS |
Monitoring (ALB Ingress) — Grafana and Prometheus are exposed via ALB if the AWS Load Balancer Controller is installed:

```bash
kubectl get ingress -n monitoring
# ADDRESS: k8s-dynamomo-xxxx.us-east-1.elb.amazonaws.com
http://<alb-url>/grafana       # Grafana
http://<alb-url>/prometheus    # Prometheus
```

vLLM API (in-cluster service) — each model is accessed via ClusterIP:

```bash
# In-cluster URL (for LiteLLM / other services):
http://<deployment>-frontend.dynamo-system:8000/v1

# Port-forward for development:
kubectl port-forward svc/<deployment>-frontend 8000:8000 -n dynamo-system &
```

Production multi-model routing — use a LiteLLM proxy to expose multiple models via a single endpoint with API key management and rate limiting.
The monitoring component deploys kube-prometheus-stack (Prometheus + Grafana) with a pre-configured Dynamo Dashboard. When monitoring is installed before Dynamo Platform, the prometheusEndpoint is automatically detected and configured.
```text
Prometheus ──► Scrapes metrics from:
 ├── Dynamo Frontend (:8000/metrics) ── dynamo_frontend_* metrics
 ├── Dynamo Workers (:9090/metrics)  ── dynamo_component_* metrics (Operator auto-configures DYN_SYSTEM_PORT)
 ├── DCGM Exporter (:9400/metrics)   ── GPU utilization, power, memory
 ├── Node Exporter                   ── CPU, memory, disk, network
 └── kube-state-metrics              ── K8s object states

Grafana ──► Queries Prometheus ──► Dynamo Dashboard
 ├── Frontend Requests/Sec
 ├── Time to First Token (TTFT)
 ├── Inter-Token Latency (ITL)
 ├── Request Duration
 ├── Input/Output Sequence Length
 ├── DCGM GPU Utilization
 ├── Node CPU & Load
 ├── Container CPU/Memory per Pod
 └── (Custom dashboards via ConfigMap)
```
| Integration | How It Works |
|---|---|
| Dynamo Platform → Prometheus | --set prometheusEndpoint=... enables Dynamo Operator to auto-create PodMonitors |
| Dynamo Workers → Prometheus | Dynamo Operator auto-sets DYN_SYSTEM_PORT=9090, creates health probes and /metrics endpoint |
| DCGM → Prometheus | GPU Operator's DCGM Exporter + ServiceMonitor |
| Grafana Dashboard | ConfigMap with grafana_dashboard: "1" label, auto-discovered by sidecar |
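A custom dashboard can be added the same way the bundled ones are loaded. A minimal sketch of such a ConfigMap (the name and JSON body here are placeholders; only the `grafana_dashboard: "1"` label is what the sidecar keys on):

```yaml
# Sketch: a dashboard ConfigMap the Grafana sidecar auto-discovers
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-custom-dashboard        # placeholder name
  namespace: monitoring
  labels:
    grafana_dashboard: "1"         # sidecar discovery label
data:
  my-dashboard.json: |
    { "title": "My Custom Dashboard", "panels": [] }
```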
No interactive prompts — all settings are from config.json (platform.monitoring):
| Setting | Default | Customizable via |
|---|---|---|
| Grafana password | admin | config.json → grafanaAdminPassword |
| Retention | 7d | config.json → retention |
| Alertmanager | false | config.json → alertmanagerEnabled |
| Ingress | Auto-detect | Enabled if Ingress controller exists |
| ALB annotations (EKS) | Auto-added | When PLATFORM=eks and Ingress detected |
If Ingress is enabled during installation, Grafana and Prometheus are accessible via HTTP path routing through the Ingress controller. No port-forward needed.
```bash
# Find the Ingress controller's NodePort (on-prem / Vagrant)
kubectl get svc -n ingress-nginx
# Look for PORT(S): 80:<NODE_PORT>/TCP

# Find any node IP
kubectl get nodes -o wide
# Look for INTERNAL-IP
```

Access URLs:

| Service | URL |
|---|---|
| Grafana | http://<node-ip>:<node-port>/grafana |
| Prometheus | http://<node-ip>:<node-port>/prometheus |

Remote access (SSH tunnel):
If the K8s cluster is behind a remote host (e.g., Vagrant VMs on a remote server):

```bash
# From your local machine (a single SSH tunnel is enough)
ssh -N -L <local-port>:<node-ip>:<node-port> <user>@<remote-host>

# Then open in a browser
# Grafana:    http://localhost:<local-port>/grafana
# Prometheus: http://localhost:<local-port>/prometheus
```

When `PLATFORM=eks` and the AWS Load Balancer Controller is detected, ALB annotations are added automatically. Grafana and Prometheus share a single ALB:
```bash
kubectl get ingress -n monitoring
# ADDRESS: k8s-dynamomo-xxxx.us-east-1.elb.amazonaws.com

# Access
http://<alb-url>/grafana
http://<alb-url>/prometheus
```

Alternatively, without Ingress (port-forward):

```bash
# Inside the cluster (or via SSH to the node)
kubectl port-forward svc/prometheus-grafana 3000:80 -n monitoring --address 0.0.0.0 &
kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n monitoring --address 0.0.0.0 &
# If accessing from a remote machine, set up SSH tunnels accordingly
```

| Field | Value |
|---|---|
| User | admin |
| Password | Configured during install (default: admin) |
To retrieve the current password:

```bash
kubectl get secret prometheus-grafana -n monitoring \
  -o jsonpath="{.data.admin-password}" | base64 --decode; echo
```

| Dashboard | Description | Source |
|---|---|---|
| Dynamo Dashboard | Frontend/Worker metrics, request rates, latencies | monitoring/dashboards/dynamo-dashboard.json |
| DCGM GPU Monitoring | GPU utilization, memory, temperature, power | monitoring/dashboards/dcgm-metrics.json |
| KVBM KV Cache | KV cache usage, offloading metrics | monitoring/dashboards/kvbm.json |
| Benchmark Pareto | Benchmark comparison (TPS/GPU, TTFT, ITL vs concurrency) | monitoring/dashboards/benchmark-dashboard.json |
Dashboards are auto-loaded via Grafana sidecar (ConfigMaps with grafana_dashboard: "1" label). To refresh dashboard JSON (e.g. after editing benchmark-dashboard.json), re-run ./cli nvidia-platform monitoring install or update the ConfigMap and restart the Grafana deployment.
Prometheus Pushgateway is installed alongside the monitoring stack to receive benchmark results. After each benchmark run, metrics are automatically pushed via kubectl port-forward + HTTP POST. To reset benchmark data in Grafana, delete Pushgateway metrics (see Managing Benchmark Data under AIPerf Benchmark) or re-install monitoring (PVCs are optional to delete for a full reset).
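For illustration, a push is a plain HTTP request carrying Prometheus exposition-format lines. The sketch below builds one such line with a zero-padded concurrency label; the metric name `aiperf_output_tps` and the job name `qwen3-agg` are hypothetical examples, not the CLI's actual names.

```shell
# Build an exposition-format sample with a zero-padded concurrency label
# (metric and job names are hypothetical examples)
concurrency=8
payload=$(printf 'aiperf_output_tps{concurrency="%03d"} 1574.83' "$concurrency")
echo "$payload"

# Push it through the port-forward (requires a reachable Pushgateway):
#   echo "$payload" | curl --data-binary @- http://localhost:19091/metrics/job/qwen3-agg
```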
```bash
./cli nvidia-platform benchmark install
```

Interactive prompts:
- Select deployed DynamoGraphDeployment
- Choose benchmark mode (Concurrency Sweep, Multi-Turn, Seq Distribution, Prefix Cache)
- Set parameters (ISL, OSL, concurrency levels, etc.)
| Mode | Description | Use Case |
|---|---|---|
| Concurrency Sweep | Throughput vs latency at different concurrency levels | Baseline performance |
| Multi-Turn | Multi-turn conversations with session affinity | KV Router cache hit effect |
| Sequence Distribution | Mixed ISL/OSL workloads (QA + summarization) | Real-world traffic simulation |
| Prefix Cache | Synthetic shared-prefix workload (no trace file needed) | KV cache hit rate testing |
Prefix Cache: request_count is set automatically to concurrency × 4 per level (aiperf requires request_count >= concurrency). No prompt for request count.
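The request-count rule above can be checked with a quick sketch:

```shell
# request_count = concurrency x 4 at each sweep level
for c in 1 8 16 32; do
  echo "concurrency=$c request_count=$((c * 4))"
done
```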
After benchmark completion:
- Results: PVC `benchmark-results` in `dynamo-system` (per-benchmark dirs `c1`, `c8`, …) and a local copy under `components/nvidia-platform/benchmark/results/<benchmark-name>/`.
- Metrics: automatically pushed to Prometheus Pushgateway (scraped by Prometheus).
- Grafana: Open Benchmark Pareto dashboard, select benchmarks via the Benchmark dropdown.
Grafana charts show X = concurrency (numeric order 1→8→16→32), Y = metric, each benchmark = one line (points + line):
- TPS/GPU vs Concurrency
- TPS/User vs Concurrency
- TTFT P50 / P99 vs Concurrency
- ITL P50 vs Concurrency
- Request Latency P50 vs Concurrency
- GPU Efficiency: TPS/GPU vs TPS/User (Pareto) — X = TPS/User, Y = TPS/GPU, scatter plot (points only, no lines). Top-right = optimal.
Metrics are pushed with concurrency labels zero-padded (e.g. 001, 008, 016) so Grafana sorts points in numeric order; directories are processed in concurrency order before push.
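The effect of the padding is easy to demonstrate: with zero-padded labels, lexicographic sort matches numeric order.

```shell
# Zero-padded labels keep lexicographic sort equal to numeric order
printf '%03d\n' 1 8 16 32 | sort
```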
Run the same benchmark mode on different deployment configs to generate Pareto comparison data:
```bash
# 1. Deploy agg baseline   → run sweep → results as "qwen3-agg"
# 2. Deploy agg + kvrouter → run sweep → results as "qwen3-kvrouter"
# 3. Deploy disagg + kvbm  → run sweep → results as "qwen3-disagg"
```

Then select all three in the Benchmark dropdown to see them on the same graph.
```bash
# Delete all Pushgateway data (via port-forward)
kubectl port-forward svc/pushgateway-prometheus-pushgateway 19091:9091 -n monitoring &
sleep 2
curl -s http://localhost:19091/metrics | grep -oP 'job="[^"]*"' | sort -u | sed 's/job="//;s/"//' | while read job; do
  curl -X DELETE "http://localhost:19091/metrics/job/$job"
done
pkill -f "port-forward.*pushgateway"

# Uninstall benchmark resources (Jobs, PVC)
./cli nvidia-platform benchmark uninstall
```

| Mode | Description | Workers |
|---|---|---|
| Aggregated (agg) | Single worker handles prefill + decode | VllmWorker |
| Disaggregated (disagg) | Separate prefill and decode workers with NIXL KV transfer | VllmPrefillWorker + VllmDecodeWorker |
| Parameter | Description | Agg | Disagg |
|---|---|---|---|
| Tensor Parallel (TP) | Split model across GPUs | Single setting | Prefill TP + Decode TP (separate) |
| Pipeline Parallel (PP) | Split layers across GPUs | Single setting | Prefill PP + Decode PP (separate) |
| Expert Parallel (EP) | MoE expert distribution | --enable-expert-parallel | Per worker |
| Replicas | Pod-level scaling (Dynamo Frontend routes) | Worker replicas | Prefill replicas + Decode replicas |
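The GPU footprint of a configuration follows directly from these parameters: each worker uses TP × PP GPUs, and the deployment total sums across worker groups. A quick sketch using an example disagg layout (prefill tp1pp1 × 2 workers, decode tp2pp1 × 3 workers — illustrative numbers):

```shell
# GPUs per worker = TP x PP; deployment total sums across worker groups
prefill_tp=1; prefill_pp=1; prefill_workers=2
decode_tp=2;  decode_pp=1;  decode_workers=3
total=$(( prefill_tp * prefill_pp * prefill_workers + decode_tp * decode_pp * decode_workers ))
echo "total GPUs: $total"   # 1*1*2 + 2*1*3 = 8
```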
Routes requests to workers with cached KV blocks for better TTFT.
| Setting | ENV / Arg | Default |
|---|---|---|
| Enable | DYN_ROUTER_MODE=kv | Disabled |
| Temperature | DYN_ROUTER_TEMPERATURE | 0.5 |
| Overlap Weight | DYN_KV_OVERLAP_SCORE_WEIGHT | 1.0 |
| KV Events | --kv-events-config | Auto when router enabled |
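As a sketch, enabling the KV Router amounts to setting these variables on the serving pods; the exact placement inside the DynamoGraphDeployment spec is illustrative here, and the values shown are the defaults from the table above:

```yaml
# Illustrative env fragment enabling the KV Router (placement is a sketch)
env:
  - name: DYN_ROUTER_MODE
    value: "kv"
  - name: DYN_ROUTER_TEMPERATURE
    value: "0.5"
  - name: DYN_KV_OVERLAP_SCORE_WEIGHT
    value: "1.0"
```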
Offload KV cache from GPU to CPU/Disk for larger effective context.
GPU (G1) → CPU Pinned Memory (G2) → Local SSD/NVMe (G3)
| Setting | ENV | Description |
|---|---|---|
| CPU Cache | DYN_KVBM_CPU_CACHE_GB | CPU pinned memory size (GB) |
| Disk Cache | DYN_KVBM_DISK_CACHE_GB | SSD cache size (GB) |
| Disk Directory | DYN_KVBM_DISK_CACHE_DIR | Cache path (default: /tmp) |
| Disk Offload Filter | DYN_KVBM_DISABLE_DISK_OFFLOAD_FILTER | Frequency filter for SSD lifespan (default: enabled) |
| Connector | --connector kvbm (agg) / --connector kvbm nixl (disagg prefill) | Required |
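A sketch of a KVBM env fragment follows; the sizes are example values and the placement in the worker spec is illustrative, not the CLI's generated output:

```yaml
# Illustrative KVBM settings on a worker container (example sizes)
env:
  - name: DYN_KVBM_CPU_CACHE_GB
    value: "16"       # G2: CPU pinned memory tier
  - name: DYN_KVBM_DISK_CACHE_GB
    value: "64"       # G3: local SSD/NVMe tier
  - name: DYN_KVBM_DISK_CACHE_DIR
    value: "/tmp"
# plus the worker arg: --connector kvbm (agg) or --connector kvbm nixl (disagg prefill)
```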
Note: In disaggregated mode, KVBM is applied to Prefill workers only. Decode workers use --connector nixl for KV transfer.
Disk Offload Frequency Filter: Only offloads blocks with frequency >= 2 to disk, protecting SSD lifespan. Frequency doubles on cache hit, decays over time (600s interval).
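A toy model of the filter's behavior, as described above (the halving on decay is an assumption for illustration; the source only states that frequency decays on a 600s interval):

```shell
# Toy model of the disk offload frequency filter (decay-by-halving is assumed)
freq=1                 # a new block starts below the offload threshold
freq=$((freq * 2))     # a cache hit doubles the frequency -> 2
[ "$freq" -ge 2 ] && echo "eligible-for-disk-offload"
freq=$((freq / 2))     # one 600s decay interval later -> 1
[ "$freq" -ge 2 ] || echo "below-threshold-again"
```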
- Models pre-downloaded to PVC using `huggingface_hub.snapshot_download(local_dir=...)`
- Fixed path: `/opt/models/<org>/<model-name>/`
- Auto-detection: checks whether `*.safetensors` files exist before downloading
- vLLM uses `--model /opt/models/<org>/<model>` (local path) + `--served-model-name <HF ID>`
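The auto-detection rule above can be sketched as follows (paths follow the fixed layout; the actual download path uses `huggingface_hub.snapshot_download`, shown here only as a comment):

```shell
# Sketch of the cache check before download
MODEL_ID="Qwen/Qwen3-30B-A3B-Instruct-2507-FP8"
MODEL_DIR="/opt/models/$MODEL_ID"
if ls "$MODEL_DIR"/*.safetensors >/dev/null 2>&1; then
  echo "cached: $MODEL_DIR"
else
  # would run: snapshot_download(repo_id=MODEL_ID, local_dir=MODEL_DIR)
  echo "download needed: $MODEL_ID"
fi
```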
Interactive prompts (only essential config):
```text
? Enter model name: (Qwen/Qwen3-30B-A3B-Instruct-2507-FP8)
? Select deployment mode: Aggregated / Disaggregated
? Tensor Parallel Size (TP): 1
? Enable KV Router? No
? Enable KV Cache Offloading (KVBM)? No
? Worker replicas: 1
? Additional vLLM args: --gpu-memory-utilization 0.90 --block-size 128
? Deployment name: (qwen3-30b-a3b-instruct-25)
? What would you like to do? Deploy now / Review first / Save only
```
Auto-configured (no prompts, from config.json or auto-detection):
| Setting | Source |
|---|---|
| vLLM image tag | config.json → dynamoPlatform.releaseVersion |
| Structured logging | Auto-enable if monitoring installed |
| KV Router temperature | 0.5 (Dynamo default) |
| KV overlap score weight | 1.0 (Dynamo default) |
| KVBM disk directory | /tmp (K8s) / /mnt/nvme/kvbm_cache (EKS) |
| KVBM disk offload filter | true (default) |
Automatically recommend optimal parallelization (TP/PP) and deployment configuration using NVIDIA AI Configurator simulation. Compares aggregated vs disaggregated serving and generates Pareto frontiers.
```bash
./cli nvidia-platform aiconfigurator install
```

| Mode | Description | Duration | GPU Required | Deploys |
|---|---|---|---|---|
| Quick Estimate | aiconfigurator cli default — agg vs disagg Pareto comparison | ~25s | No | No |
| SLA-Driven Deploy | profile (AIC or real engine) + plan + deploy via DGDR | AIC ~5min / Real 2-4h | No (AIC) / Yes (Real) | Yes |
Uses `aiconfigurator cli default` via a dedicated pod to:

- Load model architecture from `model_configs/` (pre-cached HuggingFace configs)
- Sweep TP=1,2,4,8 for both agg and disagg modes
- Display a Pareto frontier (ASCII art) + top configurations table
- Recommend the best agg and disagg configs under SLA constraints
Example output:
```text
Best Experiment Chosen: agg at 1604.74 tokens/s/gpu (disagg 0.72x better)

agg Top Configurations:
| Rank | tokens/s/gpu | TTFT  | parallel | replicas |
| 1    | 1604.74      | 84.42 | tp4pp1   | 2        |
| 2    | 1574.83      | 86.39 | tp2pp1   | 4        |

disagg Top Configurations:
| Rank | tokens/s/gpu | TTFT  | (p)parallel | (d)parallel | (p)workers | (d)workers |
| 1    | 1149.49      | 45.10 | tp1pp1      | tp2pp1      | 2          | 3          |
```
Note: AIConfigurator uses FP8 GEMM + FP8 KV cache by default on H100/H200 (hardware-optimal). Quantization is determined by the system/backend combination, not the model name.
Creates a DynamoGraphDeploymentRequest (DGDR) that the Dynamo Operator processes automatically.
| Method | When to Use | Duration | GPU |
|---|---|---|---|
| AIC Simulation | Model in AIC support list | ~5 min | No |
| Real Engine Profiling | Any HuggingFace model | 2-4 hours | Yes (via DGD) |
- AIC: Select GPU system → backend → model from the supported list. Fast simulation, no GPU required.
- Real: Select backend → enter a HuggingFace model ID (e.g., `Qwen/Qwen3-30B-A3B-Instruct-2507-FP8`). The profiler orchestrates temporary DGDs to benchmark with AIPerf.
```text
DGDR Created → Pending → Profiling → [complete] → Deploying → Ready
                             │
         AIC simulation or   │
       Real engine profiling │
                             ▼
              DGD created (independent of DGDR)
              + planner-profile-data ConfigMap
```
```text
? Select mode: SLA-Driven Deploy
? Profiling method: AIC Simulation / Real Engine Profiling
? DGDR name: qwen3-30b-sla
? Auto-deploy after profiling with SLA-based planner? Yes
? Min GPUs per engine (0 = auto): 0
? Max GPUs per engine (0 = auto): 0
```
| Setting | Value | Reason |
|---|---|---|
| Model cache PVC | dynamo-model-cache (auto-create if missing) | Same PVC as dynamo-vllm |
| PVC mount path | /opt/models | Matches dynamo-vllm mount path |
| Model path in PVC | <model_id> (e.g., Qwen/Qwen3-...) | mountPath/pvcPath = /opt/models/<model> |
| Discovery backend | etcd (via DGD annotation) | Required for KVBM handshake stability |
| SLA Planner min endpoints | 1 | Minimum 1 prefill + 1 decode replica |
| SLA Planner adjustment interval | 60s | Scaling check frequency |
| Profiling job resources | 2-4 CPU, 8-16Gi memory | Profiler is orchestrator only, no GPU needed |
| Profiling job tolerations | nvidia.com/gpu: NoSchedule | Schedule on GPU-tainted nodes |
- Immutable: once profiling starts, the spec cannot be changed. Create a new DGDR to change the config.
- DGD independence: the DGD is NOT owned by the DGDR. Deleting the DGDR does not delete the DGD (protects serving traffic).
- ConfigMap persistence: `profiling-output-<dgdr>` and `planner-profile-data` ConfigMaps survive DGDR deletion.
- Re-deploy from ConfigMap: extract the DGD YAML from the ConfigMap and `kubectl apply` it without re-profiling:
```bash
kubectl get cm profiling-output-<dgdr-name> -n dynamo-system \
  -o jsonpath='{.data.config_with_planner\.yaml}' > my-dgd.yaml
kubectl apply -f my-dgd.yaml -n dynamo-system
```

| State | Description |
|---|---|
| Pending | Spec validated, preparing profiling job |
| Profiling | Profiling job running |
| Deploying | autoApply=true, creating DGD |
| Ready | DGD deployed successfully (or spec generated if autoApply=false) |
| DeploymentDeleted | DGD was manually deleted; create new DGDR to redeploy |
| Failed | Error at any stage |
| GPU System | vLLM | TRT-LLM | SGLang |
|---|---|---|---|
| H100 SXM | 0.12.0 | 1.0.0rc3, 1.2.0rc5 | 0.5.6.post2 |
| H200 SXM | 0.12.0 | 1.0.0rc3, 1.2.0rc5 | 0.5.6.post2 |
| A100 SXM | 0.12.0 | 1.0.0 | — |
| B200 SXM | — | 1.0.0rc3, 1.2.0rc5 | 0.5.6.post2 |
| GB200 SXM | — | 1.0.0rc3, 1.2.0rc5 | — |
| L40S | — | — | — |
Model list is dynamically retrieved from the model_configs/ directory inside the aiconfigurator package. Supports 22+ models including Qwen, Llama, Mixtral, DeepSeek, Nemotron families. FP8 quantized variants are also available (e.g., Qwen/Qwen3-32B-FP8).
For Real Engine Profiling, any HuggingFace model ID can be used regardless of the AIC support list.
```bash
# Port-forward the frontend service
kubectl port-forward svc/<deployment>-frontend 8000:8000 -n dynamo-system --address 0.0.0.0 &

# Auto-detect the model name
export MODEL=$(curl -s localhost:8000/v1/models | python3 -c "import sys,json; d=json.load(sys.stdin)['data']; print(d[0]['id'] if d else 'NONE')")
echo "Model: $MODEL"

# Chat completion
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"$MODEL\",
    \"messages\": [{\"role\": \"user\", \"content\": \"Hello! Who are you?\"}],
    \"max_tokens\": 100,
    \"stream\": false
  }"
```

| Issue | Details | Workaround |
|---|---|---|
| CRD chart vs Platform chart version mismatch | CRD chart is v0.9.0, Platform chart is v0.9.0-post1. The -postN suffix applies only to the platform chart. | CLI strips the -postN suffix when fetching the CRD chart. |
| DynamoWorkerMetadata CRD missing | dynamo-crds-0.9.0.tgz does not include the DynamoWorkerMetadata CRD, which the 0.9.0-post1 operator requires. | CLI applies the bundled CRD from crds/nvidia.com_dynamoworkermetadatas.yaml after Helm install. |
| K8s-native discovery + KVBM handshake | discoveryBackend: kubernetes causes KVBM handshake failures in multi-replica disagg deployments. | Platform defaults to discoveryBackend: etcd. The DGDR template includes the nvidia.com/dynamo-discovery-backend: etcd annotation. |

| Issue | Details | Workaround |
|---|---|---|
| Too many open files (os error 24) | Container soft ulimit is 1024 (hard: 524288) despite k3s LimitNOFILE=1048576. Frontend fails under high concurrency. | Not an issue on EKS (Amazon Linux default nofile is 65536+). For Vagrant, configure containerd default ulimits. |
| Long vLLM warmup | DeepGEMM warmup can take 7+ minutes for large models (e.g., Qwen3-30B). Frontend shows KVBM handshake retries during this time. | Normal behavior. Frontend will complete handshake once all workers finish warmup. |
- GPU Operator Installer - `./cli nvidia-platform gpu-operator install`
- Dynamo Platform Installer - `./cli nvidia-platform dynamo-platform install` (CRDs, Operator, etcd, NATS, Grove, KAI Scheduler)
- Model PVC - K8s (NFS) - host NFS server + nfs-subdir-external-provisioner + ReadWriteMany PVC
- Model PVC - AWS (EFS) - EFS CSI driver + StorageClass auto-setup
- vLLM Multi-node Aggregated Serving - agg mode with replicas, KV Router, model pre-download
- vLLM KV Cache Routing - `DYN_ROUTER_MODE=kv` with temperature and overlap score weight
- vLLM KV Cache Offloading (G1-G3) - CPU + disk offloading with frequency filter for SSD lifespan
- vLLM Multi-node Disaggregated Serving - separate Prefill/Decode workers with independent TP/PP, NIXL connector, KVBM on Prefill only
- Expert Parallel (EP) - `--enable-expert-parallel` for MoE models (DeepSeek-R1, Mixtral, etc.)
- EKS Mode Support - ALB Ingress, EFS, EKS tolerations, instance-family nodeSelector, LiteLLM integration
- Monitoring (Prometheus + Grafana + Dynamo Dashboard) - kube-prometheus-stack, PodMonitor auto-detection, DCGM ServiceMonitor, Grafana Dynamo Dashboard, Ingress support
- Ingress Support - auto-detect Ingress controller, expose Grafana/Prometheus/vLLM API via path routing (no port-forward needed)
- AIPerf Benchmark - K8s Job-based concurrency sweep, multi-turn, seq distribution, prefix cache, results extraction
- Benchmark Pareto Dashboard - Pushgateway + Grafana XY chart, multi-benchmark comparison, auto-push after benchmark
- AIConfigurator - Quick Estimate (TP/PP recommendation) + SLA-Driven Deploy (DGDR auto-profile + plan + deploy)
- Log Aggregation (Loki + Alloy) - structured log collection with Grafana Loki for DynamoGraphDeployments
- Distributed Tracing - OpenTelemetry integration with Tempo for request tracing across Frontend/Workers
- TRT-LLM Backend - TensorRT-LLM serving support (`dynamo-trtllm` component)
- Multi-modal Model Support