- Assuming the user is already authenticated to an EKS cluster with 2+ H100 (p5.48xlarge) nodes.
- Values used in the `--accelerated-node-selector`, `--accelerated-node-toleration`, and `--system-node-toleration` flags are for example purposes only. Update them to match your cluster.
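To find the values that match your cluster, you can inspect node labels and taints directly with `kubectl`. This sketch uses the example `nodeGroup=gpu-worker` label from above; substitute your own label key/value:

```bash
# List nodes carrying the example GPU node-group label
kubectl get nodes -l nodeGroup=gpu-worker

# Print each matching node's name and taints, to derive the tolerations to pass
kubectl get nodes -l nodeGroup=gpu-worker \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'
```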
```bash
aicr snapshot \
--namespace aicr-validation \
--node-selector nodeGroup=gpu-worker \
--toleration dedicated=worker-workload:NoSchedule \
--toleration dedicated=worker-workload:NoExecute \
--output snapshot.yaml
```

```bash
aicr recipe \
--service eks \
--accelerator h100 \
--intent inference \
--os ubuntu \
--platform dynamo \
--output recipe.yaml
```

```bash
aicr validate \
--recipe recipe.yaml \
--snapshot snapshot.yaml \
--no-cluster \
--phase deployment \
--output dry-run.json
```

```bash
aicr bundle \
--recipe recipe.yaml \
--accelerated-node-selector nodeGroup=gpu-worker \
--accelerated-node-toleration dedicated=worker-workload:NoSchedule \
--accelerated-node-toleration dedicated=worker-workload:NoExecute \
--system-node-selector nodeGroup=system-worker \
--system-node-toleration dedicated=system-workload:NoSchedule \
--system-node-toleration dedicated=system-workload:NoExecute \
--output bundle
```

Both options allow comma-separated values to supply multiple values. See the bundle section for more information.
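Assuming the comma-separated form applies to the repeated toleration flags as described, the bundle invocation above could be written more compactly (same example values; adjust to your cluster):

```bash
aicr bundle \
--recipe recipe.yaml \
--accelerated-node-selector nodeGroup=gpu-worker \
--accelerated-node-toleration dedicated=worker-workload:NoSchedule,dedicated=worker-workload:NoExecute \
--system-node-selector nodeGroup=system-worker \
--system-node-toleration dedicated=system-workload:NoSchedule,dedicated=system-workload:NoExecute \
--output bundle
```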
```bash
cd ./bundle && chmod +x deploy.sh && ./deploy.sh
```

```bash
aicr validate \
--recipe recipe.yaml \
--toleration dedicated=worker-workload:NoSchedule \
--toleration dedicated=worker-workload:NoExecute \
--phase all \
--output report.json
```

Deploy an inference serving graph using the Dynamo platform:
```bash
# Deploy the vLLM aggregation workload (includes KAI queue + DynamoGraphDeployment)
kubectl apply -f demos/workloads/inference/vllm-agg.yaml

# Monitor the deployment
kubectl get dynamographdeployments -n dynamo-workload
kubectl get pods -n dynamo-workload -o wide -w

# Verify the inference gateway routes to the workload
kubectl get gateway inference-gateway -n kgateway-system
kubectl get inferencepool -n dynamo-workload
```

Once the workload is running, start a local chat server:
```bash
# Start the chat server (port-forwards to the inference gateway)
bash demos/workloads/inference/chat-server.sh

# Open the chat UI in your browser
open demos/workloads/inference/chat.html
```

- Bundle deployed with 16 components (inference recipe)
- CNCF conformance: 9/9 requirements pass
  - DRA Support, Gang Scheduling, Secure GPU Access, Accelerator Metrics, AI Service Metrics, Inference Gateway, Robust Controller (Dynamo), Pod Autoscaling (HPA), Cluster Autoscaling
- Dynamo inference workload serving requests via inference gateway
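As an additional end-to-end check, you can send a request through the inference gateway yourself instead of using the chat UI. This is a sketch only: the service name, local port, request path, and model name are assumptions that depend on the chat-server script and the deployed workload, so adjust them to match your setup:

```bash
# Port-forward the inference gateway locally (service name and ports are assumptions)
kubectl port-forward -n kgateway-system svc/inference-gateway 8000:80 &

# Send an OpenAI-compatible chat completion request (path and model name are assumptions)
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "my-model", "messages": [{"role": "user", "content": "Hello"}]}'
```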