# AICR Critical User Journey (CUJ) 2: EKS Inference

## Assumptions

* The user is already authenticated to an EKS cluster with 2+ H100 (p5.48xlarge) nodes.
* The values used in the `--accelerated-node-selector`, `--accelerated-node-toleration`, and `--system-node-toleration` flags are for example purposes only; update them to match your cluster.

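Before starting, you can confirm that the example label and taint values actually exist on your nodes. The `nodeGroup=gpu-worker` label and `dedicated=worker-workload` taint below come from the examples in this document; `<gpu-node-name>` is a placeholder for one of your node names:

```shell
# List nodes carrying the GPU node-group label used throughout the examples
kubectl get nodes -l nodeGroup=gpu-worker

# Inspect taints on one of those nodes to confirm the toleration values
kubectl describe node <gpu-node-name> | grep -A3 Taints
```
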
## Snapshot

```shell
aicr snapshot \
  --namespace aicr-validation \
  --node-selector nodeGroup=gpu-worker \
  --toleration dedicated=worker-workload:NoSchedule \
  --toleration dedicated=worker-workload:NoExecute \
  --output snapshot.yaml
```

## Generate Recipe

```shell
aicr recipe \
  --service eks \
  --accelerator h100 \
  --intent inference \
  --os ubuntu \
  --platform dynamo \
  --output recipe.yaml
```

## Validate Recipe Constraints

```shell
aicr validate \
  --recipe recipe.yaml \
  --snapshot snapshot.yaml \
  --no-cluster \
  --phase deployment \
  --output dry-run.json
```

## Generate Bundle

```shell
aicr bundle \
  --recipe recipe.yaml \
  --accelerated-node-selector nodeGroup=gpu-worker \
  --accelerated-node-toleration dedicated=worker-workload:NoSchedule \
  --accelerated-node-toleration dedicated=worker-workload:NoExecute \
  --system-node-selector nodeGroup=system-worker \
  --system-node-toleration dedicated=system-workload:NoSchedule \
  --system-node-toleration dedicated=system-workload:NoExecute \
  --output bundle
```

> Both options accept comma-separated values, so multiple values can be supplied in a single flag. See the [bundle](../docs/user/cli-reference.md#aicr-bundle) section for more information.

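For example, each pair of toleration flags above can be collapsed into a single comma-separated flag. This is a sketch with the same values as before, assuming the comma-splitting behavior described in the CLI reference:

```shell
aicr bundle \
  --recipe recipe.yaml \
  --accelerated-node-selector nodeGroup=gpu-worker \
  --accelerated-node-toleration dedicated=worker-workload:NoSchedule,dedicated=worker-workload:NoExecute \
  --system-node-selector nodeGroup=system-worker \
  --system-node-toleration dedicated=system-workload:NoSchedule,dedicated=system-workload:NoExecute \
  --output bundle
```
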
## Install Bundle into the Cluster

```shell
cd ./bundle && chmod +x deploy.sh && ./deploy.sh
```

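Once `deploy.sh` completes, one way to sanity-check the rollout is to watch pods converge (this makes no assumption about which namespaces the bundle creates):

```shell
# Watch pods across all namespaces until the bundle's components are Running
kubectl get pods -A -o wide -w
```
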
## Validate Cluster

```shell
aicr validate \
  --recipe recipe.yaml \
  --toleration dedicated=worker-workload:NoSchedule \
  --toleration dedicated=worker-workload:NoExecute \
  --phase all \
  --output report.json
```
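
The report's schema isn't documented here, so as a first pass you can simply pretty-print it and scan the results:

```shell
# Pretty-print the validation report (assumes only that it is valid JSON)
jq . report.json
```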

## Deploy Inference Workload

Deploy an inference serving graph using the Dynamo platform:

```shell
# Deploy the vLLM aggregation workload (includes KAI queue + DynamoGraphDeployment)
kubectl apply -f demos/workloads/inference/vllm-agg.yaml

# Monitor the deployment
kubectl get dynamographdeployments -n dynamo-workload
kubectl get pods -n dynamo-workload -o wide -w

# Verify the inference gateway routes to the workload
kubectl get gateway inference-gateway -n kgateway-system
kubectl get inferencepool -n dynamo-workload
```

## Chat with the Model

Once the workload is running, start a local chat server:

```shell
# Start the chat server (port-forwards to the inference gateway)
bash demos/workloads/inference/chat-server.sh

# Open the chat UI in your browser
open demos/workloads/inference/chat.html
```
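
You can also exercise the endpoint directly from the command line. The port, path, and model name below are assumptions (an OpenAI-compatible endpoint on the port that `chat-server.sh` forwards); adjust them to your setup:

```shell
# Hypothetical direct request to the forwarded inference gateway;
# port 8000, the /v1 path, and <model-name> are assumptions.
curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "<model-name>", "messages": [{"role": "user", "content": "Hello"}]}'
```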

## Success

* Bundle deployed with 16 components (inference recipe)
* CNCF conformance: 9/9 requirements pass
  * DRA Support, Gang Scheduling, Secure GPU Access, Accelerator Metrics,
    AI Service Metrics, Inference Gateway, Robust Controller (Dynamo),
    Pod Autoscaling (HPA), Cluster Autoscaling
* Dynamo inference workload serving requests via inference gateway