Commit 0d302df

Merge branch 'main' into feat/chainsaw-validation
2 parents b975bcc + ccca286 commit 0d302df

3 files changed: 102 additions, 158 deletions


CHANGELOG.md

Lines changed: 73 additions & 3 deletions
@@ -2,14 +2,84 @@
 
 All notable changes to this project will be documented in this file.
 
-## [0.7.6] - 2026-02-21
+## [0.7.7] - 2026-02-24
 
-### Refactor
+### Bug Fixes
 
-- Codebase consistency fixes and test coverage by [@mchmarny](https://github.com/mchmarny)
+- Resolve gosec lint issues and bump golangci-lint to v2.10.1 by [@mchmarny](https://github.com/mchmarny)
+- Guard against empty path in NewFileReader after filepath.Clean by [@mchmarny](https://github.com/mchmarny)
+- Pass cluster K8s version to Helm SDK chart rendering by [@mchmarny](https://github.com/mchmarny)
+- *(e2e)* Update deploy-agent test for current snapshot CLI by [@mchmarny](https://github.com/mchmarny)
+- Prevent snapshot agent Job from nesting agent deployment by [@mchmarny](https://github.com/mchmarny)
+
+### Build
+
+- Release v0.7.7 by [@mchmarny](https://github.com/mchmarny)
+
+### CI/CD
+
+- Harden workflows and reduce duplication by [@mchmarny](https://github.com/mchmarny)
+
+### Features
+
+- *(ci)* Add metrics-driven cluster autoscaling validation with Karpenter + KWOK by [@dims](https://github.com/dims)
+- *(validator)* Add Go-based CNCF AI conformance checks by [@dims](https://github.com/dims)
+- *(validator)* Self-contained DRA conformance check with EKS overlays by [@dims](https://github.com/dims)
+- *(validator)* Self-contained gang scheduling conformance check by [@dims](https://github.com/dims)
+- *(validator)* Upgrade conformance checks from static to behavioral validation by [@dims](https://github.com/dims)
+- Add conformance evidence renderer and fix check false-positives by [@dims](https://github.com/dims)
+- *(validator)* Replace helm CLI subprocess with Helm Go SDK for chart rendering by [@xdu31](https://github.com/xdu31)
+- Add HPA pod autoscaling evidence for CNCF AI Conformance by [@yuanchen8911](https://github.com/yuanchen8911)
+- *(collector)* Add Helm release and ArgoCD Application collectors by [@mchmarny](https://github.com/mchmarny)
+- Add cluster autoscaling evidence for CNCF AI Conformance by [@yuanchen8911](https://github.com/yuanchen8911)
+- *(ci)* Binary attestation with SLSA Build Provenance v1 by [@lockwobr](https://github.com/lockwobr)
 
 ### Tasks
 
+- *(ci)* Remove redundant DRA test steps from inference workflow by [@dims](https://github.com/dims)
+- Upgrade Go to 1.26.0 by [@mchmarny](https://github.com/mchmarny)
+- *(validator)* Remove Job-based checks from readiness phase, keep constraint-only gate by [@xdu31](https://github.com/xdu31)
+- *(recipe)* Add conformance recipe invariant tests by [@dims](https://github.com/dims)
+
+## [0.7.7] - 2026-02-24
+
+### Bug Fixes
+
+- Resolve gosec lint issues and bump golangci-lint to v2.10.1 by [@mchmarny](https://github.com/mchmarny)
+- Guard against empty path in NewFileReader after filepath.Clean by [@mchmarny](https://github.com/mchmarny)
+- Pass cluster K8s version to Helm SDK chart rendering by [@mchmarny](https://github.com/mchmarny)
+- *(e2e)* Update deploy-agent test for current snapshot CLI by [@mchmarny](https://github.com/mchmarny)
+- Prevent snapshot agent Job from nesting agent deployment by [@mchmarny](https://github.com/mchmarny)
+
+### CI/CD
+
+- Harden workflows and reduce duplication by [@mchmarny](https://github.com/mchmarny)
+
+### Features
+
+- *(ci)* Add metrics-driven cluster autoscaling validation with Karpenter + KWOK by [@dims](https://github.com/dims)
+- *(validator)* Add Go-based CNCF AI conformance checks by [@dims](https://github.com/dims)
+- *(validator)* Self-contained DRA conformance check with EKS overlays by [@dims](https://github.com/dims)
+- *(validator)* Self-contained gang scheduling conformance check by [@dims](https://github.com/dims)
+- *(validator)* Upgrade conformance checks from static to behavioral validation by [@dims](https://github.com/dims)
+- Add conformance evidence renderer and fix check false-positives by [@dims](https://github.com/dims)
+- *(validator)* Replace helm CLI subprocess with Helm Go SDK for chart rendering by [@xdu31](https://github.com/xdu31)
+- Add HPA pod autoscaling evidence for CNCF AI Conformance by [@yuanchen8911](https://github.com/yuanchen8911)
+- *(collector)* Add Helm release and ArgoCD Application collectors by [@mchmarny](https://github.com/mchmarny)
+- Add cluster autoscaling evidence for CNCF AI Conformance by [@yuanchen8911](https://github.com/yuanchen8911)
+
+### Tasks
+
+- *(recipe)* Add conformance recipe invariant tests by [@dims](https://github.com/dims)
+- *(validator)* Remove Job-based checks from readiness phase, keep constraint-only gate by [@xdu31](https://github.com/xdu31)
+- *(ci)* Remove redundant DRA test steps from inference workflow by [@dims](https://github.com/dims)
+- Upgrade Go to 1.26.0 by [@mchmarny](https://github.com/mchmarny)
+
+## [0.7.6] - 2026-02-21
+
+### Tasks
+
+- Codebase consistency fixes and test coverage by [@mchmarny](https://github.com/mchmarny)
 - Rename cleanup by [@mchmarny](https://github.com/mchmarny)
 - Remove redundant local e2e script by [@mchmarny](https://github.com/mchmarny)
 - Remove flox environment support by [@mchmarny](https://github.com/mchmarny)

demos/cuj1.md

Lines changed: 13 additions & 8 deletions
@@ -14,18 +14,23 @@ aicr recipe \
 --output recipe.yaml
 ```
 
+## Validate Recipe Constraints
+
+> Setting additional `--namespace` or `--node-selector` flag to land the agent on the right node is OK
+
+```shell
+aicr validate \
+--phase readiness \
+--recipe recipe.yaml
+```
+
 ## Generate Bundle
 
-> Assuming user updates selectors and tolerations as needed
+> Setting additional `--accelerated-node-selector`, `--accelerated-node-toleration`, or `--system-node-toleration` flags to land the agent on the right node is OK
 
 ```shell
 aicr bundle \
 --recipe recipe.yaml \
---accelerated-node-selector nodeGroup=gpu-worker \
---accelerated-node-toleration dedicated=worker-workload:NoSchedule \
---accelerated-node-toleration dedicated=worker-workload:NoExecute \
---system-node-toleration dedicated=system-workload:NoSchedule \
---system-node-toleration dedicated=system-workload:NoExecute \
 --output bundle
 ```
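Review note: the callout above makes the scheduling flags optional, so a script can add them only when the cluster needs them. A minimal sketch, assuming the flag names and example values from the lines this hunk removes; `build_bundle_cmd`, `NODE_SELECTOR`, and `TOLERATION` are hypothetical names used only for illustration. It prints the command for review rather than running it.

```shell
# Hypothetical helper: assemble the `aicr bundle` command line,
# adding optional scheduling flags only when the variables are set.
build_bundle_cmd() {
  # Base command as shown in the updated demo.
  set -- aicr bundle --recipe recipe.yaml
  # Optional: node selector for accelerated nodes (flag name from this diff).
  if [ -n "${NODE_SELECTOR:-}" ]; then
    set -- "$@" --accelerated-node-selector "$NODE_SELECTOR"
  fi
  # Optional: toleration for accelerated nodes (flag name from this diff).
  if [ -n "${TOLERATION:-}" ]; then
    set -- "$@" --accelerated-node-toleration "$TOLERATION"
  fi
  set -- "$@" --output bundle
  # Print the assembled command on one line instead of executing it.
  printf '%s\n' "$*"
}
```

For example, `NODE_SELECTOR=nodeGroup=gpu-worker build_bundle_cmd` prints the base command with the selector flag inserted before `--output bundle`.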

@@ -40,10 +45,10 @@ cd ./bundle && chmod +x deploy.sh && ./deploy.sh
 ```shell
 aicr validate \
 --recipe recipe.yaml \
+--output report.yaml \
 --phase readiness \
 --phase deployment \
---phase conformance \
---output report.yaml
+--phase conformance
 ```
 
 ## Run Job

demos/cuj2.md

Lines changed: 16 additions & 147 deletions
@@ -3,7 +3,7 @@
 > Assuming user is already authenticated to Kubernetes cluster
 
 ## Gen Recipe
-TODO: add `gb200` accelerator
+
 ```shell
 aicr recipe \
 --service eks \
@@ -13,156 +13,53 @@ aicr recipe \
 --platform dynamo \
 --output recipe.yaml
 ```
-Sample output
-```
-[cli] building recipe from criteria: criteria=criteria(service=eks, accelerator=h100, intent=inference, os=ubuntu, platform=dynamo)
-[cli] recipe generation completed: output=recipe.yaml components=16 overlays=7
-```
 
 ## Validate Recipe Constraints
 
+> Setting additional `--namespace` or `--node-selector` flag to land the agent on the right node is OK
+
 ```shell
 aicr validate \
 --phase readiness \
---namespace gpu-operator \
---node-selector nodeGroup=customer-gpu \
 --recipe recipe.yaml
 ```
 
-Sample output:
-```
-recipeSource: recipe.yaml
-snapshotSource: agent:gpu-operator/aicr-validate
-summary:
-passed: 4
-failed: 0
-skipped: 0
-total: 4
-status: pass
-duration: 477.583µs
-phases:
-readiness:
-status: pass
-constraints:
-- name: K8s.server.version
-expected: '>= 1.34'
-actual: v1.34.3-eks-ac2d5a0
-status: passed
-- name: OS.release.ID
-expected: ubuntu
-actual: ubuntu
-status: passed
-- name: OS.release.VERSION_ID
-expected: "24.04"
-actual: "24.04"
-status: passed
-- name: OS.sysctl./proc/sys/kernel/osrelease
-expected: '>= 6.8'
-actual: 6.14.0-1018-aws
-status: passed
-duration: 477.583µs
-```
-
-> Assuming cluster meets recipe constraints
-
 ## Generate Bundle
 
-> Assuming user updates selectors and tolerations as needed
+> Setting additional `--accelerated-node-selector`, `--accelerated-node-toleration`, or `--system-node-toleration` flags to land the agent on the right node is OK
 
 ```shell
 aicr bundle \
 --recipe recipe.yaml \
---accelerated-node-selector nodeGroup=gpu-worker \
---accelerated-node-toleration dedicated=worker-workload:NoSchedule \
---accelerated-node-toleration dedicated=worker-workload:NoExecute \
---system-node-toleration dedicated=system-workload:NoSchedule \
---system-node-toleration dedicated=system-workload:NoExecute \
 --output bundle
 ```
 
-Sample output:
-```
-[cli] generating bundle: deployer=helm type=Helm per-component bundle recipe=recipe.yaml output=./bundle oci=false
-[cli] bundle generated: type=Helm per-component bundle files=42 size_bytes=666795 duration_sec=0.053811959 output_dir=./bundle
-
-Helm per-component bundle generated successfully!
-Output directory: ./bundle
-Files generated: 42
-
-To deploy:
-1. cd ./bundle
-2. chmod +x deploy.sh
-3. ./deploy.sh
-```
-
 ## Install Bundle into the Cluster
 
 ```shell
-chmod +x deploy.sh
-./deploy.sh
+cd ./bundle && chmod +x deploy.sh && ./deploy.sh
 ```
 
 ## Validate Cluster
 
 ```shell
 aicr validate \
+--recipe recipe.yaml \
+--output report.yaml \
 --phase readiness \
 --phase deployment \
---phase conformance \
---recipe recipe.yaml
-```
-
-Results (TODO: add full per-component health check and AI Conformance check)
-
-```
-recipeSource: recipe.yaml
-snapshotSource: agent:gpu-operator/aicr-validate
-summary:
-passed: 4
-failed: 0
-skipped: 0
-total: 4
-status: pass
-duration: 1.452461125s
-phases:
-conformance:
-status: skipped
-reason: conformance phase not configured in recipe
-duration: 9.709µs
-deployment:
-status: skipped
-reason: deployment phase not configured in recipe
-duration: 7.042µs
-readiness:
-status: pass
-constraints:
-- name: K8s.server.version
-expected: '>= 1.34'
-actual: v1.34.3-eks-ac2d5a0
-status: passed
-- name: OS.release.ID
-expected: ubuntu
-actual: ubuntu
-status: passed
-- name: OS.release.VERSION_ID
-expected: "24.04"
-actual: "24.04"
-status: passed
-- name: OS.sysctl./proc/sys/kernel/osrelease
-expected: '>= 6.8'
-actual: 6.14.0-1018-aws
-status: passed
-duration: 64µs
+--phase conformance
 ```
 
 ## Run Inference Workload
 
 ### Create namespace and HuggingFace secret
 
+> Set HF_TOKEN env var first
+
 ```shell
 kubectl create ns dynamo-workload
 
-# Create HuggingFace token secret (set HF_TOKEN env var first)
 sed "s/<your-hf-token>/$HF_TOKEN/" \
 demos/workloads/inference/hf-token-secret.yaml | kubectl apply -f -
 ```
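Review note: with the reminder moved out of the code block, an unset `HF_TOKEN` would silently substitute an empty string into the manifest. A minimal sketch of a guard, assuming the same `sed` substitution as the demo; `render_hf_secret` is a hypothetical helper name, not part of the repository.

```shell
# Hypothetical helper: render the secret manifest from stdin,
# refusing to run when HF_TOKEN is unset or empty.
render_hf_secret() {
  if [ -z "${HF_TOKEN:-}" ]; then
    # Fail early instead of writing an empty token into the manifest.
    echo "HF_TOKEN is not set" >&2
    return 1
  fi
  # Same placeholder substitution as the demo command.
  sed "s/<your-hf-token>/$HF_TOKEN/"
}
```

It could be used as `render_hf_secret < demos/workloads/inference/hf-token-secret.yaml | kubectl apply -f -`. A token containing `/` or `&` would need escaping before being passed to `sed`.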
@@ -171,50 +68,22 @@ sed "s/<your-hf-token>/$HF_TOKEN/" \
 
 ```shell
 kubectl apply -f demos/workloads/inference/vllm-agg.yaml
-
-# Monitor deployment
-kubectl get dynamographdeployments -n dynamo-workload
-kubectl get pods -n dynamo-workload -w
 ```
 
-Wait until all pods are `Running` and ready:
-```
-NAME READY STATUS RESTARTS AGE
-vllm-agg-frontend-0 1/1 Running 0 2m
-vllm-agg-vllmdecodeworker-0 1/1 Running 0 2m
-```
-
-### Architecture
+Monitor deployment, until all pods are `Running` and ready:
 
-```
-┌─────────┐ HTTP ┌────────────────┐ NATS ┌────────────────────┐
-│ Client │─────────▶│ Frontend │────────▶│ VllmDecodeWorker │
-│ (OpenAI │ :8000 │ │ :4222 │ │
-│ API) │◀─────────│ vllm-runtime │◀────────│ dynamo.vllm │
-└─────────┘ │ Qwen3-0.6B │ │ Qwen3-0.6B │
-│ │ │ 1x H100 GPU │
-│ CPU node │ │ GPU node │
-└────────────────┘ └────────────────────┘
-ip-100-64-83-166 ip-100-64-171-120
-svc: :8000 svc: :9090
-
-Services:
-Frontend 1/1 Ready componentType: frontend
-VllmDecodeWorker 1/1 Ready componentType: worker gpu: 1
-
-Flow:
-1. Client sends OpenAI request (/v1/chat/completions) → Frontend :8000
-2. Frontend dispatches inference work via NATS :4222
-3. VllmDecodeWorker runs Qwen/Qwen3-0.6B on H100, returns result
-4. Response streams back: Worker → NATS → Frontend → Client
+```shell
+kubectl get dynamographdeployments -n dynamo-workload
+kubectl get pods -n dynamo-workload -w
 ```
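Review note: as an alternative to watching pod status by hand, `kubectl wait` can block until pods report Ready, which suits scripted runs. A sketch, assuming the `dynamo-workload` namespace from the demo; `wait_for_workload` and the `300s` default timeout are hypothetical choices.

```shell
# Hypothetical helper: block until every pod in the namespace is Ready.
# $1 = namespace (defaults to the demo's namespace), $2 = timeout.
wait_for_workload() {
  ns="${1:-dynamo-workload}"
  timeout="${2:-300s}"
  # Exits non-zero if the pods are not Ready before the timeout.
  kubectl wait pods --all --for=condition=Ready -n "$ns" --timeout="$timeout"
}
```

Note that `kubectl wait` may fail immediately if no pods exist in the namespace yet, so it is best called after the deployment has started creating pods.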
 
 ### Test the endpoint
 
 #### Option 1: Chat UI (browser)
 
+Launch the chat server (port-forward + local UI on port 9090)
+
 ```shell
-# Launch the chat server (port-forward + local UI on port 9090)
 ./demos/workloads/inference/chat-server.sh
 ```
 