Merge branch 'main' into feat/chainsaw-validation

mchmarny · web-flow · commit 0d302df81bba · 2026-02-24T05:58:32.000-08:00
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -2,14 +2,84 @@
 
 All notable changes to this project will be documented in this file.
 
-## [0.7.6] - 2026-02-21
+## [0.7.7] - 2026-02-24
 
-### Refactor
+### Bug Fixes
 
-- Codebase consistency fixes and test coverage  by [@mchmarny](https://github.com/mchmarny)
+- Resolve gosec lint issues and bump golangci-lint to v2.10.1 by [@mchmarny](https://github.com/mchmarny)
+- Guard against empty path in NewFileReader after filepath.Clean by [@mchmarny](https://github.com/mchmarny)
+- Pass cluster K8s version to Helm SDK chart rendering  by [@mchmarny](https://github.com/mchmarny)
+- *(e2e)* Update deploy-agent test for current snapshot CLI  by [@mchmarny](https://github.com/mchmarny)
+- Prevent snapshot agent Job from nesting agent deployment  by [@mchmarny](https://github.com/mchmarny)
+
+### Build
+
+- Release v0.7.7 by [@mchmarny](https://github.com/mchmarny)
+
+### CI/CD
+
+- Harden workflows and reduce duplication  by [@mchmarny](https://github.com/mchmarny)
+
+### Features
+
+- *(ci)* Add metrics-driven cluster autoscaling validation with Karpenter + KWOK  by [@dims](https://github.com/dims)
+- *(validator)* Add Go-based CNCF AI conformance checks  by [@dims](https://github.com/dims)
+- *(validator)* Self-contained DRA conformance check with EKS overlays  by [@dims](https://github.com/dims)
+- *(validator)* Self-contained gang scheduling conformance check  by [@dims](https://github.com/dims)
+- *(validator)* Upgrade conformance checks from static to behavioral validation  by [@dims](https://github.com/dims)
+- Add conformance evidence renderer and fix check false-positives  by [@dims](https://github.com/dims)
+- *(validator)* Replace helm CLI subprocess with Helm Go SDK for chart rendering  by [@xdu31](https://github.com/xdu31)
+- Add HPA pod autoscaling evidence for CNCF AI Conformance  by [@yuanchen8911](https://github.com/yuanchen8911)
+- *(collector)* Add Helm release and ArgoCD Application collectors  by [@mchmarny](https://github.com/mchmarny)
+- Add cluster autoscaling evidence for CNCF AI Conformance  by [@yuanchen8911](https://github.com/yuanchen8911)
+- *(ci)* Binary attestation with SLSA Build Provenance v1  by [@lockwobr](https://github.com/lockwobr)
 
 ### Tasks
 
+- *(ci)* Remove redundant DRA test steps from inference workflow  by [@dims](https://github.com/dims)
+- Upgrade Go to 1.26.0  by [@mchmarny](https://github.com/mchmarny)
+- *(validator)* Remove Job-based checks from readiness phase, keep constraint-only gate  by [@xdu31](https://github.com/xdu31)
+- *(recipe)* Add conformance recipe invariant tests  by [@dims](https://github.com/dims)
+
+## [0.7.7] - 2026-02-24
+
+### Bug Fixes
+
+- Resolve gosec lint issues and bump golangci-lint to v2.10.1 by [@mchmarny](https://github.com/mchmarny)
+- Guard against empty path in NewFileReader after filepath.Clean by [@mchmarny](https://github.com/mchmarny)
+- Pass cluster K8s version to Helm SDK chart rendering  by [@mchmarny](https://github.com/mchmarny)
+- *(e2e)* Update deploy-agent test for current snapshot CLI  by [@mchmarny](https://github.com/mchmarny)
+- Prevent snapshot agent Job from nesting agent deployment  by [@mchmarny](https://github.com/mchmarny)
+
+### CI/CD
+
+- Harden workflows and reduce duplication  by [@mchmarny](https://github.com/mchmarny)
+
+### Features
+
+- *(ci)* Add metrics-driven cluster autoscaling validation with Karpenter + KWOK  by [@dims](https://github.com/dims)
+- *(validator)* Add Go-based CNCF AI conformance checks  by [@dims](https://github.com/dims)
+- *(validator)* Self-contained DRA conformance check with EKS overlays  by [@dims](https://github.com/dims)
+- *(validator)* Self-contained gang scheduling conformance check  by [@dims](https://github.com/dims)
+- *(validator)* Upgrade conformance checks from static to behavioral validation  by [@dims](https://github.com/dims)
+- Add conformance evidence renderer and fix check false-positives  by [@dims](https://github.com/dims)
+- *(validator)* Replace helm CLI subprocess with Helm Go SDK for chart rendering  by [@xdu31](https://github.com/xdu31)
+- Add HPA pod autoscaling evidence for CNCF AI Conformance  by [@yuanchen8911](https://github.com/yuanchen8911)
+- *(collector)* Add Helm release and ArgoCD Application collectors  by [@mchmarny](https://github.com/mchmarny)
+- Add cluster autoscaling evidence for CNCF AI Conformance  by [@yuanchen8911](https://github.com/yuanchen8911)
+
+### Tasks
+
+- *(recipe)* Add conformance recipe invariant tests  by [@dims](https://github.com/dims)
+- *(validator)* Remove Job-based checks from readiness phase, keep constraint-only gate  by [@xdu31](https://github.com/xdu31)
+- *(ci)* Remove redundant DRA test steps from inference workflow  by [@dims](https://github.com/dims)
+- Upgrade Go to 1.26.0  by [@mchmarny](https://github.com/mchmarny)
+
+## [0.7.6] - 2026-02-21
+
+### Tasks
+
+- Codebase consistency fixes and test coverage  by [@mchmarny](https://github.com/mchmarny)
 - Rename cleanup by [@mchmarny](https://github.com/mchmarny)
 - Remove redundant local e2e script by [@mchmarny](https://github.com/mchmarny)
 - Remove flox environment support by [@mchmarny](https://github.com/mchmarny)
diff --git a/demos/cuj1.md b/demos/cuj1.md
@@ -14,18 +14,23 @@ aicr recipe \
   --output recipe.yaml
 ```
 
+## Validate Recipe Constraints
+
+> Setting an additional `--namespace` or `--node-selector` flag to land the agent on the right node is OK
+
+```shell
+aicr validate \
+  --phase readiness \
+  --recipe recipe.yaml
+```
+
 ## Generate Bundle
 
-> Assuming user updates selectors and tolerations as needed
+> Setting additional `--accelerated-node-selector`, `--accelerated-node-toleration`, or `--system-node-toleration` flags to land the agent on the right node is OK
 
 ```shell
 aicr bundle \
   --recipe recipe.yaml \
-  --accelerated-node-selector nodeGroup=gpu-worker \
-  --accelerated-node-toleration dedicated=worker-workload:NoSchedule \
-  --accelerated-node-toleration dedicated=worker-workload:NoExecute \
-  --system-node-toleration dedicated=system-workload:NoSchedule \
-  --system-node-toleration dedicated=system-workload:NoExecute \
   --output bundle
 ```
 
@@ -40,10 +45,10 @@ cd ./bundle && chmod +x deploy.sh && ./deploy.sh
 ```shell
 aicr validate \
   --recipe recipe.yaml \
+  --output report.yaml \
   --phase readiness \
   --phase deployment \
-  --phase conformance \
-  --output report.yaml
+  --phase conformance
 ```
 
 ## Run Job
diff --git a/demos/cuj2.md b/demos/cuj2.md
@@ -3,7 +3,7 @@
 > Assuming user is already authenticated to Kubernetes cluster
 
 ## Gen Recipe
-TODO: add `gb200` accelerator
+
 ```shell
 aicr recipe \
   --service eks \
@@ -13,156 +13,53 @@ aicr recipe \
   --platform dynamo \
   --output recipe.yaml
 ```
-Sample output
-```
-[cli] building recipe from criteria: criteria=criteria(service=eks, accelerator=h100, intent=inference, os=ubuntu, platform=dynamo)
-[cli] recipe generation completed: output=recipe.yaml components=16 overlays=7
-```
 
 ## Validate Recipe Constraints
 
+> Setting an additional `--namespace` or `--node-selector` flag to land the agent on the right node is OK
+
 ```shell
 aicr validate \
   --phase readiness \
-  --namespace gpu-operator \
-  --node-selector nodeGroup=customer-gpu \
   --recipe recipe.yaml
 ```
 
-Sample output:
-```
-recipeSource: recipe.yaml
-snapshotSource: agent:gpu-operator/aicr-validate
-summary:
-  passed: 4
-  failed: 0
-  skipped: 0
-  total: 4
-  status: pass
-  duration: 477.583µs
-phases:
-  readiness:
-    status: pass
-    constraints:
-      - name: K8s.server.version
-        expected: '>= 1.34'
-        actual: v1.34.3-eks-ac2d5a0
-        status: passed
-      - name: OS.release.ID
-        expected: ubuntu
-        actual: ubuntu
-        status: passed
-      - name: OS.release.VERSION_ID
-        expected: "24.04"
-        actual: "24.04"
-        status: passed
-      - name: OS.sysctl./proc/sys/kernel/osrelease
-        expected: '>= 6.8'
-        actual: 6.14.0-1018-aws
-        status: passed
-    duration: 477.583µs
-```
-
-> Assuming cluster meets recipe constraints
-
 ## Generate Bundle
 
-> Assuming user updates selectors and tolerations as needed
+> Setting additional `--accelerated-node-selector`, `--accelerated-node-toleration`, or `--system-node-toleration` flags to land the agent on the right node is OK
 
 ```shell
 aicr bundle \
   --recipe recipe.yaml \
-  --accelerated-node-selector nodeGroup=gpu-worker \
-  --accelerated-node-toleration dedicated=worker-workload:NoSchedule \
-  --accelerated-node-toleration dedicated=worker-workload:NoExecute \
-  --system-node-toleration dedicated=system-workload:NoSchedule \
-  --system-node-toleration dedicated=system-workload:NoExecute \
   --output bundle
 ```
 
-Sample output:
-```
-[cli] generating bundle: deployer=helm type=Helm per-component bundle recipe=recipe.yaml output=./bundle oci=false
-[cli] bundle generated: type=Helm per-component bundle files=42 size_bytes=666795 duration_sec=0.053811959 output_dir=./bundle
-
-Helm per-component bundle generated successfully!
-Output directory: ./bundle
-Files generated: 42
-
-To deploy:
-  1. cd ./bundle
-  2. chmod +x deploy.sh
-  3. ./deploy.sh
-```
-
 ## Install Bundle into the Cluster
 
 ```shell
-chmod +x deploy.sh
-./deploy.sh
+cd ./bundle && chmod +x deploy.sh && ./deploy.sh
 ```
 
 ## Validate Cluster 
 
 ```shell
 aicr validate \
+  --recipe recipe.yaml \
+  --output report.yaml \
   --phase readiness \
   --phase deployment \
-  --phase conformance \
-  --recipe recipe.yaml
-```
-
-Results (TODO: add full per-component health check and AI Conformance check)
-
-```
-recipeSource: recipe.yaml
-snapshotSource: agent:gpu-operator/aicr-validate
-summary:
-  passed: 4
-  failed: 0
-  skipped: 0
-  total: 4
-  status: pass
-  duration: 1.452461125s
-phases:
-  conformance:
-    status: skipped
-    reason: conformance phase not configured in recipe
-    duration: 9.709µs
-  deployment:
-    status: skipped
-    reason: deployment phase not configured in recipe
-    duration: 7.042µs
-  readiness:
-    status: pass
-    constraints:
-      - name: K8s.server.version
-        expected: '>= 1.34'
-        actual: v1.34.3-eks-ac2d5a0
-        status: passed
-      - name: OS.release.ID
-        expected: ubuntu
-        actual: ubuntu
-        status: passed
-      - name: OS.release.VERSION_ID
-        expected: "24.04"
-        actual: "24.04"
-        status: passed
-      - name: OS.sysctl./proc/sys/kernel/osrelease
-        expected: '>= 6.8'
-        actual: 6.14.0-1018-aws
-        status: passed
-    duration: 64µs
+  --phase conformance
 ```
 
 ## Run Inference Workload
 
 ### Create namespace and HuggingFace secret
 
+> Set the `HF_TOKEN` environment variable first
+
 ```shell
 kubectl create ns dynamo-workload
 
-# Create HuggingFace token secret (set HF_TOKEN env var first)
 sed "s/<your-hf-token>/$HF_TOKEN/" \
   demos/workloads/inference/hf-token-secret.yaml | kubectl apply -f -
 ```
@@ -171,50 +68,22 @@ sed "s/<your-hf-token>/$HF_TOKEN/" \
 
 ```shell
 kubectl apply -f demos/workloads/inference/vllm-agg.yaml
-
-# Monitor deployment
-kubectl get dynamographdeployments -n dynamo-workload
-kubectl get pods -n dynamo-workload -w
 ```
 
-Wait until all pods are `Running` and ready:
-```
-NAME                                    READY   STATUS    RESTARTS   AGE
-vllm-agg-frontend-0                     1/1     Running   0          2m
-vllm-agg-vllmdecodeworker-0             1/1     Running   0          2m
-```
-
-### Architecture
+Monitor the deployment until all pods are `Running` and ready:
 
-```
-  ┌─────────┐   HTTP    ┌────────────────┐  NATS   ┌────────────────────┐
-  │  Client  │─────────▶│   Frontend     │────────▶│  VllmDecodeWorker  │
-  │ (OpenAI  │  :8000   │                │  :4222  │                    │
-  │  API)    │◀─────────│  vllm-runtime  │◀────────│  dynamo.vllm       │
-  └─────────┘           │  Qwen3-0.6B   │         │  Qwen3-0.6B       │
-                        │                │         │  1x H100 GPU       │
-                        │  CPU node      │         │  GPU node          │
-                        └────────────────┘         └────────────────────┘
-                         ip-100-64-83-166           ip-100-64-171-120
-                         svc: :8000                 svc: :9090
-
-  Services:
-    Frontend          1/1 Ready   componentType: frontend
-    VllmDecodeWorker  1/1 Ready   componentType: worker   gpu: 1
-
-  Flow:
-    1. Client sends OpenAI request (/v1/chat/completions) → Frontend :8000
-    2. Frontend dispatches inference work via NATS :4222
-    3. VllmDecodeWorker runs Qwen/Qwen3-0.6B on H100, returns result
-    4. Response streams back: Worker → NATS → Frontend → Client
+```shell
+kubectl get dynamographdeployments -n dynamo-workload
+kubectl get pods -n dynamo-workload -w
 ```
 
 ### Test the endpoint
 
 #### Option 1: Chat UI (browser)
 
+Launch the chat server (port-forward + local UI on port 9090)
+
 ```shell
-# Launch the chat server (port-forward + local UI on port 9090)
 ./demos/workloads/inference/chat-server.sh
 ```