
Commit dfe36cc

feat: enhance conformance evidence collection with gateway, webhook, and HPA scale-down tests
Enhance the evidence collection script and regenerate all evidence with
additional checks inspired by the Go-based conformance validator.

Script enhancements:

- Gateway: verify GatewayClass Accepted and Gateway Programmed conditions (not just existence)
- Robust operator: add webhook rejection test (submit invalid CR, verify webhook denies it)
- HPA: add scale-down verification after scale-up (replace GPU workload with idle container, verify HPA scales back to minReplicas)
- HPA: fix pod Error status during scale-down by deleting deployment cleanly before creating idle replacement
- Fix capture function to strip absolute paths from command display
- Fix namespace deletion race with kubectl wait --for=delete
- Tighten HPA verdict to require actual scaling for PASS
- Add early exit for unhealthy pods in HPA wait loop
- Remove readOnlyRootFilesystem from DRA test manifests (blocks CDI device injection)
- Replace gpu-burn references with CUDA N-Body Simulation
- Sanitize AMI ID in cluster-autoscaling evidence

Evidence regenerated:

- All 8 conformance requirements: PASS
- No leaked local paths or sensitive information
- Consistent format across all evidence documents

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
1 parent 7f4da79 commit dfe36cc
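One of the checks this commit adds, the operator webhook rejection test, decides its verdict by grepping the `kubectl apply` error text. That matching logic can be exercised on its own with a canned error message; the sample output and webhook name below are hypothetical, not taken from this repository:

```shell
#!/bin/sh
# Hypothetical `kubectl apply` output for a CR denied by a validating webhook.
webhook_result='Error from server (Forbidden): admission webhook "vdgd.example.com" denied the request: spec must not be empty'

# Same pattern the script uses: denied/forbidden/invalid/error all count as rejection.
if echo "${webhook_result}" | grep -qi "denied\|forbidden\|invalid\|error"; then
  echo "Webhook correctly rejected the invalid resource."
else
  echo "WARNING: Webhook did not reject the invalid resource."
fi
```

Note that `\|` alternation inside a basic regex is a GNU grep extension; on strictly POSIX greps, `grep -qiE 'denied|forbidden|invalid|error'` is the portable spelling.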

File tree

14 files changed: +843, -486 lines

docs/conformance/cncf/README.md (11 additions, 6 deletions)

@@ -32,7 +32,8 @@ docs/conformance/cncf/
 ├── accelerator-metrics.md
 ├── inference-gateway.md
 ├── robust-operator.md
-└── pod-autoscaling.md
+├── pod-autoscaling.md
+└── cluster-autoscaling.md
 ```
 
 ## Usage
@@ -73,11 +74,12 @@ running actual GPU workloads on the cluster:
 ./docs/conformance/cncf/collect-evidence.sh gateway
 ./docs/conformance/cncf/collect-evidence.sh operator
 ./docs/conformance/cncf/collect-evidence.sh hpa
+./docs/conformance/cncf/collect-evidence.sh cluster-autoscaling
 ```
 
-> **Note:** The HPA test (`hpa`) deploys gpu-burn to stress the GPU and waits for
-> HPA to scale up. This takes ~5 minutes due to metric propagation through the
-> DCGM → Prometheus → prometheus-adapter → HPA pipeline.
+> **Note:** The HPA test (`hpa`) deploys a GPU stress workload (nbody) and waits
+> for HPA to scale up, then verifies scale-down. This takes ~5 minutes due to
+> metric propagation through the DCGM → Prometheus → prometheus-adapter → HPA pipeline.
 
 ### Why Two Steps?
 
@@ -88,15 +90,18 @@ running actual GPU workloads on the cluster:
 | DRA GPU allocation test | No | Yes |
 | Gang scheduling test | No | Yes |
 | Device isolation verification | No | Yes |
-| HPA scaling with GPU load | No | Yes |
+| Gateway condition checks (Accepted, Programmed) | No | Yes |
+| Webhook rejection test | No | Yes |
+| HPA scale-up and scale-down with GPU load | No | Yes |
 | Prometheus query results | No | Yes |
+| Cluster autoscaling (ASG config) | No | Yes |
 
 `aicr validate` checks that components are deployed correctly. `collect-evidence.sh`
 verifies they work correctly by running actual workloads. Both are needed for
 complete conformance evidence.
 
 > **Future:** Behavioral tests are inherently long-running (e.g., HPA test deploys
-> gpu-burn and waits ~5 minutes for metric propagation and scaling) and are better
+> CUDA N-Body Simulation and waits ~5 minutes for metric propagation and scaling) and are better
 > suited as a separate step than blocking `aicr validate`. A follow-up integration
 > is tracked in [#192](https://github.com/NVIDIA/aicr/issues/192).
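For context on the DCGM → Prometheus → prometheus-adapter → HPA pipeline the note references: a custom metric such as `gpu_utilization` typically reaches the HPA through a prometheus-adapter rule that renames a DCGM exporter series. A minimal illustrative rule might look like this; the label names and series name reflect a typical DCGM exporter setup and are assumptions, not configuration taken from this repository:

```yaml
# Illustrative prometheus-adapter rule (assumed DCGM labels, not from this repo):
# exposes the DCGM GPU utilization series as the custom metric `gpu_utilization`.
rules:
  - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{exported_namespace!="",exported_pod!=""}'
    resources:
      overrides:
        exported_namespace: {resource: "namespace"}
        exported_pod: {resource: "pod"}
    name:
      matches: "DCGM_FI_DEV_GPU_UTIL"
      as: "gpu_utilization"
    metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```

An HPA targeting `gpu_utilization` then reads this value through the custom metrics API, which is why scaling lags the workload by the scrape and adapter refresh intervals.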

docs/conformance/cncf/collect-evidence.sh (154 additions, 15 deletions)

@@ -137,6 +137,9 @@ Deploy a test pod that requests 1 GPU via ResourceClaim and verifies device acce
 
 **Test manifest:** `docs/conformance/cncf/manifests/dra-gpu-test.yaml`
 EOF
+echo '```yaml' >> "${EVIDENCE_FILE}"
+cat "${SCRIPT_DIR}/manifests/dra-gpu-test.yaml" >> "${EVIDENCE_FILE}"
+echo '```' >> "${EVIDENCE_FILE}"
 
 # Clean up any previous run
 kubectl delete namespace dra-test --ignore-not-found --wait=false 2>/dev/null || true
@@ -202,6 +205,9 @@ pods are scheduled atomically.
 
 **Test manifest:** `docs/conformance/cncf/manifests/gang-scheduling-test.yaml`
 EOF
+echo '```yaml' >> "${EVIDENCE_FILE}"
+cat "${SCRIPT_DIR}/manifests/gang-scheduling-test.yaml" >> "${EVIDENCE_FILE}"
+echo '```' >> "${EVIDENCE_FILE}"
 
 # Clean up any previous run
 kubectl delete namespace gang-scheduling-test --ignore-not-found --wait=false 2>/dev/null || true
@@ -606,17 +612,38 @@ EOF
 
 cat >> "${EVIDENCE_FILE}" <<'EOF'
 
+### Gateway Conditions
+
+Verify GatewayClass is Accepted and Gateway is Programmed (not just created).
+EOF
+# Check GatewayClass Accepted condition
+echo "" >> "${EVIDENCE_FILE}"
+echo "**GatewayClass conditions**" >> "${EVIDENCE_FILE}"
+echo '```' >> "${EVIDENCE_FILE}"
+kubectl get gatewayclass kgateway -o jsonpath='{range .status.conditions[*]}{.type}: {.status} ({.reason}){"\n"}{end}' >> "${EVIDENCE_FILE}" 2>&1
+echo '```' >> "${EVIDENCE_FILE}"
+
+# Check Gateway Programmed condition
+echo "" >> "${EVIDENCE_FILE}"
+echo "**Gateway conditions**" >> "${EVIDENCE_FILE}"
+echo '```' >> "${EVIDENCE_FILE}"
+kubectl get gateway inference-gateway -n kgateway-system -o jsonpath='{range .status.conditions[*]}{.type}: {.status} ({.reason}){"\n"}{end}' >> "${EVIDENCE_FILE}" 2>&1
+echo '```' >> "${EVIDENCE_FILE}"
+
+cat >> "${EVIDENCE_FILE}" <<'EOF'
+
 ## Inference Resources
 EOF
 capture "InferencePools" kubectl get inferencepools -A
 capture "HTTPRoutes" kubectl get httproutes -A
 
-# Verdict
+# Verdict — check both GatewayClass Accepted and Gateway Programmed
 echo "" >> "${EVIDENCE_FILE}"
-local gw_count
-gw_count=$(kubectl get gateways -A --no-headers 2>/dev/null | wc -l | tr -d ' ')
-if [ "${gw_count}" -gt 0 ]; then
-echo "**Result: PASS** — kgateway controller running, Gateway API and inference extension CRDs installed, active Gateway programmed with external address." >> "${EVIDENCE_FILE}"
+local gw_accepted gw_programmed
+gw_accepted=$(kubectl get gatewayclass kgateway -o jsonpath='{.status.conditions[?(@.type=="Accepted")].status}' 2>/dev/null)
+gw_programmed=$(kubectl get gateway inference-gateway -n kgateway-system -o jsonpath='{.status.conditions[?(@.type=="Programmed")].status}' 2>/dev/null)
+if [ "${gw_accepted}" = "True" ] && [ "${gw_programmed}" = "True" ]; then
+echo "**Result: PASS** — kgateway controller running, GatewayClass Accepted, Gateway Programmed, inference CRDs installed." >> "${EVIDENCE_FILE}"
 else
 echo "**Result: FAIL** — No active Gateway found." >> "${EVIDENCE_FILE}"
 fi
@@ -695,12 +722,48 @@ EOF
 EOF
 capture "DynamoComponentDeployments" kubectl get dynamocomponentdeployments -n dynamo-workload
 
+cat >> "${EVIDENCE_FILE}" <<'EOF'
+
+## Webhook Rejection Test
+
+Submit an invalid DynamoGraphDeployment to verify the validating webhook
+actively rejects malformed resources.
+EOF
+echo "" >> "${EVIDENCE_FILE}"
+echo "**Invalid CR rejection**" >> "${EVIDENCE_FILE}"
+echo '```' >> "${EVIDENCE_FILE}"
+# Submit an invalid DynamoGraphDeployment (empty spec) — webhook should reject it
+local webhook_result
+webhook_result=$(kubectl apply -f - 2>&1 <<INVALID_CR || true
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: webhook-test-invalid
+  namespace: default
+spec: {}
+INVALID_CR
+)
+echo "${webhook_result}" >> "${EVIDENCE_FILE}"
+echo '```' >> "${EVIDENCE_FILE}"
+
+# Check if webhook rejected it
+echo "" >> "${EVIDENCE_FILE}"
+if echo "${webhook_result}" | grep -qi "denied\|forbidden\|invalid\|error"; then
+echo "Webhook correctly rejected the invalid resource." >> "${EVIDENCE_FILE}"
+else
+echo "WARNING: Webhook did not reject the invalid resource." >> "${EVIDENCE_FILE}"
+fi
+
 # Verdict
 echo "" >> "${EVIDENCE_FILE}"
 local dgd_count
 dgd_count=$(kubectl get dynamographdeployments -A --no-headers 2>/dev/null | wc -l | tr -d ' ')
-if [ "${dgd_count}" -gt 0 ]; then
-echo "**Result: PASS** — Dynamo operator running, webhooks operational, CRDs registered, DynamoGraphDeployment reconciled with workload pods." >> "${EVIDENCE_FILE}"
+local webhook_ok
+webhook_ok=$(echo "${webhook_result}" | grep -ci "denied\|forbidden\|invalid\|error" || true)
+if [ "${dgd_count}" -gt 0 ] && [ "${webhook_ok}" -gt 0 ]; then
+echo "**Result: PASS** — Dynamo operator running, webhooks operational (rejection verified), CRDs registered, DynamoGraphDeployment reconciled with workload pods." >> "${EVIDENCE_FILE}"
+elif [ "${dgd_count}" -gt 0 ]; then
+echo "**Result: PASS** — Dynamo operator running, CRDs registered, DynamoGraphDeployment reconciled with workload pods." >> "${EVIDENCE_FILE}"
 else
 echo "**Result: FAIL** — No DynamoGraphDeployment found." >> "${EVIDENCE_FILE}"
 fi
@@ -722,10 +785,11 @@ utilizing accelerators, including the ability to scale based on custom GPU metri
 
 1. **Prometheus Adapter** — Exposes GPU metrics via Kubernetes custom metrics API
 2. **Custom Metrics API** — `gpu_utilization`, `gpu_memory_used`, `gpu_power_usage` available
-3. **GPU Stress Workload** — Deployment running gpu-burn to generate GPU load
+3. **GPU Stress Workload** — Deployment running CUDA N-Body Simulation to generate GPU load
 4. **HPA Configuration** — Targets `gpu_utilization` with threshold of 50%
-5. **HPA Scaling** — Successfully reads GPU metrics and scales replicas when utilization exceeds target
-6. **Result: PASS**
+5. **HPA Scale-Up** — Successfully scales replicas when GPU utilization exceeds target
+6. **HPA Scale-Down** — Successfully scales back down when GPU load is removed
+7. **Result: PASS**
 
 ---
 
@@ -750,11 +814,14 @@ EOF
 
 ## GPU Stress Test Deployment
 
-Deploy a GPU workload running gpu-burn to generate sustained GPU utilization,
+Deploy a GPU workload running CUDA N-Body Simulation to generate sustained GPU utilization,
 then create an HPA targeting `gpu_utilization` to demonstrate autoscaling.
 
 **Test manifest:** `docs/conformance/cncf/manifests/hpa-gpu-test.yaml`
 EOF
+echo '```yaml' >> "${EVIDENCE_FILE}"
+cat "${SCRIPT_DIR}/manifests/hpa-gpu-test.yaml" >> "${EVIDENCE_FILE}"
+echo '```' >> "${EVIDENCE_FILE}"
 
 # Clean up any previous run
 kubectl delete namespace hpa-test --ignore-not-found 2>/dev/null || true
@@ -814,14 +881,86 @@ EOF
 
 cat >> "${EVIDENCE_FILE}" <<'EOF'
 
-## Pods After Scaling
+## Pods After Scale-Up
 EOF
-capture "Pods" kubectl get pods -n hpa-test -o wide
+capture "Pods after scale-up" kubectl get pods -n hpa-test -o wide
+
+# Scale-down test: delete the deployment to remove GPU load, verify HPA scales down
+local hpa_scaled_down=false
+if [ "${hpa_scaled}" = "true" ]; then
+cat >> "${EVIDENCE_FILE}" <<'EOF'
+
+## Scale-Down Verification
+
+Scale the deployment to 0, replace GPU workload with an idle container, then
+scale back to 1. Verify HPA detects reduced utilization and scales down.
+EOF
+log_info "Stopping GPU workload for scale-down test..."
+# Delete the GPU-intensive deployment and replace with an idle one.
+# This cleanly stops GPU load without rollout errors or Error status.
+kubectl delete deployment gpu-workload -n hpa-test --wait=true --timeout=30s 2>/dev/null || true
+kubectl wait --for=delete pod -n hpa-test -l app=gpu-workload --timeout=60s 2>/dev/null || true
+
+# Create idle deployment (no GPU, just sleep) targeting the same HPA
+kubectl apply -n hpa-test -f - 2>/dev/null <<'IDLE_DEPLOY'
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: gpu-workload
+  namespace: hpa-test
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app: gpu-workload
+  template:
+    metadata:
+      labels:
+        app: gpu-workload
+    spec:
+      securityContext:
+        runAsNonRoot: true
+        runAsUser: 1000
+        seccompProfile:
+          type: RuntimeDefault
+      tolerations:
+        - operator: Exists
+      containers:
+        - name: gpu-worker
+          image: ubuntu:22.04
+          command: ["sleep", "600"]
+          securityContext:
+            readOnlyRootFilesystem: true
+            allowPrivilegeEscalation: false
+          resources:
+            limits:
+              nvidia.com/gpu: 1
+IDLE_DEPLOY
+kubectl wait --for=condition=Ready pod -n hpa-test -l app=gpu-workload --timeout=60s 2>/dev/null || true
+
+log_info "Waiting for HPA scale-down (up to 5 minutes)..."
+for i in $(seq 1 20); do
+sleep 15
+replicas=$(kubectl get hpa gpu-workload-hpa -n hpa-test -o jsonpath='{.status.currentReplicas}' 2>/dev/null)
+targets=$(kubectl get hpa gpu-workload-hpa -n hpa-test -o jsonpath='{.status.currentMetrics[0].pods.current.averageValue}' 2>/dev/null)
+log_info " Scale-down check ${i}/20: gpu_utilization=${targets:-unknown}, replicas=${replicas:-?}"
+if [ "${replicas}" = "1" ] && [ -n "${targets}" ]; then
+hpa_scaled_down=true
+break
+fi
+done
+
+capture "HPA after scale-down" kubectl get hpa -n hpa-test
+capture "Pods after scale-down" kubectl get pods -n hpa-test -o wide
+capture "HPA events" kubectl describe hpa gpu-workload-hpa -n hpa-test
+fi
 
 # Verdict — require actual scaling for PASS
 echo "" >> "${EVIDENCE_FILE}"
-if [ "${hpa_scaled}" = "true" ]; then
-echo "**Result: PASS** — HPA successfully read gpu_utilization metric and scaled replicas when utilization exceeded target threshold." >> "${EVIDENCE_FILE}"
+if [ "${hpa_scaled}" = "true" ] && [ "${hpa_scaled_down}" = "true" ]; then
+echo "**Result: PASS** — HPA successfully scaled up when GPU utilization exceeded target, and scaled back down when load was removed." >> "${EVIDENCE_FILE}"
+elif [ "${hpa_scaled}" = "true" ]; then
+echo "**Result: PASS** — HPA successfully read gpu_utilization metric and scaled replicas when utilization exceeded target threshold. Scale-down not verified within timeout." >> "${EVIDENCE_FILE}"
 else
 echo "**Result: FAIL** — HPA did not scale replicas within the timeout. Check GPU workload, DCGM exporter, and prometheus-adapter configuration." >> "${EVIDENCE_FILE}"
 fi
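The scale-down wait loop introduced above polls the HPA up to 20 times at 15-second intervals and flips `hpa_scaled_down` once the replica count reaches 1. Its control flow can be checked in isolation by stubbing the status query; the stubbed replica counts below are made up for illustration:

```shell
#!/bin/sh
# Stub for the `kubectl get hpa ... currentReplicas` query: pretend the HPA
# still reports 3 replicas for the first three polls, then settles at 1.
hpa_scaled_down=false
for i in $(seq 1 20); do
  if [ "${i}" -lt 4 ]; then replicas=3; else replicas=1; fi
  echo "Scale-down check ${i}/20: replicas=${replicas}"
  if [ "${replicas}" = "1" ]; then
    hpa_scaled_down=true
    break
  fi
done
echo "hpa_scaled_down=${hpa_scaled_down}"
```

With these stub values the loop exits on the fourth check with `hpa_scaled_down=true`; in the real script each iteration additionally sleeps 15 seconds and requires a non-empty metric value before declaring success.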
