Commit ec60acc

feat: enhance conformance evidence collection with gateway, webhook, and HPA scale-down tests
Enhance the evidence collection script and regenerate all evidence with
additional checks inspired by the Go-based conformance validator.

Script enhancements:
- Gateway: verify GatewayClass Accepted and Gateway Programmed conditions (not just existence)
- Robust operator: add webhook rejection test (submit invalid CR, verify webhook denies it)
- HPA: add scale-down verification after scale-up (replace GPU workload with idle container, verify HPA scales back to minReplicas)
- HPA: fix pod Error status during scale-down by deleting deployment cleanly before creating idle replacement
- Fix capture function to strip absolute paths from command display
- Fix namespace deletion race with kubectl wait --for=delete
- Tighten HPA verdict to require actual scaling for PASS
- Add early exit for unhealthy pods in HPA wait loop
- Remove readOnlyRootFilesystem from DRA test manifests (blocks CDI device injection)
- Replace gpu-burn references with CUDA N-Body Simulation
- Sanitize AMI ID in cluster-autoscaling evidence

Evidence regenerated:
- All 8 conformance requirements: PASS
- No leaked local paths or sensitive information
- Consistent format across all evidence documents

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
1 parent 5f2a827 commit ec60acc

13 files changed: +629 −486 lines

docs/conformance/cncf/README.md

Lines changed: 11 additions & 6 deletions
@@ -32,7 +32,8 @@ docs/conformance/cncf/
 ├── accelerator-metrics.md
 ├── inference-gateway.md
 ├── robust-operator.md
-└── pod-autoscaling.md
+├── pod-autoscaling.md
+└── cluster-autoscaling.md
 ```
 
 ## Usage
@@ -73,11 +74,12 @@ running actual GPU workloads on the cluster:
 ./docs/conformance/cncf/collect-evidence.sh gateway
 ./docs/conformance/cncf/collect-evidence.sh operator
 ./docs/conformance/cncf/collect-evidence.sh hpa
+./docs/conformance/cncf/collect-evidence.sh cluster-autoscaling
 ```
 
-> **Note:** The HPA test (`hpa`) deploys gpu-burn to stress the GPU and waits for
-> HPA to scale up. This takes ~5 minutes due to metric propagation through the
-> DCGM → Prometheus → prometheus-adapter → HPA pipeline.
+> **Note:** The HPA test (`hpa`) deploys a GPU stress workload (nbody) and waits
+> for HPA to scale up, then verifies scale-down. This takes ~5 minutes due to
+> metric propagation through the DCGM → Prometheus → prometheus-adapter → HPA pipeline.
 
 ### Why Two Steps?
 
@@ -88,15 +90,18 @@ running actual GPU workloads on the cluster:
 | DRA GPU allocation test | No | Yes |
 | Gang scheduling test | No | Yes |
 | Device isolation verification | No | Yes |
-| HPA scaling with GPU load | No | Yes |
+| Gateway condition checks (Accepted, Programmed) | No | Yes |
+| Webhook rejection test | No | Yes |
+| HPA scale-up and scale-down with GPU load | No | Yes |
 | Prometheus query results | No | Yes |
+| Cluster autoscaling (ASG config) | No | Yes |
 
 `aicr validate` checks that components are deployed correctly. `collect-evidence.sh`
 verifies they work correctly by running actual workloads. Both are needed for
 complete conformance evidence.
 
 > **Future:** Behavioral tests are inherently long-running (e.g., HPA test deploys
-> gpu-burn and waits ~5 minutes for metric propagation and scaling) and are better
+> CUDA N-Body Simulation and waits ~5 minutes for metric propagation and scaling) and are better
 > suited as a separate step than blocking `aicr validate`. A follow-up integration
 > is tracked in [#192](https://github.com/NVIDIA/aicr/issues/192).
 

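The Note in the README above relies on the DCGM exporter → Prometheus → prometheus-adapter → HPA metric path, which in turn needs an adapter rule that renames a DCGM series into the `gpu_utilization` custom metric. A hedged sketch of what such a rule typically looks like; the series and label names here are illustrative and may differ from the cluster's actual adapter config:

```yaml
# Illustrative prometheus-adapter rule (not taken from this repository):
# expose the DCGM GPU utilization series as the custom metric
# "gpu_utilization", attached to namespace/pod resources. Label names
# such as exported_namespace/exported_pod vary by scrape configuration.
rules:
  - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{exported_namespace!="",exported_pod!=""}'
    resources:
      overrides:
        exported_namespace: {resource: "namespace"}
        exported_pod: {resource: "pod"}
    name:
      as: "gpu_utilization"
    metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```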
docs/conformance/cncf/collect-evidence.sh

Lines changed: 145 additions & 15 deletions
@@ -606,17 +606,38 @@ EOF
 
 cat >> "${EVIDENCE_FILE}" <<'EOF'
 
+### Gateway Conditions
+
+Verify GatewayClass is Accepted and Gateway is Programmed (not just created).
+EOF
+# Check GatewayClass Accepted condition
+echo "" >> "${EVIDENCE_FILE}"
+echo "**GatewayClass conditions**" >> "${EVIDENCE_FILE}"
+echo '```' >> "${EVIDENCE_FILE}"
+kubectl get gatewayclass kgateway -o jsonpath='{range .status.conditions[*]}{.type}: {.status} ({.reason}){"\n"}{end}' >> "${EVIDENCE_FILE}" 2>&1
+echo '```' >> "${EVIDENCE_FILE}"
+
+# Check Gateway Programmed condition
+echo "" >> "${EVIDENCE_FILE}"
+echo "**Gateway conditions**" >> "${EVIDENCE_FILE}"
+echo '```' >> "${EVIDENCE_FILE}"
+kubectl get gateway inference-gateway -n kgateway-system -o jsonpath='{range .status.conditions[*]}{.type}: {.status} ({.reason}){"\n"}{end}' >> "${EVIDENCE_FILE}" 2>&1
+echo '```' >> "${EVIDENCE_FILE}"
+
+cat >> "${EVIDENCE_FILE}" <<'EOF'
+
 ## Inference Resources
 EOF
 capture "InferencePools" kubectl get inferencepools -A
 capture "HTTPRoutes" kubectl get httproutes -A
 
-# Verdict
+# Verdict — check both GatewayClass Accepted and Gateway Programmed
 echo "" >> "${EVIDENCE_FILE}"
-local gw_count
-gw_count=$(kubectl get gateways -A --no-headers 2>/dev/null | wc -l | tr -d ' ')
-if [ "${gw_count}" -gt 0 ]; then
-  echo "**Result: PASS** — kgateway controller running, Gateway API and inference extension CRDs installed, active Gateway programmed with external address." >> "${EVIDENCE_FILE}"
+local gw_accepted gw_programmed
+gw_accepted=$(kubectl get gatewayclass kgateway -o jsonpath='{.status.conditions[?(@.type=="Accepted")].status}' 2>/dev/null)
+gw_programmed=$(kubectl get gateway inference-gateway -n kgateway-system -o jsonpath='{.status.conditions[?(@.type=="Programmed")].status}' 2>/dev/null)
+if [ "${gw_accepted}" = "True" ] && [ "${gw_programmed}" = "True" ]; then
+  echo "**Result: PASS** — kgateway controller running, GatewayClass Accepted, Gateway Programmed, inference CRDs installed." >> "${EVIDENCE_FILE}"
 else
   echo "**Result: FAIL** — No active Gateway found." >> "${EVIDENCE_FILE}"
 fi
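The tightened gateway verdict reduces to a pure function of the two condition values; a minimal sketch (the function name is mine, not from the script):

```shell
#!/bin/sh
# Sketch of the gateway verdict above: PASS only when the GatewayClass
# "Accepted" condition AND the Gateway "Programmed" condition are "True".
gateway_verdict() {
  accepted="$1"
  programmed="$2"
  if [ "${accepted}" = "True" ] && [ "${programmed}" = "True" ]; then
    echo "PASS"
  else
    echo "FAIL"
  fi
}

gateway_verdict "True" "True"    # prints PASS
gateway_verdict "True" "False"   # prints FAIL: Gateway exists but is not Programmed
```

Note this is stricter than the replaced `wc -l` check, which passed as soon as any Gateway object existed, whether or not it was ever programmed.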
@@ -695,12 +716,48 @@ EOF
 EOF
 capture "DynamoComponentDeployments" kubectl get dynamocomponentdeployments -n dynamo-workload
 
+cat >> "${EVIDENCE_FILE}" <<'EOF'
+
+## Webhook Rejection Test
+
+Submit an invalid DynamoGraphDeployment to verify the validating webhook
+actively rejects malformed resources.
+EOF
+echo "" >> "${EVIDENCE_FILE}"
+echo "**Invalid CR rejection**" >> "${EVIDENCE_FILE}"
+echo '```' >> "${EVIDENCE_FILE}"
+# Submit an invalid DynamoGraphDeployment (empty spec) — webhook should reject it
+local webhook_result
+webhook_result=$(kubectl apply -f - 2>&1 <<INVALID_CR || true
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: webhook-test-invalid
+  namespace: default
+spec: {}
+INVALID_CR
+)
+echo "${webhook_result}" >> "${EVIDENCE_FILE}"
+echo '```' >> "${EVIDENCE_FILE}"
+
+# Check if webhook rejected it
+echo "" >> "${EVIDENCE_FILE}"
+if echo "${webhook_result}" | grep -qi "denied\|forbidden\|invalid\|error"; then
+  echo "Webhook correctly rejected the invalid resource." >> "${EVIDENCE_FILE}"
+else
+  echo "WARNING: Webhook did not reject the invalid resource." >> "${EVIDENCE_FILE}"
+fi
+
 # Verdict
 echo "" >> "${EVIDENCE_FILE}"
 local dgd_count
 dgd_count=$(kubectl get dynamographdeployments -A --no-headers 2>/dev/null | wc -l | tr -d ' ')
-if [ "${dgd_count}" -gt 0 ]; then
-  echo "**Result: PASS** — Dynamo operator running, webhooks operational, CRDs registered, DynamoGraphDeployment reconciled with workload pods." >> "${EVIDENCE_FILE}"
+local webhook_ok
+webhook_ok=$(echo "${webhook_result}" | grep -ci "denied\|forbidden\|invalid\|error" || true)
+if [ "${dgd_count}" -gt 0 ] && [ "${webhook_ok}" -gt 0 ]; then
+  echo "**Result: PASS** — Dynamo operator running, webhooks operational (rejection verified), CRDs registered, DynamoGraphDeployment reconciled with workload pods." >> "${EVIDENCE_FILE}"
+elif [ "${dgd_count}" -gt 0 ]; then
+  echo "**Result: PASS** — Dynamo operator running, CRDs registered, DynamoGraphDeployment reconciled with workload pods." >> "${EVIDENCE_FILE}"
 else
   echo "**Result: FAIL** — No DynamoGraphDeployment found." >> "${EVIDENCE_FILE}"
 fi
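The rejection check keys on error keywords in the `kubectl apply` output. A self-contained sketch of that classification; the sample messages are illustrative, not captured from a real cluster:

```shell
#!/bin/sh
# Mirror of the keyword test used in the hunk above: treat the apply
# output as a rejection if it contains any denial indicator (GNU grep
# \| alternation in basic regex, as the script uses).
is_rejected() {
  printf '%s\n' "$1" | grep -qi 'denied\|forbidden\|invalid\|error' \
    && echo "rejected" || echo "accepted"
}

is_rejected 'Error from server: admission webhook denied the request'  # prints rejected
is_rejected 'deployment.apps/nginx created'                            # prints accepted
```

One caveat: broad keywords like `invalid` can match the resource name itself (the test CR is named `webhook-test-invalid`), so a success message echoing that name would also classify as rejected; checking the `kubectl apply` exit status would be a stricter signal.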
@@ -722,10 +779,11 @@ utilizing accelerators, including the ability to scale based on custom GPU metri
 
 1. **Prometheus Adapter** — Exposes GPU metrics via Kubernetes custom metrics API
 2. **Custom Metrics API** — `gpu_utilization`, `gpu_memory_used`, `gpu_power_usage` available
-3. **GPU Stress Workload** — Deployment running gpu-burn to generate GPU load
+3. **GPU Stress Workload** — Deployment running CUDA N-Body Simulation to generate GPU load
 4. **HPA Configuration** — Targets `gpu_utilization` with threshold of 50%
-5. **HPA Scaling** — Successfully reads GPU metrics and scales replicas when utilization exceeds target
-6. **Result: PASS**
+5. **HPA Scale-Up** — Successfully scales replicas when GPU utilization exceeds target
+6. **HPA Scale-Down** — Successfully scales back down when GPU load is removed
+7. **Result: PASS**
 
 ---
 
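For orientation, an HPA matching the configuration described in item 4 might look like the following sketch. The object names (`gpu-workload-hpa`, namespace `hpa-test`, target deployment `gpu-workload`) appear elsewhere in this commit, but `maxReplicas` and the exact layout are assumptions, not copied from `hpa-gpu-test.yaml`:

```yaml
# Hedged sketch: autoscaling/v2 HPA targeting the custom pods metric
# gpu_utilization at an average value of 50 (percent utilization).
# maxReplicas is an assumption, not taken from the actual manifest.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-workload-hpa
  namespace: hpa-test
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gpu-workload
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: "50"
```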
@@ -750,7 +808,7 @@ EOF
 
 ## GPU Stress Test Deployment
 
-Deploy a GPU workload running gpu-burn to generate sustained GPU utilization,
+Deploy a GPU workload running CUDA N-Body Simulation to generate sustained GPU utilization,
 then create an HPA targeting `gpu_utilization` to demonstrate autoscaling.
 
 **Test manifest:** `docs/conformance/cncf/manifests/hpa-gpu-test.yaml`
@@ -814,14 +872,86 @@ EOF
 
 cat >> "${EVIDENCE_FILE}" <<'EOF'
 
-## Pods After Scaling
+## Pods After Scale-Up
 EOF
-capture "Pods" kubectl get pods -n hpa-test -o wide
+capture "Pods after scale-up" kubectl get pods -n hpa-test -o wide
+
+# Scale-down test: delete the deployment to remove GPU load, verify HPA scales down
+local hpa_scaled_down=false
+if [ "${hpa_scaled}" = "true" ]; then
+  cat >> "${EVIDENCE_FILE}" <<'EOF'
+
+## Scale-Down Verification
+
+Scale the deployment to 0, replace GPU workload with an idle container, then
+scale back to 1. Verify HPA detects reduced utilization and scales down.
+EOF
+  log_info "Stopping GPU workload for scale-down test..."
+  # Delete the GPU-intensive deployment and replace with an idle one.
+  # This cleanly stops GPU load without rollout errors or Error status.
+  kubectl delete deployment gpu-workload -n hpa-test --wait=true --timeout=30s 2>/dev/null || true
+  kubectl wait --for=delete pod -n hpa-test -l app=gpu-workload --timeout=60s 2>/dev/null || true
+
+  # Create idle deployment (no GPU, just sleep) targeting the same HPA
+  kubectl apply -n hpa-test -f - 2>/dev/null <<'IDLE_DEPLOY'
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: gpu-workload
+  namespace: hpa-test
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app: gpu-workload
+  template:
+    metadata:
+      labels:
+        app: gpu-workload
+    spec:
+      securityContext:
+        runAsNonRoot: true
+        runAsUser: 1000
+        seccompProfile:
+          type: RuntimeDefault
+      tolerations:
+      - operator: Exists
+      containers:
+      - name: gpu-worker
+        image: ubuntu:22.04
+        command: ["sleep", "600"]
+        securityContext:
+          readOnlyRootFilesystem: true
+          allowPrivilegeEscalation: false
+        resources:
+          limits:
+            nvidia.com/gpu: 1
+IDLE_DEPLOY
+  kubectl wait --for=condition=Ready pod -n hpa-test -l app=gpu-workload --timeout=60s 2>/dev/null || true
+
+  log_info "Waiting for HPA scale-down (up to 5 minutes)..."
+  for i in $(seq 1 20); do
+    sleep 15
+    replicas=$(kubectl get hpa gpu-workload-hpa -n hpa-test -o jsonpath='{.status.currentReplicas}' 2>/dev/null)
+    targets=$(kubectl get hpa gpu-workload-hpa -n hpa-test -o jsonpath='{.status.currentMetrics[0].pods.current.averageValue}' 2>/dev/null)
+    log_info "  Scale-down check ${i}/20: gpu_utilization=${targets:-unknown}, replicas=${replicas:-?}"
+    if [ "${replicas}" = "1" ] && [ -n "${targets}" ]; then
+      hpa_scaled_down=true
+      break
+    fi
+  done
+
+  capture "HPA after scale-down" kubectl get hpa -n hpa-test
+  capture "Pods after scale-down" kubectl get pods -n hpa-test -o wide
+  capture "HPA events" kubectl describe hpa gpu-workload-hpa -n hpa-test
+fi
 
 # Verdict — require actual scaling for PASS
 echo "" >> "${EVIDENCE_FILE}"
-if [ "${hpa_scaled}" = "true" ]; then
-  echo "**Result: PASS** — HPA successfully read gpu_utilization metric and scaled replicas when utilization exceeded target threshold." >> "${EVIDENCE_FILE}"
+if [ "${hpa_scaled}" = "true" ] && [ "${hpa_scaled_down}" = "true" ]; then
+  echo "**Result: PASS** — HPA successfully scaled up when GPU utilization exceeded target, and scaled back down when load was removed." >> "${EVIDENCE_FILE}"
+elif [ "${hpa_scaled}" = "true" ]; then
+  echo "**Result: PASS** — HPA successfully read gpu_utilization metric and scaled replicas when utilization exceeded target threshold. Scale-down not verified within timeout." >> "${EVIDENCE_FILE}"
 else
   echo "**Result: FAIL** — HPA did not scale replicas within the timeout. Check GPU workload, DCGM exporter, and prometheus-adapter configuration." >> "${EVIDENCE_FILE}"
 fi
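The tightened HPA verdict is a three-way decision on the scale-up and scale-down flags; a minimal sketch (the function name and wording are mine):

```shell
#!/bin/sh
# Three-way HPA verdict: full PASS needs scale-up AND scale-down;
# scale-up alone still passes with a caveat; otherwise FAIL.
hpa_verdict() {
  up="$1"
  down="$2"
  if [ "${up}" = "true" ] && [ "${down}" = "true" ]; then
    echo "PASS: scale-up and scale-down verified"
  elif [ "${up}" = "true" ]; then
    echo "PASS: scale-up only, scale-down not verified within timeout"
  else
    echo "FAIL: no scaling observed"
  fi
}

hpa_verdict true true    # prints PASS: scale-up and scale-down verified
hpa_verdict true false   # prints PASS: scale-up only, scale-down not verified within timeout
hpa_verdict false false  # prints FAIL: no scaling observed
```

The ordering matters: the scale-down flag alone never yields a PASS, since replicas resting at minReplicas without a prior scale-up proves nothing about the metric pipeline.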
