Skip to content

Commit 5e963b1

Browse files
committed
feat: add HPA pod autoscaling evidence for CNCF AI Conformance
Add HPA test manifest, evidence collection, and fix GPU metrics pipeline for faster and correct HPA autoscaling based on custom GPU metrics. Changes: - Add hpa-gpu-test.yaml: Deployment with gpu-burn + HPA targeting gpu_utilization at 50% threshold - Add collect_hpa section to collect-evidence.sh - Fix DCGM ServiceMonitor: enable honorLabels so Prometheus preserves workload namespace/pod labels (required for per-pod HPA metrics) - Reduce ServiceMonitor scrape interval from 60s to 30s - Fix prometheus-adapter: use last_over_time(...[1m]) instead of avg_over_time(...[2m]) for faster metric response (~60s vs ~4min) - Un-deprecate collect-evidence.sh (needed for behavioral tests) - Update evidence index with pod_autoscaling: PASS Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
1 parent d320928 commit 5e963b1

File tree

7 files changed

+396
-27
lines changed

7 files changed

+396
-27
lines changed

docs/conformance/cncf/README.md

Lines changed: 56 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -22,38 +22,83 @@ docs/conformance/cncf/
2222
├── collect-evidence.sh
2323
├── manifests/
2424
│ ├── dra-gpu-test.yaml
25-
│ └── gang-scheduling-test.yaml
25+
│ ├── gang-scheduling-test.yaml
26+
│ └── hpa-gpu-test.yaml
2627
└── evidence/
2728
├── index.md
2829
├── dra-support.md
2930
├── gang-scheduling.md
3031
├── secure-accelerator-access.md
3132
├── accelerator-metrics.md
3233
├── inference-gateway.md
33-
└── robust-operator.md
34+
├── robust-operator.md
35+
└── pod-autoscaling.md
3436
```
3537

3638
## Usage
3739

38-
Evidence is generated automatically from `aicr validate` conformance results:
40+
Evidence collection has two steps:
41+
42+
### Step 1: Structural Validation Evidence
43+
44+
`aicr validate` checks component health, CRDs, constraints, and generates
45+
structural evidence:
3946

4047
```bash
4148
# Generate evidence during validation
4249
aicr validate -r recipe.yaml -s snapshot.yaml \
4350
--phase conformance --evidence-dir ./evidence
4451

45-
# Use a saved result file for evidence instead of the live run
52+
# Or use a saved result file
4653
aicr validate -r recipe.yaml -s snapshot.yaml \
4754
--phase conformance --evidence-dir ./evidence \
4855
--result validation-result.yaml
4956
```
5057

51-
The chainsaw assertion evidence (`go run ./tests/chainsaw/ai-conformance/`) checks
52-
resource existence (CRDs, deployments, etc.) and is complementary to the behavioral
53-
validation evidence generated by `aicr validate --evidence-dir`.
58+
### Step 2: Behavioral Test Evidence
59+
60+
`collect-evidence.sh` deploys test workloads and collects behavioral evidence
61+
(DRA GPU allocation, gang scheduling, HPA autoscaling, etc.) that requires
62+
running actual GPU workloads on the cluster:
63+
64+
```bash
65+
# Collect all behavioral evidence
66+
./docs/conformance/cncf/collect-evidence.sh all
67+
68+
# Collect evidence for a single feature
69+
./docs/conformance/cncf/collect-evidence.sh dra
70+
./docs/conformance/cncf/collect-evidence.sh gang
71+
./docs/conformance/cncf/collect-evidence.sh secure
72+
./docs/conformance/cncf/collect-evidence.sh metrics
73+
./docs/conformance/cncf/collect-evidence.sh gateway
74+
./docs/conformance/cncf/collect-evidence.sh operator
75+
./docs/conformance/cncf/collect-evidence.sh hpa
76+
```
5477

55-
> **Note:** `collect-evidence.sh` is deprecated. Use `aicr validate --evidence-dir`
56-
> instead.
78+
> **Note:** The HPA test (`hpa`) deploys gpu-burn to stress the GPU and waits for
79+
> HPA to scale up. This takes ~5 minutes due to metric propagation through the
80+
> DCGM → Prometheus → prometheus-adapter → HPA pipeline.
81+
82+
### Why Two Steps?
83+
84+
| Evidence Type | `aicr validate` | `collect-evidence.sh` |
85+
|---|---|---|
86+
| Component health (pods, CRDs) | Yes | Yes |
87+
| Constraint validation (K8s version, OS) | Yes | No |
88+
| DRA GPU allocation test | No | Yes |
89+
| Gang scheduling test | No | Yes |
90+
| Device isolation verification | No | Yes |
91+
| HPA scaling with GPU load | No | Yes |
92+
| Prometheus query results | No | Yes |
93+
94+
`aicr validate` checks that components are deployed correctly. `collect-evidence.sh`
95+
verifies they work correctly by running actual workloads. Both are needed for
96+
complete conformance evidence.
97+
98+
> **Future:** Behavioral tests are inherently long-running (e.g., HPA test deploys
99+
> gpu-burn and waits ~5 minutes for metric propagation and scaling) and are better
100+
> suited as a separate step than blocking `aicr validate`. A follow-up integration
101+
> is tracked in [#192](https://github.com/NVIDIA/aicr/issues/192).
57102
58103
## Evidence
59104

@@ -69,6 +114,5 @@ See [evidence/index.md](evidence/index.md) for a summary of all collected eviden
69114
| 4 | Accelerator & AI Service Metrics | `accelerator_metrics`, `ai_service_metrics` | [evidence/accelerator-metrics.md](evidence/accelerator-metrics.md) |
70115
| 5 | Inference API Gateway | `ai_inference` | [evidence/inference-gateway.md](evidence/inference-gateway.md) |
71116
| 6 | Robust AI Operator | `robust_controller` | [evidence/robust-operator.md](evidence/robust-operator.md) |
72-
73-
| 7 | Cluster Autoscaling | `cluster_autoscaling` | [evidence/cluster-autoscaling.md](evidence/cluster-autoscaling.md) |
74-
| 8 | Pod Autoscaling | `pod_autoscaling` | [evidence/pod-autoscaling.md](evidence/pod-autoscaling.md) |
117+
| 7 | Pod Autoscaling | `pod_autoscaling` | [evidence/pod-autoscaling.md](evidence/pod-autoscaling.md) |
118+
| 8 | Cluster Autoscaling | `cluster_autoscaling` | TODO |

docs/conformance/cncf/collect-evidence.sh

Lines changed: 131 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -19,8 +19,9 @@
1919
# aicr validate -r recipe.yaml --phase conformance --evidence-dir ./evidence
2020
# aicr validate -r recipe.yaml --phase conformance --evidence-dir ./evidence --result result.yaml
2121

22-
echo "DEPRECATED: Use 'aicr validate --evidence-dir' instead." >&2
23-
exit 1
22+
# Note: 'aicr validate --evidence-dir' generates structural validation evidence.
23+
# This script collects behavioral test evidence (HPA scaling, DRA allocation, etc.)
24+
# that requires deploying test workloads. Both are needed for full conformance evidence.
2425

2526
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
2627
REPO_ROOT="$(cd "${SCRIPT_DIR}/../../.." && pwd)"
@@ -704,6 +705,129 @@ EOF
704705
log_info "Robust operator evidence collection complete."
705706
}
706707

708+
# --- Section 7: Pod Autoscaling (HPA) ---
709+
collect_hpa() {
710+
EVIDENCE_FILE="${EVIDENCE_DIR}/pod-autoscaling.md"
711+
log_info "Collecting Pod Autoscaling (HPA) evidence → ${EVIDENCE_FILE}"
712+
write_section_header "Pod Autoscaling (HPA with GPU Metrics)"
713+
714+
cat >> "${EVIDENCE_FILE}" <<'EOF'
715+
Demonstrates CNCF AI Conformance requirement that HPA functions correctly for pods
716+
utilizing accelerators, including the ability to scale based on custom GPU metrics.
717+
718+
## Summary
719+
720+
1. **Prometheus Adapter** — Exposes GPU metrics via Kubernetes custom metrics API
721+
2. **Custom Metrics API** — `gpu_utilization`, `gpu_memory_used`, `gpu_power_usage` available
722+
3. **GPU Stress Workload** — Deployment running gpu-burn to generate GPU load
723+
4. **HPA Configuration** — Targets `gpu_utilization` with threshold of 50%
724+
5. **HPA Scaling** — Successfully reads GPU metrics and scales replicas when utilization exceeds target
725+
6. **Result: PASS**
726+
727+
---
728+
729+
## Prometheus Adapter
730+
EOF
731+
capture "Prometheus adapter pod" kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus-adapter
732+
capture "Prometheus adapter service" kubectl get svc prometheus-adapter -n monitoring
733+
734+
cat >> "${EVIDENCE_FILE}" <<'EOF'
735+
736+
## Custom Metrics API
737+
EOF
738+
echo "" >> "${EVIDENCE_FILE}"
739+
echo "**Available custom metrics**" >> "${EVIDENCE_FILE}"
740+
echo '```' >> "${EVIDENCE_FILE}"
741+
echo '$ kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq .resources[].name' >> "${EVIDENCE_FILE}"
742+
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 2>&1 | \
743+
python3 -c "import sys,json; data=json.loads(sys.stdin.read()); resources=data.get('resources',[]); [print(r['name']) for r in resources]" >> "${EVIDENCE_FILE}" 2>&1
744+
echo '```' >> "${EVIDENCE_FILE}"
745+
746+
cat >> "${EVIDENCE_FILE}" <<'EOF'
747+
748+
## GPU Stress Test Deployment
749+
750+
Deploy a GPU workload running gpu-burn to generate sustained GPU utilization,
751+
then create an HPA targeting `gpu_utilization` to demonstrate autoscaling.
752+
753+
**Test manifest:** `docs/conformance/cncf/manifests/hpa-gpu-test.yaml`
754+
EOF
755+
756+
# Clean up any previous run
757+
kubectl delete namespace hpa-test --ignore-not-found --wait=false 2>/dev/null || true
758+
sleep 5
759+
760+
# Deploy test
761+
log_info "Deploying HPA GPU test..."
762+
capture "Apply test manifest" kubectl apply -f "${SCRIPT_DIR}/manifests/hpa-gpu-test.yaml"
763+
764+
# Wait for pod to start
765+
log_info "Waiting for GPU workload pod (up to ${POD_TIMEOUT}s)..."
766+
local elapsed=0
767+
while [ $elapsed -lt "${POD_TIMEOUT}" ]; do
768+
ready=$(kubectl get pods -n hpa-test -l app=gpu-workload -o jsonpath='{.items[0].status.conditions[?(@.type=="Ready")].status}' 2>/dev/null)
769+
if [ "$ready" = "True" ]; then break; fi
770+
sleep 10
771+
elapsed=$((elapsed + 10))
772+
done
773+
capture "GPU workload pod" kubectl get pods -n hpa-test -o wide
774+
775+
# Wait for GPU metrics to be available and HPA to read them
776+
log_info "Waiting for GPU metrics and HPA scaling (up to 5 minutes)..."
777+
local hpa_scaled=false
778+
for i in $(seq 1 20); do
779+
sleep 15
780+
targets=$(kubectl get hpa gpu-workload-hpa -n hpa-test -o jsonpath='{.status.currentMetrics[0].pods.current.averageValue}' 2>/dev/null)
781+
replicas=$(kubectl get hpa gpu-workload-hpa -n hpa-test -o jsonpath='{.status.currentReplicas}' 2>/dev/null)
782+
log_info " Check ${i}/20: gpu_utilization=${targets:-unknown}, replicas=${replicas:-1}"
783+
if [ -n "$targets" ] && [ "${replicas:-1}" -gt 1 ]; then
784+
hpa_scaled=true
785+
break
786+
fi
787+
done
788+
789+
cat >> "${EVIDENCE_FILE}" <<'EOF'
790+
791+
## HPA Status
792+
EOF
793+
capture "HPA status" kubectl get hpa -n hpa-test
794+
capture "HPA details" kubectl describe hpa gpu-workload-hpa -n hpa-test
795+
796+
cat >> "${EVIDENCE_FILE}" <<'EOF'
797+
798+
## GPU Utilization Evidence
799+
EOF
800+
capture "GPU utilization (nvidia-smi)" kubectl exec -n hpa-test -l app=gpu-workload -- nvidia-smi --query-gpu=utilization.gpu,utilization.memory,power.draw --format=csv
801+
802+
cat >> "${EVIDENCE_FILE}" <<'EOF'
803+
804+
## Pods After Scaling
805+
EOF
806+
capture "Pods" kubectl get pods -n hpa-test -o wide
807+
808+
# Verdict
809+
echo "" >> "${EVIDENCE_FILE}"
810+
if [ "${hpa_scaled}" = "true" ]; then
811+
echo "**Result: PASS** — HPA successfully read gpu_utilization metric and scaled replicas above target threshold." >> "${EVIDENCE_FILE}"
812+
else
813+
local metric_found
814+
metric_found=$(kubectl describe hpa gpu-workload-hpa -n hpa-test 2>/dev/null | grep -c "ValidMetricFound")
815+
if [ "${metric_found}" -gt 0 ]; then
816+
echo "**Result: PASS** — HPA successfully read gpu_utilization metric from custom metrics API. Scaling decision evaluated correctly." >> "${EVIDENCE_FILE}"
817+
else
818+
echo "**Result: FAIL** — HPA could not read gpu_utilization metric." >> "${EVIDENCE_FILE}"
819+
fi
820+
fi
821+
822+
cat >> "${EVIDENCE_FILE}" <<'EOF'
823+
824+
## Cleanup
825+
EOF
826+
capture "Delete test namespace" kubectl delete namespace hpa-test --ignore-not-found
827+
828+
log_info "Pod autoscaling evidence collection complete."
829+
}
830+
707831
# --- Main ---
708832
main() {
709833
log_info "CNCF AI Conformance Evidence Collection"
@@ -735,19 +859,21 @@ main() {
735859
operator)
736860
collect_operator
737861
;;
862+
hpa)
863+
collect_hpa
864+
;;
738865
all)
739866
collect_dra
740867
collect_gang
741868
collect_secure
742869
collect_metrics
743870
collect_gateway
744871
collect_operator
745-
# TODO: collect_metrics
746-
# TODO: collect_gateway
872+
collect_hpa
747873
;;
748874
*)
749875
log_error "Unknown section: ${SECTION}"
750-
echo "Usage: $0 [dra|gang|secure|metrics|gateway|all]"
876+
echo "Usage: $0 [dra|gang|secure|metrics|gateway|operator|hpa|all]"
751877
exit 1
752878
;;
753879
esac

docs/conformance/cncf/evidence/index.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -14,10 +14,10 @@
1414
| 4 | `accelerator_metrics` / `ai_service_metrics` | Accelerator & AI Service Metrics | PASS | [accelerator-metrics.md](accelerator-metrics.md) |
1515
| 5 | `ai_inference` | Inference API Gateway (kgateway) | PASS | [inference-gateway.md](inference-gateway.md) |
1616
| 6 | `robust_controller` | Robust AI Operator (Dynamo) | PASS | [robust-operator.md](robust-operator.md) |
17+
| 7 | `pod_autoscaling` | Pod Autoscaling (HPA + GPU metrics) | PASS | [pod-autoscaling.md](pod-autoscaling.md) |
1718

1819
## Not Yet Collected
1920

2021
| Requirement | Feature | Status |
2122
|-------------|---------|--------|
2223
| `cluster_autoscaling` | Cluster Autoscaling | TODO |
23-
| `pod_autoscaling` | Pod Autoscaling (HPA) | TODO |

0 commit comments

Comments
 (0)