
feat(validator): upgrade conformance checks from static to behavioral validation#185

Merged
dims merged 1 commit into NVIDIA:main from dims:behavioral-conformance-upgrades
Feb 23, 2026

Conversation

Collaborator

@dims dims commented Feb 23, 2026

Summary

  • Upgrades 5 CNCF AI Conformance v1.34 checks from static/presence validation to behavioral/functional testing, following the self-contained pattern from secure_access_check.go (create resources → wait for behavior → validate → cleanup)
  • Each check now proves the validated system actually works, not just that its resources exist

Checks upgraded

| Check | Previously validated | Now additionally validated |
| --- | --- | --- |
| secure-accelerator-access | DRA pod succeeds with ResourceClaim | Negative isolation: a pod without claims cannot access GPU devices |
| robust-controller | Dynamo webhook + CRDs exist | Webhook actually rejects an invalid DynamoGraphDeployment |
| inference-gateway | GatewayClass accepted, Gateway programmed, CRDs exist | Data-plane readiness: inference-gateway proxy EndpointSlices have ready endpoints |
| pod-autoscaling | Custom/external metrics APIs have GPU data | HPA reads metrics, computes scale-up (desiredReplicas > currentReplicas), Deployment actually scales |
| cluster-autoscaling | Karpenter deployed, GPU NodePool exists | Full chain: HPA → scale → Karpenter provisions KWOK nodes → pods scheduled |
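The strict scale-up criterion used by the pod-autoscaling check can be illustrated with a minimal sketch. The `hpaStatus` struct below is a stand-in for the two autoscaling/v2 status fields involved; a real check would read them from the API server:

```go
package main

import "fmt"

// hpaStatus mirrors just the two HorizontalPodAutoscaler status fields
// the strict criterion needs; illustrative, not the PR's actual type.
type hpaStatus struct {
	CurrentReplicas int32
	DesiredReplicas int32
}

// wantsScaleUp applies the strict criterion: only an actual
// desiredReplicas > currentReplicas counts as scale-up intent.
// Conditions such as ScalingActive=True are deliberately ignored,
// since they can hold without any intent to add replicas.
func wantsScaleUp(s hpaStatus) bool {
	return s.DesiredReplicas > s.CurrentReplicas
}

func main() {
	fmt.Println(wantsScaleUp(hpaStatus{CurrentReplicas: 1, DesiredReplicas: 2})) // scale-up intent
	fmt.Println(wantsScaleUp(hpaStatus{CurrentReplicas: 2, DesiredReplicas: 2})) // steady state
}
```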

Key design decisions

  • Autoscaling checks are NOT KWOK-specific — they use external metrics (dcgm_gpu_power_usage) which are cluster-wide, working on any cluster with DCGM + prometheus-adapter
  • HPA strict criterion: desiredReplicas > currentReplicas only (not ScalingActive=True which can be true without scale intent)
  • Unique per-run namespaces with random suffix prevent cross-run interference from async cleanup
  • Namespace cleanup uses context.Background() with bounded timeout so cleanup runs even when parent context is canceled
  • Cluster-autoscaling discovers all GPU NodePools and tries each in the behavioral chain (first success wins)
  • Removed redundant --exercise steps from CI workflows (behavioral validation now runs inside aicr validate)

Test plan

  • All 42 packages pass with the -race detector
  • New test cases: multi-NodePool discovery, non-gateway EndpointSlice filtering, deployment scale verification, pod scheduling poll loop, namespace cleanup on canceled context
  • GPU CI workflows pass end-to-end (inference + training)

@dims dims requested review from a team as code owners February 23, 2026 04:52
@dims dims force-pushed the behavioral-conformance-upgrades branch 4 times, most recently from 8d069cf to 8146aa9 on February 23, 2026 11:34
… validation

Upgrade 5 conformance checks from static/presence validation to
behavioral/functional testing per CNCF AI Conformance v1.34 section 11.1:

- cluster-autoscaling: Full behavioral chain — HPA scaling intent,
  Karpenter KWOK node provisioning, pod scheduling verification.
  Discovers GPU NodePools dynamically, tries each until one succeeds.
- pod-autoscaling: HPA reads external GPU metrics (dcgm_gpu_power_usage)
  and computes scale-up, Deployment actually scales replicas.
- inference-gateway: Data-plane readiness via EndpointSlice verification,
  HTTPRoute discovery (informational).
- robust-controller: Webhook rejection test — creates invalid
  DynamoGraphDeployment, verifies admission webhook rejects it.
- secure-accelerator-access: Negative isolation test — pod without
  ResourceClaims cannot access GPU devices.

Removes kwok/scripts/validate-cluster-autoscaling.sh (setup logic inlined
in CI workflows, exercise logic replaced by Go conformance check).
@dims dims force-pushed the behavioral-conformance-upgrades branch from 8146aa9 to c6cc54d on February 23, 2026 11:45
@dims dims merged commit f1411b6 into NVIDIA:main Feb 23, 2026
33 checks passed
dims referenced this pull request in dims/aicr Feb 25, 2026
The GPU Conformance Test (nvkind + H100 x2) workflow was created on
PR #180's branch but never merged to main. This adds it with an
updated schedule (08:45/20:45 UTC) to maintain a 2h15m gap from the
GPU Training Test (06:30/18:30 UTC), ensuring the two H100 x2 jobs
don't compete for the same runner.

Schedule layout (all 2x daily, 12h apart):
  - T4 Smoke:          06:00 / 18:00 UTC
  - H100 Inference:    06:15 / 18:15 UTC
  - H100 Training x2:  06:30 / 18:30 UTC
  - H100 Conformance:  08:45 / 20:45 UTC  (2h15m after training)

Aligned with current CI patterns:
  - gpu-snapshot-validate action instead of inline snapshot steps
  - Karpenter nodepool.yaml applied after install
  - load-versions + setup-build-tools for chainsaw install
  - Dockerfile.validator and missing action paths in path triggers
  - Step ordering and naming consistent with inference/training
  - Removed redundant DRA/gang pre-deploy steps that would exhaust
    GPU claim capacity before the self-contained conformance checks
    run inside aicr validate (introduced in PRs #184, #185)
