
feat(validator): upgrade conformance checks from static to behavioral validation#185

Merged
dims merged 1 commit into NVIDIA:main from dims:behavioral-conformance-upgrades
Feb 23, 2026

Conversation

Collaborator

@dims dims commented Feb 23, 2026

Summary

  • Upgrades 5 CNCF AI Conformance v1.34 checks from static/presence validation to behavioral/functional testing, following the self-contained pattern from secure_access_check.go (create resources → wait for behavior → validate → cleanup)
  • Each check now proves the validated system actually works, not just that its resources exist

Checks upgraded

| Check | Previously validated | Now additionally validated |
| --- | --- | --- |
| secure-accelerator-access | DRA pod succeeds with ResourceClaim | Negative isolation: a pod without claims cannot access GPU devices |
| robust-controller | Dynamo webhook + CRDs exist | Webhook actually rejects an invalid DynamoGraphDeployment |
| inference-gateway | GatewayClass accepted, Gateway programmed, CRDs exist | Data-plane readiness: inference-gateway proxy EndpointSlices have ready endpoints |
| pod-autoscaling | Custom/external metrics APIs have GPU data | HPA reads metrics, computes scale-up (desiredReplicas > currentReplicas), Deployment actually scales |
| cluster-autoscaling | Karpenter deployed, GPU NodePool exists | Full chain: HPA → scale → Karpenter provisions KWOK nodes → pods scheduled |
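The strict scale-up criterion used by the pod-autoscaling check can be illustrated with a minimal sketch. The `hpaStatus` struct below is a stand-in for the two autoscaling/v2 status fields involved; a real check would read them from the API server:

```go
package main

import "fmt"

// hpaStatus mirrors just the two HorizontalPodAutoscaler status fields
// the strict criterion needs; illustrative, not the PR's actual type.
type hpaStatus struct {
	CurrentReplicas int32
	DesiredReplicas int32
}

// wantsScaleUp applies the strict criterion: only an actual
// desiredReplicas > currentReplicas counts as scale-up intent.
// Conditions such as ScalingActive=True are deliberately ignored,
// since they can hold without any intent to add replicas.
func wantsScaleUp(s hpaStatus) bool {
	return s.DesiredReplicas > s.CurrentReplicas
}

func main() {
	fmt.Println(wantsScaleUp(hpaStatus{CurrentReplicas: 1, DesiredReplicas: 2})) // scale-up intent
	fmt.Println(wantsScaleUp(hpaStatus{CurrentReplicas: 2, DesiredReplicas: 2})) // steady state
}
```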

Key design decisions

  • Autoscaling checks are NOT KWOK-specific — they use external metrics (dcgm_gpu_power_usage) which are cluster-wide, working on any cluster with DCGM + prometheus-adapter
  • HPA strict criterion: desiredReplicas > currentReplicas only (not ScalingActive=True which can be true without scale intent)
  • Unique per-run namespaces with random suffix prevent cross-run interference from async cleanup
  • Namespace cleanup uses context.Background() with bounded timeout so cleanup runs even when parent context is canceled
  • Cluster-autoscaling discovers all GPU NodePools and tries each in the behavioral chain (first success wins)
  • Removed redundant --exercise steps from CI workflows (behavioral validation now runs inside aicr validate)

Test plan

  • All 42 packages pass with the -race detector
  • New test cases: multi-NodePool discovery, non-gateway EndpointSlice filtering, deployment scale verification, pod scheduling poll loop, namespace cleanup on canceled context
  • GPU CI workflows pass end-to-end (inference + training)

@dims dims requested review from a team as code owners February 23, 2026 04:52
@dims dims force-pushed the behavioral-conformance-upgrades branch 4 times, most recently from 8d069cf to 8146aa9 on February 23, 2026 11:34
… validation

Upgrade 5 conformance checks from static/presence validation to
behavioral/functional testing per CNCF AI Conformance v1.34 section 11.1:

- cluster-autoscaling: Full behavioral chain — HPA scaling intent,
  Karpenter KWOK node provisioning, pod scheduling verification.
  Discovers GPU NodePools dynamically, tries each until one succeeds.
- pod-autoscaling: HPA reads external GPU metrics (dcgm_gpu_power_usage)
  and computes scale-up, Deployment actually scales replicas.
- inference-gateway: Data-plane readiness via EndpointSlice verification,
  HTTPRoute discovery (informational).
- robust-controller: Webhook rejection test — creates invalid
  DynamoGraphDeployment, verifies admission webhook rejects it.
- secure-accelerator-access: Negative isolation test — pod without
  ResourceClaims cannot access GPU devices.

Removes kwok/scripts/validate-cluster-autoscaling.sh (setup logic inlined
in CI workflows, exercise logic replaced by Go conformance check).
@dims dims force-pushed the behavioral-conformance-upgrades branch from 8146aa9 to c6cc54d on February 23, 2026 11:45
@dims dims merged commit f1411b6 into NVIDIA:main Feb 23, 2026
33 checks passed
dims referenced this pull request in dims/aicr Feb 25, 2026
The GPU Conformance Test (nvkind + H100 x2) workflow was created on
PR #180's branch but never merged to main. This adds it with an
updated schedule (08:45/20:45 UTC) to maintain a 2h15m gap from the
GPU Training Test (06:30/18:30 UTC), ensuring the two H100 x2 jobs
don't compete for the same runner.

Schedule layout (all 2x daily, 12h apart):
  - T4 Smoke:          06:00 / 18:00 UTC
  - H100 Inference:    06:15 / 18:15 UTC
  - H100 Training x2:  06:30 / 18:30 UTC
  - H100 Conformance:  08:45 / 20:45 UTC  (2h15m after training)

Aligned with current CI patterns:
  - gpu-snapshot-validate action instead of inline snapshot steps
  - Karpenter nodepool.yaml applied after install
  - load-versions + setup-build-tools for chainsaw install
  - Dockerfile.validator and missing action paths in path triggers
  - Step ordering and naming consistent with inference/training
  - Removed redundant DRA/gang pre-deploy steps that would exhaust
    GPU claim capacity before the self-contained conformance checks
    run inside aicr validate (introduced in PRs #184, #185)
