
feat(conformance): capture observed state in evidence artifacts#204

Merged
dims merged 1 commit into NVIDIA:main from dims:dims/worktree-conformance-observed-artifacts
Feb 25, 2026


dims (Collaborator) commented Feb 24, 2026

Summary

  • Replace hardcoded congratulatory strings in conformance evidence artifacts with actual observed cluster state
  • Each behavioral check now returns a typed report struct capturing real values (HPA desired/current replicas, node counts, scheduling timestamps, webhook rejection codes, etc.)
  • Fixes the flaky TestSecureAcceleratorAccess failure on GPU Inference CI (previously timed out at ~309s, now completes in ~12s)
  • Deduplicates the HPA scaling-intent poll into a shared helper

Changes

  • cluster_autoscaling_check.go — capture HPA desired/current, baseline/observed node counts, scheduled/total pods
  • pod_autoscaling_check.go — capture HPA desired/current, scale-up/down deployment replicas
  • robust_controller_check.go — capture webhook rejection HTTP code, reason, and message
  • secure_access_check.go — capture isolation pod name, phase, exit code, resource claims count; pin no-claim pod to GPU node via NodeName; add stuck-pod detection and rate limiter handling
  • gang_scheduling_check.go — capture scheduling timestamps, co-schedule span
  • inference_gateway_check.go — capture listener routes, HTTPRoute count, endpoint counts, condition details
  • ai_service_metrics_check.go — capture HTTP status, group version, API resource count
  • helpers.go — add podStuckReason(), podWaitingStatus(), and shared waitForHPAScaleUp() helpers
  • Unit tests updated to verify report struct values
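
As a rough sketch of the stuck-pod detection mentioned above, the following shows one way a `podStuckReason()` style helper could work. It is not the PR's implementation: `podStatus`, `containerState`, and `podCondition` are simplified, hypothetical stand-ins for the Kubernetes core/v1 types so that the example runs with the standard library alone.

```go
package main

import "fmt"

// containerState, podCondition, and podStatus are simplified, hypothetical
// stand-ins for the Kubernetes core/v1 pod status types.
type containerState struct {
	Name          string
	WaitingReason string // e.g. "ImagePullBackOff"; empty when not waiting
}

type podCondition struct {
	Type   string
	Status string
	Reason string
}

type podStatus struct {
	Phase      string
	Containers []containerState
	Conditions []podCondition
}

// podStuckReason returns a non-empty description when the pod is in a state
// that will not resolve by waiting longer, so a poll loop can fail fast.
// It returns "" when the pod may still make progress.
func podStuckReason(s podStatus) string {
	for _, c := range s.Containers {
		switch c.WaitingReason {
		case "ImagePullBackOff", "CrashLoopBackOff":
			return fmt.Sprintf("container %s is in %s", c.Name, c.WaitingReason)
		}
	}
	for _, cond := range s.Conditions {
		if cond.Type == "PodScheduled" && cond.Status == "False" && cond.Reason == "Unschedulable" {
			return "pod is Unschedulable"
		}
	}
	return ""
}

func main() {
	stuck := podStatus{
		Phase:      "Pending",
		Containers: []containerState{{Name: "probe", WaitingReason: "ImagePullBackOff"}},
	}
	fmt.Println(podStuckReason(stuck))
}
```

Checking for these states inside the wait loop lets a test fail within seconds with a specific reason, rather than running until the overall timeout expires.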

Test plan

  • go test -race ./pkg/validator/... passes locally
  • CI: Unit, Integration, Conformance, E2E tests pass
  • CI: GPU Training Test (nvkind + H100 x2) passes
  • CI: GPU Inference Test (nvkind + H100) passes — TestSecureAcceleratorAccess completes in 12s

@dims dims requested a review from a team as a code owner February 24, 2026 17:43
@dims dims changed the title feat(conformance): capture observed state in evidence artifacts [DO-NOT-MERGE] feat(conformance): capture observed state in evidence artifacts Feb 24, 2026
@dims dims changed the title [DO-NOT-MERGE] feat(conformance): capture observed state in evidence artifacts feat(conformance): capture observed state in evidence artifacts Feb 24, 2026
@dims dims force-pushed the dims/worktree-conformance-observed-artifacts branch 3 times, most recently from 090e287 to 2cc21ca Compare February 24, 2026 22:26
@dims dims requested a review from a team as a code owner February 24, 2026 22:26
@dims dims force-pushed the dims/worktree-conformance-observed-artifacts branch from 1f3188a to 9716d10 Compare February 24, 2026 22:47
Replace hardcoded congratulatory strings in conformance evidence
artifacts with actual observed cluster state. Each behavioral check
now returns a typed report struct capturing real values (HPA
desired/current replicas, node counts, scheduling timestamps,
webhook rejection codes, etc.).

Also fixes TestSecureAcceleratorAccess flaky failure on GPU Inference
CI by:
- Pinning the no-claim isolation pod to the GPU node via NodeName,
  ensuring isolation is proven on a node that actually has GPUs and
  bypassing scheduler-level delays
- Adding podStuckReason() helper for fast failure on ImagePullBackOff,
  CrashLoopBackOff, and Unschedulable states
- Treating K8s client rate limiter errors as retriable
- Pinning busybox image to 1.37 (matching HPA tests)
- Adding diagnostic output (phase, container status, node) to timeout
  error messages

Deduplicates waitForHPAScalingIntent / waitForClusterAutoHPAScale
into a shared waitForHPAScaleUp helper in helpers.go.
@dims dims force-pushed the dims/worktree-conformance-observed-artifacts branch from 8bf599c to 91919ca Compare February 24, 2026 23:55
dims (Collaborator, Author) commented Feb 25, 2026

Fixing comments from @mchmarny in #201 (comment)

@dims dims merged commit dadaafb into NVIDIA:main Feb 25, 2026
15 checks passed