
feat(validator): add per-check isolation and external validator support #299

Closed
xdu31 wants to merge 2 commits into NVIDIA:main from xdu31:feat/validator-isolation

@xdu31 xdu31 commented Mar 6, 2026

Summary

Add per-check isolation and external validator support to the validation framework. Recipe authors can now run individual checks/constraints in their own Kubernetes Jobs for fault isolation, and bring their own OCI containers as external validators.

Motivation

Today all checks in a phase run in a single Job. A crash or resource leak in one check kills the entire phase. This PR adds a cascading isolated flag (individual > phase > top-level > default false) that lets recipe authors control execution granularity, plus a validators field for external OCI containers.

Recipe schema

validation:
  isolated: false                         # top-level default
  deployment:
    isolated: true                        # phase override
    checks:
      - expected-resources                # string shorthand (backward compatible)
      - name: heavy-gpu-test              # object form with overrides
        isolated: true
        timeout: 10m
    constraints:
      - name: Deployment.gpu-operator.version
        value: ">= v24.6.0"
        isolated: true
    validators:                           # external OCI containers
      - name: custom-check
        image: myregistry.io/check:v1.0
        timeout: 5m

Execution model

Phase: deployment
├── Tier 1: Shared Job (combined -run pattern for non-isolated items)
├── Tier 2: Isolated Jobs (individual -run pattern, same validator image)
└── Tier 3: External Jobs (user-provided image, exit-code protocol)
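
The "combined -run pattern" versus "individual -run pattern" distinction in the tiers above can be sketched as follows. This is a minimal illustration, assuming each check maps to one top-level Go test function; the test names and the `runPattern` helper are hypothetical and not taken from the PR. `go test -run` treats its argument as an unanchored regular expression, so the sketch anchors and escapes the names to match them exactly.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// runPattern builds an anchored regular expression for `go test -run`
// that matches exactly the given top-level test names.
func runPattern(testNames []string) string {
	quoted := make([]string, len(testNames))
	for i, name := range testNames {
		quoted[i] = regexp.QuoteMeta(name)
	}
	return "^(" + strings.Join(quoted, "|") + ")$"
}

func main() {
	// Hypothetical test names; the real check-to-test mapping is not shown in this PR.
	shared := []string{"TestExpectedResources", "TestOperatorHealth"}
	fmt.Println("shared Job:   go test -run", runPattern(shared))

	// Each isolated item gets its own Job with a single-name pattern.
	for _, name := range []string{"TestHeavyGPU"} {
		fmt.Println("isolated Job: go test -run", runPattern([]string{name}))
	}
}
```

Under this assumption, Tier 1 runs one pattern covering every non-isolated item, while Tier 2 runs one single-name pattern per Job.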

Changes

Recipe types (pkg/recipe/metadata.go)

  • Add CheckRef union type (string or object) with custom UnmarshalYAML
  • Add ExternalValidator type (name, image, timeout)
  • Add Isolated *bool and Timeout to Constraint, ValidationPhase, ValidationConfig
  • Change ValidationPhase.Checks from []string to []CheckRef (backward compatible)
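
The string-or-object union behind `CheckRef` can be sketched like this. The PR implements it with a custom `UnmarshalYAML`; to keep the example free of third-party dependencies, this sketch swaps in `encoding/json` and `UnmarshalJSON`, which follow the same "try the string form, then fall back to the object form" idea. The field set and the `parseChecks` helper are assumptions for illustration, not the PR's actual definitions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CheckRef mirrors the string-or-object shape from the recipe schema.
// The fields here are assumed, not copied from pkg/recipe/metadata.go.
type CheckRef struct {
	Name     string `json:"name"`
	Isolated *bool  `json:"isolated,omitempty"`
	Timeout  string `json:"timeout,omitempty"`
}

// UnmarshalJSON accepts either a bare string ("expected-resources")
// or an object ({"name": "...", "isolated": true}).
func (c *CheckRef) UnmarshalJSON(data []byte) error {
	var name string
	if err := json.Unmarshal(data, &name); err == nil {
		*c = CheckRef{Name: name}
		return nil
	}
	// The local type has no UnmarshalJSON method, which avoids infinite recursion.
	type plain CheckRef
	var p plain
	if err := json.Unmarshal(data, &p); err != nil {
		return fmt.Errorf("check must be a string or an object: %w", err)
	}
	if p.Name == "" {
		return fmt.Errorf("check object requires a name")
	}
	*c = CheckRef(p)
	return nil
}

// parseChecks is a hypothetical helper that decodes a list of checks.
func parseChecks(raw string) ([]CheckRef, error) {
	var checks []CheckRef
	err := json.Unmarshal([]byte(raw), &checks)
	return checks, err
}

func main() {
	checks, err := parseChecks(`["expected-resources", {"name": "heavy-gpu-test", "isolated": true, "timeout": "10m"}]`)
	if err != nil {
		panic(err)
	}
	for _, c := range checks {
		fmt.Println(c.Name, c.Isolated != nil && *c.Isolated, c.Timeout)
	}
}
```

Keeping `Isolated` as a pointer preserves the difference between "not set" and "explicitly false", which the cascading precedence depends on.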

Isolation logic (pkg/validator/isolation.go — new)

  • resolveIsolated(): cascading precedence (individual > phase > top-level > false)
  • partitionByIsolation(): splits checks/constraints into shared vs isolated groups
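
A minimal sketch of the cascade and the partitioning step, assuming each level stores its flag as a `*bool` where `nil` means "not set". The function names are borrowed from the PR, but the signatures, the `item` type, and the `quick-check` name are guesses for illustration.

```go
package main

import "fmt"

// item is a hypothetical stand-in for a check or constraint entry.
type item struct {
	Name     string
	Isolated *bool // nil means "not set at this level"
}

// resolveIsolated returns the first explicitly set value in precedence order:
// individual > phase > top-level, defaulting to false.
func resolveIsolated(individual, phase, topLevel *bool) bool {
	for _, v := range []*bool{individual, phase, topLevel} {
		if v != nil {
			return *v
		}
	}
	return false
}

// partitionByIsolation splits items into those that share one Job
// and those that each get their own Job.
func partitionByIsolation(items []item, phase, topLevel *bool) (shared, isolated []item) {
	for _, it := range items {
		if resolveIsolated(it.Isolated, phase, topLevel) {
			isolated = append(isolated, it)
		} else {
			shared = append(shared, it)
		}
	}
	return shared, isolated
}

func boolPtr(b bool) *bool { return &b }

func main() {
	items := []item{
		{Name: "expected-resources"},                      // inherits the phase setting
		{Name: "heavy-gpu-test", Isolated: boolPtr(true)}, // explicit override
		{Name: "quick-check", Isolated: boolPtr(false)},   // explicit opt-out
	}
	// Phase says isolated: true, top level says isolated: false.
	shared, isolated := partitionByIsolation(items, boolPtr(true), boolPtr(false))
	fmt.Println("shared:", len(shared), "isolated:", len(isolated))
}
```

An explicit `false` on an item still wins over a phase-level `true`, which is why a plain `bool` would not be enough here.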

Phase runners (pkg/validator/phases_runners.go)

  • Extract executePhaseChecks() — central 3-tier orchestrator replacing duplicated logic in deployment/performance/conformance runners
  • Add runExternalJob() for external validator Jobs
  • Add helpers: sanitizeLabelValue, resolveItemTimeout, baseJobConfig
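
One way a helper like `sanitizeLabelValue` could work is sketched below. The name comes from the PR, but this body is an assumption based only on the Kubernetes label value rules: at most 63 characters, alphanumerics plus `-`, `_`, and `.`, and an alphanumeric first and last character.

```go
package main

import (
	"fmt"
	"strings"
)

// maxLabelValueLen is the Kubernetes limit for a label value.
const maxLabelValueLen = 63

func isAlnum(r rune) bool {
	return (r >= 'a' && r <= 'z') || (r >= 'A' && r <= 'Z') || (r >= '0' && r <= '9')
}

// sanitizeLabelValue replaces disallowed characters with '-', truncates to
// 63 characters, and trims non-alphanumeric characters from both ends.
// The result may be empty, which is itself a valid label value.
func sanitizeLabelValue(s string) string {
	mapped := strings.Map(func(r rune) rune {
		if isAlnum(r) || r == '-' || r == '_' || r == '.' {
			return r
		}
		return '-'
	}, s)
	// After mapping, every character is ASCII, so byte slicing is safe.
	if len(mapped) > maxLabelValueLen {
		mapped = mapped[:maxLabelValueLen]
	}
	return strings.TrimFunc(mapped, func(r rune) bool { return !isAlnum(r) })
}

func main() {
	fmt.Println(sanitizeLabelValue("Deployment.gpu-operator.version"))
	fmt.Println(sanitizeLabelValue("my check: v1/beta!"))
}
```

Truncating before trimming matters: cutting at 63 characters can leave a trailing `-` or `.`, which would make the value invalid again.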

Job agent (pkg/validator/agent/)

  • Add Labels map[string]string to Config — structured pod labels (run-id, phase, tier, check/constraint/validator)
  • Add ExternalCommand flag — omits go test command, uses container ENTRYPOINT
  • Add TerminationMessageFallbackToLogsOnError policy for external Jobs
  • Merge config labels into pod template alongside base app.kubernetes.io/* labels
  • Use batch.kubernetes.io/job-name for pod lookup (Kubernetes-native, no custom label needed)
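
The label merging and the Job-name pod lookup can be sketched with plain maps and selector strings. This is an illustration only: `mergeLabels`, `selector`, and `jobPodSelector` are hypothetical helpers, the `app.kubernetes.io/name` value is made up, and letting base labels win on conflict is an assumption the PR does not state. The `batch.kubernetes.io/job-name` label itself is set by the Kubernetes Job controller on the pods it creates.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// mergeLabels combines base and extra labels into a new map.
// Assumption: base labels win when both maps define the same key.
func mergeLabels(base, extra map[string]string) map[string]string {
	out := make(map[string]string, len(base)+len(extra))
	for k, v := range extra {
		out[k] = v
	}
	for k, v := range base {
		out[k] = v
	}
	return out
}

// selector renders labels as a sorted, comma-separated equality selector,
// the same form accepted by `kubectl get pods -l`.
func selector(labels map[string]string) string {
	pairs := make([]string, 0, len(labels))
	for k, v := range labels {
		pairs = append(pairs, k+"="+v)
	}
	sort.Strings(pairs)
	return strings.Join(pairs, ",")
}

// jobPodSelector selects the pods created for a given Job.
func jobPodSelector(jobName string) string {
	return "batch.kubernetes.io/job-name=" + jobName
}

func main() {
	base := map[string]string{"app.kubernetes.io/name": "aicr-validator"} // hypothetical value
	extra := map[string]string{
		"aicr.nvidia.com/phase": "deployment",
		"aicr.nvidia.com/tier":  "isolated",
		"aicr.nvidia.com/check": "expected-resources",
	}
	fmt.Println(selector(mergeLabels(base, extra)))
	fmt.Println(jobPodSelector("example-job")) // hypothetical Job name
}
```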

RunID (pkg/validator/validator.go)

  • Shorten from 16 hex chars to 4 hex chars (YYYYMMDD-HHMMSS-XXXX, 20 chars total)
  • Keeps derived Job names within the Kubernetes 63-character label value limit
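
A sketch of how a 20-character RunID in the `YYYYMMDD-HHMMSS-XXXX` format could be generated. The helper names, the use of UTC, and the use of `crypto/rand` are assumptions; the PR only states the format and the length.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"time"
)

// formatRunID renders a timestamp and two random bytes as
// YYYYMMDD-HHMMSS-XXXX: 8 + 1 + 6 + 1 + 4 = 20 characters.
func formatRunID(now time.Time, suffix [2]byte) string {
	return now.UTC().Format("20060102-150405") + "-" + hex.EncodeToString(suffix[:])
}

// newRunID draws two random bytes (four hex characters) and formats them
// together with the current time.
func newRunID() (string, error) {
	var suffix [2]byte
	if _, err := rand.Read(suffix[:]); err != nil {
		return "", fmt.Errorf("read random bytes: %w", err)
	}
	return formatRunID(time.Now(), suffix), nil
}

func main() {
	id, err := newRunID()
	if err != nil {
		panic(err)
	}
	fmt.Println(id, len(id))
}
```

Four hex characters give 65,536 possible suffixes per second, so the shorter ID trades some collision resistance for name length.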

Results (pkg/validator/result.go)

  • Add Source field (shared, isolated, external) to CheckResult

Defaults (pkg/defaults/timeouts.go)

  • Add ExternalValidatorTimeout (10m) and ExternalValidatorLogTailLines (10)
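
How a per-item `timeout` such as `5m` might fall back to this default can be sketched as follows. It assumes timeouts are written as Go duration strings; `resolveItemTimeout` is a name from the PR, but the signature and behavior shown here are guesses.

```go
package main

import (
	"fmt"
	"time"
)

// externalValidatorTimeout mirrors the 10m default named in the PR.
const externalValidatorTimeout = 10 * time.Minute

// resolveItemTimeout parses a per-item timeout such as "5m" and falls back
// to the supplied default when the item does not set one.
func resolveItemTimeout(raw string, fallback time.Duration) (time.Duration, error) {
	if raw == "" {
		return fallback, nil
	}
	d, err := time.ParseDuration(raw)
	if err != nil {
		return 0, fmt.Errorf("invalid timeout %q: %w", raw, err)
	}
	if d <= 0 {
		return 0, fmt.Errorf("timeout must be positive, got %s", d)
	}
	return d, nil
}

func main() {
	for _, raw := range []string{"", "5m", "soon"} {
		d, err := resolveItemTimeout(raw, externalValidatorTimeout)
		fmt.Printf("%q -> %v (err: %v)\n", raw, d, err)
	}
}
```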

Demo (demos/isolation/)

  • End-to-end walkthrough on Kind cluster exercising all 3 tiers
  • External validator example: DNS check (Dockerfile + check.sh)
  • Captured output showing structured pod labels and result YAML
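
For comparison with the shell-based demo, an external validator could also be a small Go binary. The sketch below assumes the exit-code protocol means "exit 0 on pass, non-zero on failure", which the PR does not spell out. The default hostname and the message format are also assumptions. Writing the result to stderr pairs with the `FallbackToLogsOnError` termination message policy, which surfaces the tail of the container log when a container exits with an error.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
)

// lookupFunc matches net.LookupHost so a fake resolver can be substituted.
type lookupFunc func(host string) ([]string, error)

var errNoAddresses = errors.New("lookup returned no addresses")

// runCheck resolves host and returns an exit code plus a one-line message.
// Assumed protocol: 0 means the check passed, 1 means it failed.
func runCheck(host string, lookup lookupFunc) (int, string) {
	addrs, err := lookup(host)
	if err == nil && len(addrs) == 0 {
		err = errNoAddresses
	}
	if err != nil {
		return 1, fmt.Sprintf("FAIL: lookup %s: %v", host, err)
	}
	return 0, fmt.Sprintf("PASS: %s resolves to %v", host, addrs)
}

func main() {
	// Assumed default target: the in-cluster API service name.
	host := "kubernetes.default.svc.cluster.local"
	if len(os.Args) > 1 {
		host = os.Args[1]
	}
	code, msg := runCheck(host, net.LookupHost)
	fmt.Fprintln(os.Stderr, msg)
	os.Exit(code)
}
```

Injecting the resolver keeps the pass/fail logic testable without a cluster or network access.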

Structured pod labels

Each validation pod gets structured labels for querying:

| Label | Tier 1 (shared) | Tier 2 (isolated) | Tier 3 (external) |
|---|---|---|---|
| aicr.nvidia.com/run-id | run ID | run ID | run ID |
| aicr.nvidia.com/phase | phase name | phase name | phase name |
| aicr.nvidia.com/tier | shared | isolated | external |
| aicr.nvidia.com/check | - | check name | - |
| aicr.nvidia.com/constraint | - | constraint name | - |
| aicr.nvidia.com/validator | - | - | validator name |

Example queries:

kubectl get pods -l aicr.nvidia.com/tier=isolated -n aicr-validation
kubectl get pods -l aicr.nvidia.com/check=expected-resources -n aicr-validation

Files changed (24 files, +1702/−367)

| File | Change |
|---|---|
| pkg/recipe/metadata.go | CheckRef, ExternalValidator types, Isolated fields |
| pkg/recipe/metadata_test.go | CheckRef YAML deserialization tests |
| pkg/recipe/conformance_test.go | Updated for CheckRef |
| pkg/validator/isolation.go | New: isolation resolution + partitioning |
| pkg/validator/isolation_test.go | New: 20 test cases |
| pkg/validator/phases_runners.go | 3-tier orchestrator, external validator support |
| pkg/validator/phases_test.go | Isolation, sanitization, timeout, pattern tests |
| pkg/validator/result.go | Source field, CheckSource* constants |
| pkg/validator/validator.go | RunID shortened to 4 hex chars |
| pkg/validator/validator_test.go | Updated RunID format assertion |
| pkg/validator/agent/types.go | Labels, ExternalCommand, Cleanup fields |
| pkg/validator/agent/job.go | Label merging, external command mode |
| pkg/validator/agent/job_test.go | External/internal/label tests |
| pkg/validator/agent/wait.go | batch.kubernetes.io/job-name pod lookup |
| pkg/validator/agent/wait_test.go | Updated pod label assertions |
| pkg/validator/checks/runner.go | CheckRef compatibility |
| pkg/validator/checks/runner_test.go | Updated for CheckRef |
| pkg/validator/README.md | Updated RunID format |
| pkg/defaults/timeouts.go | External validator constants |
| demos/isolation/README.md | New: end-to-end walkthrough |
| demos/isolation/recipe.yaml | New: mixed isolation demo recipe |
| demos/isolation/result.yaml | New: captured result output |
| demos/isolation/external-validator/Dockerfile | New: minimal BYO validator |
| demos/isolation/external-validator/check.sh | New: DNS check script |

Test plan

  • make test (unit tests with race detector)
  • make lint (0 issues)
  • Kind cluster demo — all 3 tiers pass, structured labels verified on pods
  • Backward compatibility — string-form checks: ["expected-resources"] still works

@xdu31 xdu31 requested a review from a team as a code owner March 6, 2026 06:55
@xdu31 xdu31 marked this pull request as draft March 6, 2026 06:55

@mchmarny mchmarny left a comment

This PR appears to overlap significantly with #290. Is there any reason we wouldn't drive consensus there before writing more code? I think we should resolve the architectural direction before merging this. I will comment on #290 with more context.

@xdu31 xdu31 closed this Mar 6, 2026