refactor(validator): remove Job-based checks from readiness phase, keep constraint-only gate #195

Merged: mchmarny merged 5 commits into NVIDIA:main from xdu31:feat/cleanup-readiness on Feb 24, 2026

xdu31 (Contributor) commented Feb 23, 2026

Summary

Remove Job-based checks from the readiness validation phase, making it a fast, offline, constraint-only gate. Readiness now evaluates recipe constraints inline against snapshot data (K8s version, OS, kernel) with no cluster access, no Kubernetes Jobs, and no pre-compiled test binaries.

Motivation / Context

The readiness phase previously did two things:

  1. Evaluated recipe constraints inline against the snapshot
  2. Deployed Kubernetes Jobs for cluster checks (gpu-hardware-detection, kernel-parameters, os-prerequisites)

The Job-based checks added complexity (pre-compiled test binary, Dockerfile build step, RBAC) without value — GPU detection and similar hardware checks are better handled by the deployment phase or constraints. Removing the Job infrastructure from readiness makes it a fast, offline gate that runs before any cluster interaction.
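As a rough illustration of what an offline, constraint-only gate can look like, here is a minimal sketch. The `Snapshot` and `Constraint` types, the key names, and the two operators are assumptions made for this example and are not taken from `pkg/validator`; the version comparison only handles dotted numeric versions.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Snapshot is a hypothetical stand-in for captured facts (K8s version, OS, kernel).
type Snapshot map[string]string

// Constraint is a hypothetical recipe constraint, e.g. k8s.version >= 1.30.
type Constraint struct {
	Key  string
	Op   string
	Want string
}

// compareVersions compares dotted numeric versions and returns -1, 0, or 1.
// Missing trailing components are treated as 0, so "1.30" equals "1.30.0".
func compareVersions(a, b string) (int, error) {
	as, bs := strings.Split(a, "."), strings.Split(b, ".")
	for i := 0; i < len(as) || i < len(bs); i++ {
		x, y := 0, 0
		var err error
		if i < len(as) {
			if x, err = strconv.Atoi(as[i]); err != nil {
				return 0, fmt.Errorf("invalid version %q: %w", a, err)
			}
		}
		if i < len(bs) {
			if y, err = strconv.Atoi(bs[i]); err != nil {
				return 0, fmt.Errorf("invalid version %q: %w", b, err)
			}
		}
		if x < y {
			return -1, nil
		}
		if x > y {
			return 1, nil
		}
	}
	return 0, nil
}

// Check evaluates one constraint against the snapshot only; it never touches a cluster.
func (c Constraint) Check(s Snapshot) (bool, error) {
	got, ok := s[c.Key]
	if !ok {
		return false, fmt.Errorf("snapshot has no value for %q", c.Key)
	}
	switch c.Op {
	case "==":
		return got == c.Want, nil
	case ">=":
		n, err := compareVersions(got, c.Want)
		if err != nil {
			return false, err
		}
		return n >= 0, nil
	default:
		return false, fmt.Errorf("unsupported operator %q", c.Op)
	}
}

func main() {
	snap := Snapshot{"k8s.version": "1.31.2", "os.name": "ubuntu"}
	constraints := []Constraint{
		{Key: "k8s.version", Op: ">=", Want: "1.30"},
		{Key: "os.name", Op: "==", Want: "ubuntu"},
	}
	for _, c := range constraints {
		ok, err := c.Check(snap)
		fmt.Println(c.Key, c.Op, c.Want, "->", ok, err)
	}
}
```

Because every input comes from the snapshot, a gate of this shape can run before any kubeconfig or cluster connection exists.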

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other

Implementation Notes

Deleted: pkg/validator/checks/readiness/

  • gpu_detection.go — the only readiness check implementation
  • gpu_detection_test.go — its tests

Modified: pkg/validator/phases.go

  • Stripped validateReadiness() of the entire Job deployment block (~60 lines removed)
  • Changed ctx context.Context parameter to _ context.Context (unused without Jobs)
  • Removed Checks: []CheckResult{} from PhaseResult initialization
  • Simplified status determination to only consider constraints
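The shape of the simplified function might look roughly like the sketch below. Only the names `validateReadiness`, `PhaseResult`, and the `_ context.Context` parameter come from this PR; the `ConstraintResult` type, the field names, and the `"pass"`/`"fail"` status strings are assumptions for illustration.

```go
package main

import (
	"context"
	"fmt"
)

// ConstraintResult is a hypothetical record of one evaluated constraint.
type ConstraintResult struct {
	Name   string
	Passed bool
}

// PhaseResult here carries only constraint results; no Checks slice is
// initialized for readiness. The field names are assumed.
type PhaseResult struct {
	Status      string
	Constraints []ConstraintResult
}

// validateReadiness derives the phase status from constraints alone.
// The context is kept for signature compatibility but is unused, hence "_".
func validateReadiness(_ context.Context, results []ConstraintResult) PhaseResult {
	res := PhaseResult{Status: "pass", Constraints: results}
	for _, r := range results {
		if !r.Passed {
			res.Status = "fail" // any failed constraint fails the phase
			break
		}
	}
	return res
}

func main() {
	res := validateReadiness(context.Background(), []ConstraintResult{
		{Name: "k8s.version >= 1.30", Passed: true},
		{Name: "os.name == ubuntu", Passed: false},
	})
	fmt.Println(res.Status)
}
```

Keeping the context parameter, even unused, lets the function keep the same signature as the other phase validators, which is presumably why it was renamed to `_` and not dropped.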

Modified: Dockerfile.validator

  • Removed readiness.test pre-compilation from build stage
  • Removed COPY readiness.test from final image

Modified: pkg/validator/checks/runner.go

  • Removed "readiness" case from HasCheck() switch
  • Updated doc examples from readiness to deployment

Modified: pkg/validator/checks/registration_test.go

  • Removed readiness side-effect import
  • Removed "readiness" from phaseDirs

Modified: Test files

  • phases_test.go — simplified TestValidateReadiness to constraint-only assertions
  • runner_test.go — removed readiness test cases from TestTestRunner_HasCheck
  • registry_test.go — replaced readiness mock data with deployment/conformance entries
  • metadata_test.go — replaced os-prerequisites check with constraint in merge test
  • builder_test.go — extracted goconst lint fix for repeated string

Modified: Documentation

  • DEVELOPMENT.md — updated readiness phase description
  • docs/user/cli-reference.md — updated phase table and output examples (removed readiness checks)
  • docs/integrator/recipe-development.md — removed readiness checks from validation example
  • examples/recipes/eks-gb200-ubuntu-training-with-validation.yaml — removed readiness checks block
  • pkg/validator/README.md — updated phase table, execution diagram, examples, design rationale
  • pkg/validator/checks/README.md — updated phase table, directory structure, naming conventions, examples
  • pkg/validator/agent/README.md — updated RBAC, usage examples

Behavioral Changes

| Aspect | Before | After |
| --- | --- | --- |
| Readiness checks | Deployed as K8s Jobs (gpu-hardware-detection, kernel-parameters, os-prerequisites) | None (constraint-only) |
| Readiness cluster access | Required (for Job deployment) | Not required |
| readiness.test binary | Pre-compiled and shipped in image | Removed |
| ctx parameter | Used for Job deployment | Unused (_) |
| PhaseResult.Checks | Empty slice for readiness | Not set |

Testing

# Unit tests — all pass with race detector
go test -race ./pkg/validator/... -count=1
go test -race ./pkg/recipe/... -count=1

# Lint — 0 issues
golangci-lint run ./pkg/recipe/... ./pkg/validator/...

Risk Assessment

  • Low — Removes unused infrastructure, no interface changes, readiness checks were never configured in production recipes
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes: Any recipe overlay that defines validation.readiness.checks will have those checks silently ignored at runtime (the Checks field still exists on the shared ValidationPhase struct but validateReadiness() no longer reads it). No existing overlays define readiness checks.
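Since the checks are ignored silently, an overlay author could audit for them before upgrading. The sketch below is not part of this PR: `ignoredReadinessChecks` is a hypothetical helper, and the `ValidationPhase` shape shown (string slices for `Checks` and `Constraints`) is an assumption; only the struct name and the existence of a `Checks` field come from the note above.

```go
package main

import "fmt"

// ValidationPhase is an assumed, simplified shape of the shared struct.
type ValidationPhase struct {
	Checks      []string
	Constraints []string
}

// ignoredReadinessChecks returns the check names a constraint-only
// readiness phase would never run. A nil phase yields no names.
func ignoredReadinessChecks(readiness *ValidationPhase) []string {
	if readiness == nil {
		return nil
	}
	return readiness.Checks
}

func main() {
	phase := &ValidationPhase{
		Checks:      []string{"gpu-hardware-detection"},
		Constraints: []string{"k8s.version >= 1.30"},
	}
	for _, name := range ignoredReadinessChecks(phase) {
		fmt.Printf("warning: readiness check %q is defined but will not run\n", name)
	}
}
```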

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

mchmarny (Member) left a comment

/lgtm

mchmarny added the enhancement label Feb 24, 2026
mchmarny merged commit 46602b1 into NVIDIA:main Feb 24, 2026
9 of 12 checks passed