feat: add expected-resources deployment check for validating Kubernetes resources exist #149

Merged
mchmarny merged 5 commits into NVIDIA:main from xdu31:feat/deployment-check
Feb 19, 2026

Conversation

@xdu31
Contributor

@xdu31 xdu31 commented Feb 19, 2026

Summary

Add expected-resources deployment check that validates Kubernetes resources (Deployments, DaemonSets, StatefulSets) declared in recipe componentRefs[].expectedResources actually exist in the cluster.

Motivation / Context

Previously, there was no way to verify that critical resources like gpu-operator Deployments and DaemonSets were actually present in the cluster after deployment. This check closes that gap by validating expected resources exist during the deployment validation phase.

Fixes: N/A
Related: N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/eidos, pkg/cli)
  • API server (cmd/eidosd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: ____________

Implementation Notes

  • expected_resources_check.go: Implements the check by reading expectedResources from recipe componentRefs and verifying each resource exists in the cluster via the Kubernetes API. Supports Deployment, DaemonSet, StatefulSet, Service, ConfigMap, and Secret kinds.
  • phases.go: Wires expected-resources into the deployment phase check registry and passes recipe data to the validator Job via environment variables.
  • metadata_store.go: Bug fix — initBaseMergedSpec() was not copying Validation config from the base overlay, causing generated recipes to silently drop validation sections.
  • recipes/overlays/base.yaml: Adds expectedResources on gpu-operator (3 core resources) and validation.deployment.checks: [expected-resources] so all recipes inherit deployment validation.
  • CUJ1 e2e test: Updated assertion from deployment: skipped to deployment: pass with expected-resources check.
  • tests/e2e/run.sh: Added test_validate_expected_resources() for live cluster testing (pass and fail scenarios).
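
The core of the check described in these notes can be sketched roughly as follows. This is a minimal illustration, not the code in expected_resources_check.go: the `ExpectedResource` struct, the `Getter` function type, the in-memory `fakeCluster`, the `gpu-operator` namespace, and the `example-daemonset` name are all assumptions made so the example runs without a cluster. A real implementation would perform the lookup through the Kubernetes API instead.

```go
package main

import (
	"errors"
	"fmt"
)

// ExpectedResource is a hypothetical shape for one entry under
// componentRefs[].expectedResources; the real field names may differ.
type ExpectedResource struct {
	Kind      string
	Namespace string
	Name      string
}

// errNotFound stands in for a Kubernetes "not found" API error.
var errNotFound = errors.New("not found")

// Getter abstracts the cluster lookup so the sketch runs without a cluster.
type Getter func(kind, namespace, name string) error

// supportedKinds lists the workload kinds named in the PR summary.
var supportedKinds = map[string]bool{
	"Deployment": true, "DaemonSet": true, "StatefulSet": true,
}

// CheckExpectedResources returns one message per resource that is
// missing or of an unsupported kind. An empty result means the check passes.
func CheckExpectedResources(resources []ExpectedResource, get Getter) []string {
	var failures []string
	for _, r := range resources {
		if !supportedKinds[r.Kind] {
			failures = append(failures,
				fmt.Sprintf("%s %s/%s: unsupported kind", r.Kind, r.Namespace, r.Name))
			continue
		}
		if err := get(r.Kind, r.Namespace, r.Name); err != nil {
			failures = append(failures,
				fmt.Sprintf("%s %s/%s: %v", r.Kind, r.Namespace, r.Name, err))
		}
	}
	return failures
}

// fakeCluster returns a Getter backed by an in-memory set of
// "kind/namespace/name" keys (illustration only).
func fakeCluster(existing ...string) Getter {
	set := make(map[string]bool, len(existing))
	for _, k := range existing {
		set[k] = true
	}
	return func(kind, namespace, name string) error {
		if set[kind+"/"+namespace+"/"+name] {
			return nil
		}
		return errNotFound
	}
}

func main() {
	get := fakeCluster("Deployment/gpu-operator/gpu-operator")
	failures := CheckExpectedResources([]ExpectedResource{
		{Kind: "Deployment", Namespace: "gpu-operator", Name: "gpu-operator"},
		{Kind: "DaemonSet", Namespace: "gpu-operator", Name: "example-daemonset"},
	}, get)
	fmt.Println(failures)
}
```

Collecting every failure instead of returning on the first one matches the "mixed results" case mentioned under Testing, where some resources exist and others do not.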

Testing

# Commands run
make qualify
  • Unit tests: 425-line table-driven test covering all supported kinds, missing resources, mixed results, empty inputs, namespace handling
  • Chainsaw CLI e2e: 11/11 tests pass (CUJ1 asserts expected-resources check in multiphase validation)
  • Live cluster (Kind): Verified pass case (resource exists) and fail case (missing resource) with locally-built validator image

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes: Non-breaking. Existing recipes without expectedResources or validation config are unaffected. The check only runs when both are present in the recipe.
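
The gating rule in the rollout notes can be illustrated with a small sketch. The `Recipe` struct and its fields below are hypothetical simplifications of the real recipe types, used only to show the "both must be present" condition.

```go
package main

import "fmt"

// Recipe is a hypothetical, flattened view of a recipe: the expected
// resources gathered from all componentRefs and the deployment-phase checks.
type Recipe struct {
	ExpectedResources []string // e.g. "Deployment/gpu-operator/gpu-operator"
	DeploymentChecks  []string // e.g. "expected-resources"
}

// shouldRunExpectedResources reports whether the check applies: the recipe
// must both list the check and declare at least one expected resource.
func shouldRunExpectedResources(r Recipe) bool {
	if len(r.ExpectedResources) == 0 {
		return false
	}
	for _, c := range r.DeploymentChecks {
		if c == "expected-resources" {
			return true
		}
	}
	return false
}

func main() {
	legacy := Recipe{} // an existing recipe with neither section
	current := Recipe{
		ExpectedResources: []string{"Deployment/gpu-operator/gpu-operator"},
		DeploymentChecks:  []string{"expected-resources"},
	}
	fmt.Println(shouldRunExpectedResources(legacy), shouldRunExpectedResources(current))
}
```

Under these assumptions a recipe that predates this change never triggers the check, which is what makes the rollout non-breaking.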

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

Member

@mchmarny mchmarny left a comment

Good PR overall — follows existing check patterns, solid unit test coverage, and the metadata_store bug fix is a necessary companion change.

Key items to address:

  • Timeout constants (10s in verifyResource, phase defaults) should move to pkg/defaults per project convention
  • Secret kind is mentioned in the PR description but not implemented in the switch — either add it or correct the description
  • Nil-guard removal on recipeResult in validatePerformance is a separate behavioral change worth calling out
  • E2e fail-case test can never actually fail due to warn + pass fallback

See inline comments for details.

@github-actions

@xdu31 this PR now has merge conflicts with main. Please rebase to resolve them.

@xdu31 xdu31 requested a review from mchmarny February 19, 2026 17:01
@xdu31 xdu31 force-pushed the feat/deployment-check branch from 8d00dd6 to e240fd5 Compare February 19, 2026 17:11
Member

@mchmarny mchmarny left a comment

/lgtm

@mchmarny mchmarny merged commit 0463e2d into NVIDIA:main Feb 19, 2026
15 checks passed