Skip to content

feat(cli): update-state + targeted reset --package (closes #257)#258

Open
ayuskauskas wants to merge 17 commits into
mainfrom
feat/cli-update-state
Open

feat(cli): update-state + targeted reset --package (closes #257)#258
ayuskauskas wants to merge 17 commits into
mainfrom
feat/cli-update-state

Conversation

@ayuskauskas
Copy link
Copy Markdown
Collaborator

Summary

  • Adds kubectl skyhook update-state <skyhook> <package> <version> <stage> <state> for surgical edits to the per-node nodeState annotation, with an --add mode (gated on --node/--selector) for creating fresh entries.
  • Adds kubectl skyhook reset --package <name>[:<version>] for targeted single-package state cleanup.
  • Shared refactor: fixes utils.SetNodeAnnotation JSON-quoting bug (was using %q Go quoting); extracts utils.ConfirmYN and utils.ListNodesWithSkyhookState; gates new commands on operator >= v0.15.0.

Closes #257.

Highlights

  • Pause-the-CR warning prominently documented in docs/cli.md — both commands do a read-modify-write on the annotation without resource-version concurrency control, so the operator can clobber the edit if it isn't paused.
  • Version gating uses existing utils.GetSkyhookVersion / utils.DiscoverOperatorVersion infrastructure with a new MinNodeStateSupportVersion = "v0.15.0" constant. Older operators get a clear error rather than silent corruption.
  • --add requires explicit --node or --selector so a fresh-entry creation can never accidentally touch every node in the cluster.
  • reset --package deliberately skips batch reset (// why: comment in code) — removing one package's state shouldn't restart the entire rollout from batch 1.

Test plan

  • make unit-tests — 10 suites, all green (envtest)
  • make build — operator + CLI clean
  • make lint — 0 issues
  • Ginkgo coverage: ~70 specs in cmd/cli/app including 11 update-state cases, 4 --package cases, version-gating rejection cases for both commands, list-failure regression cases
  • Chainsaw fixtures added: update-state/, reset-package/, update-state-reject/ (happy + rejection paths)
  • Chainsaw e2e in CI (requires kind cluster)
  • Manual smoke against a real cluster

Behavior changes (incidental)

  • utils.SetNodeAnnotation now JSON-marshals the patch body. No existing caller passes non-ASCII, so no regression — but the previous %q formatting was technically wrong and would have produced invalid JSON for multi-byte annotation values.

Notes for review

  • Plan lived at docs/plans/2026-05-26-cli-update-state.md (gitignored).
  • Operator version gating uses the same pattern as lifecycle.go:99-132.
  • Concurrent-modification race is documented loudly rather than handled in code — the issue's acceptance criteria did not require resource-version-based concurrency control, and a "pause the CR" workflow is the safer admin discipline.

…irmYN and ListNodesWithSkyhookState

Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
The refactor to utils.ListNodesWithSkyhookState left the reset command
treating a hard list-from-apiserver failure (RBAC denied, unreachable
apiserver) the same as a per-node parse failure: a single Warning log
under --verbose, then control falls through to the "No nodes have
state" early-return. Restore the pre-refactor behavior by aborting
when the helper returns a nil result map (its signal for list
failure), while still surfacing parse-only failures as warnings.

Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
… use METADATA_PREFIX

Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
….go pattern

Mirror reset.go's list-error handling so an apiserver/RBAC failure surfaces
as a wrapped error instead of being silently swallowed under non-verbose
mode. Drop the package-level client-injection seam (newUpdateStateClient)
and inline the client construction in RunE, matching reset.go; rewrite the
runUpdateState specs to call the runner directly with a fake client like
reset_test.go does. Tighten the dry-run test to also cover the non-empty
targets path (asserts [dry-run] output and that the annotation is
unchanged), and assert Name/Version on the --add create test.

Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
…anup

Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
…ause requirement

Documents the new `update-state` command (surgical edit to the per-node
nodeState annotation) including a loud pause-the-CR callout, the `--add`
flag and its requirement for explicit scope, and adds `--package` to the
existing `reset` command. Updates the command compatibility matrix with
new rows and notes the conservative operator version pin since the
annotation shape has been stable across recent operator versions.

Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
…nnotation fix

Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
… clone NodeState before mutation; add chainsaw rejection coverage

Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 28, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a new kubectl-skyhook subcommand update-state that performs read-modify-write edits to per-node skyhook nodeState annotations (stage/state validation, optional --node/--selector, --add creation mode, operator-version gating). Extends reset with --package [:] to remove only a package entry from nodeState without touching deployment batch state. Provides JSON-safe SetNodeAnnotation, ListNodesWithSkyhookState discovery, ConfirmYN prompt, unit tests, Chainsaw e2e manifests, and CHANGELOG/docs updates. Minimum nodeState operator support constant set to v0.7.5.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • mskalka
  • rice-riley
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat(cli): update-state + targeted reset --package (closes #257)' clearly summarizes the main changes: adding update-state command and targeted reset with --package flag, both core to the PR.
Description check ✅ Passed The description comprehensively covers the PR's main changes including the new commands, shared refactoring, version gating, documentation updates, and test plan with clear sections on highlights and behavior changes.
Linked Issues check ✅ Passed All objectives from issue #257 are implemented: update-state command with validation and scoping, reset --package with version matching, operator version gating (v0.15.0), documentation updates, and Chainsaw fixtures for both happy and rejection paths.
Out of Scope Changes check ✅ Passed All changes are directly scoped to the stated objectives: CLI commands for node-state annotation edits, documentation updates, utility refactoring for these commands, test additions for the new features, and pool label additions for test organization—no unrelated changes detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/cli-update-state

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
k8s-tests/chainsaw/cli/reset-package/skyhook.yaml (1)

21-42: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add explicit uninstall policy fields in the Skyhook spec.

This manifest omits spec.uninstall.enabled and spec.uninstall.apply, which are required to make uninstall behavior explicit and deterministic in test scenarios.

As per coding guidelines, "Explicitly set uninstall.enabled and uninstall.apply fields to control package uninstall behavior in Skyhook CRs".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s-tests/chainsaw/cli/reset-package/skyhook.yaml` around lines 21 - 42, Add
an explicit spec.uninstall block to the Skyhook manifest containing the
uninstall.enabled and uninstall.apply fields so uninstall behavior is
deterministic; update the Skyhook CR that includes packages pkg1 and pkg2 by
adding e.g. spec.uninstall: enabled: true and apply: true (or set the values
required by the test) immediately under spec, ensuring both uninstall.enabled
and uninstall.apply are present for the manifest that defines pkg1 and pkg2.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/cli.md`:
- Around line 438-463: The example showing the package-specific reset is
redundant because --skip-batch-reset is already implied when using the reset
subcommand with --package; update the example under the "Surgical Recovery"
section so it either removes the redundant --skip-batch-reset flag (use "kubectl
skyhook reset my-skyhook --package my-package:1.0 --confirm") or keep the flag
but add a brief inline comment clarifying that --skip-batch-reset is shown for
explicitness but is implied by --package; ensure the change references the reset
command and the --package/--skip-batch-reset flags so reviewers can find the
sample to modify.

In `@k8s-tests/chainsaw/cli/reset-package/chainsaw-test.yaml`:
- Around line 20-22: Add an explicit metadata.labels.pool entry to the manifest
for the resource named cli-reset-package by adding a labels map under metadata
with pool set to the correct test-pool value (choose one of core, interrupt,
uninstall, lifecycle) that matches this test's cluster-state/runtime mutation
profile; ensure the label key is metadata.labels.pool and it sits alongside the
existing metadata.name entry for the cli-reset-package manifest.

In `@k8s-tests/chainsaw/cli/reset-package/skyhook.yaml`:
- Line 28: The image reference "ghcr.io/nvidia/skyhook-packages/shellscript" in
the skyhook.yaml manifest is not an allowed registry prefix; update the image
fields (the occurrences at the image lines, currently set to
ghcr.io/nvidia/skyhook-packages/shellscript) to use an approved prefix such as
ghcr.io/nvidia/skyhook/ or ghcr.io/nvidia/nodewright/ (e.g., move the package
under ghcr.io/nvidia/skyhook/<repo> or ghcr.io/nvidia/nodewright/<repo>),
ensuring all occurrences (including the one noted also at the other image line)
are changed to match the allowed registry prefixes.

In `@k8s-tests/chainsaw/cli/update-state-reject/chainsaw-test.yaml`:
- Around line 20-22: This Chainsaw test resource (metadata.name
"cli-update-state-reject") lacks a metadata.labels.pool, so add a labels section
under metadata with pool set to the appropriate mutation profile value (one of
core, interrupt, uninstall, lifecycle) that matches this test's
cluster-state/runtime mutation behavior; ensure the label key is
metadata.labels.pool and that it is present alongside the existing metadata.name
so the suite is routed by mutation-profile-based routing.

In `@k8s-tests/chainsaw/cli/update-state-reject/skyhook.yaml`:
- Around line 25-30: Update the Skyhook package manifest under
spec.packages.test-package to explicitly set uninstall.enabled and
uninstall.apply (e.g., add uninstall: { enabled: <true|false>, apply:
<true|false> } to test-package) and change the package image
(packages.test-package.image) to an allowed registry prefix (use
ghcr.io/nvidia/skyhook/* or ghcr.io/nvidia/nodewright/* instead of
ghcr.io/nvidia/skyhook-packages/...) so the CR conforms to the manifest policy.

In `@k8s-tests/chainsaw/cli/update-state/chainsaw-test.yaml`:
- Around line 20-22: The Test manifest for metadata.name "cli-update-state" is
missing the required metadata.labels.pool label used for pool routing; add a
metadata.labels.pool entry under the manifest's metadata (for example set pool:
core, interrupt, uninstall, or lifecycle) choosing the value that matches the
cluster/runtime mutation profile exercised by the "cli-update-state" test so
Chainsaw routes the e2e correctly.

In `@k8s-tests/chainsaw/cli/update-state/skyhook.yaml`:
- Around line 25-30: The package entry for "test-package" is missing explicit
uninstall fields and uses a disallowed image registry; update the package spec
by adding an uninstall block with uninstall.enabled set to true or false as
appropriate and uninstall.apply set to the command or script to run on uninstall
(use the same key names uninstall.enabled and uninstall.apply under the
test-package entry), and replace the image
ghcr.io/nvidia/skyhook-packages/shellscript with an allowed registry image under
either ghcr.io/nvidia/nodewright/* or ghcr.io/nvidia/skyhook/* (choose the
correct repository/name that provides the same shellscript functionality).

---

Outside diff comments:
In `@k8s-tests/chainsaw/cli/reset-package/skyhook.yaml`:
- Around line 21-42: Add an explicit spec.uninstall block to the Skyhook
manifest containing the uninstall.enabled and uninstall.apply fields so
uninstall behavior is deterministic; update the Skyhook CR that includes
packages pkg1 and pkg2 by adding e.g. spec.uninstall: enabled: true and apply:
true (or set the values required by the test) immediately under spec, ensuring
both uninstall.enabled and uninstall.apply are present for the manifest that
defines pkg1 and pkg2.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 29a19200-cdfe-4098-b906-9823414f3b56

📥 Commits

Reviewing files that changed from the base of the PR and between b264019 and f1efd9c.

📒 Files selected for processing (20)
  • docs/cli.md
  • k8s-tests/chainsaw/cli/reset-package/assert-skyhook-complete.yaml
  • k8s-tests/chainsaw/cli/reset-package/assert-skyhook-disabled.yaml
  • k8s-tests/chainsaw/cli/reset-package/chainsaw-test.yaml
  • k8s-tests/chainsaw/cli/reset-package/skyhook.yaml
  • k8s-tests/chainsaw/cli/update-state-reject/assert-skyhook-complete.yaml
  • k8s-tests/chainsaw/cli/update-state-reject/chainsaw-test.yaml
  • k8s-tests/chainsaw/cli/update-state-reject/skyhook.yaml
  • k8s-tests/chainsaw/cli/update-state/assert-skyhook-complete.yaml
  • k8s-tests/chainsaw/cli/update-state/assert-skyhook-disabled.yaml
  • k8s-tests/chainsaw/cli/update-state/chainsaw-test.yaml
  • k8s-tests/chainsaw/cli/update-state/skyhook.yaml
  • operator/cmd/cli/CHANGELOG.md
  • operator/cmd/cli/app/cli.go
  • operator/cmd/cli/app/reset.go
  • operator/cmd/cli/app/reset_test.go
  • operator/cmd/cli/app/update_state.go
  • operator/cmd/cli/app/update_state_test.go
  • operator/internal/cli/utils/utils.go
  • operator/internal/cli/utils/utils_test.go

Comment thread docs/cli.md
Comment thread k8s-tests/chainsaw/cli/reset-package/chainsaw-test.yaml
packages:
pkg1:
version: "1.0.0"
image: ghcr.io/nvidia/skyhook-packages/shellscript
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use an allowed package image registry path.

ghcr.io/nvidia/skyhook-packages/shellscript does not match the allowed registry prefixes for Skyhook manifests.

As per coding guidelines, "Container images must be pulled from ghcr.io/nvidia/nodewright/* or ghcr.io/nvidia/skyhook/* registries as per current distribution".

Also applies to: 36-36

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s-tests/chainsaw/cli/reset-package/skyhook.yaml` at line 28, The image
reference "ghcr.io/nvidia/skyhook-packages/shellscript" in the skyhook.yaml
manifest is not an allowed registry prefix; update the image fields (the
occurrences at the image lines, currently set to
ghcr.io/nvidia/skyhook-packages/shellscript) to use an approved prefix such as
ghcr.io/nvidia/skyhook/ or ghcr.io/nvidia/nodewright/ (e.g., move the package
under ghcr.io/nvidia/skyhook/<repo> or ghcr.io/nvidia/nodewright/<repo>),
ensuring all occurrences (including the one noted also at the other image line)
are changed to match the allowed registry prefixes.

Comment on lines +20 to +22
metadata:
name: cli-update-state-reject
spec:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Set a metadata.labels.pool on this Chainsaw test resource.

The test currently has no explicit pool classification, which breaks the expected mutation-profile-based routing of Chainsaw suites.

Based on learnings: In Chainsaw e2e test YAMLs, the pool label (core, interrupt, uninstall, lifecycle) must be assigned based on the cluster-state mutation/runtime mutation profile.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s-tests/chainsaw/cli/update-state-reject/chainsaw-test.yaml` around lines
20 - 22, This Chainsaw test resource (metadata.name "cli-update-state-reject")
lacks a metadata.labels.pool, so add a labels section under metadata with pool
set to the appropriate mutation profile value (one of core, interrupt,
uninstall, lifecycle) that matches this test's cluster-state/runtime mutation
behavior; ensure the label key is metadata.labels.pool and that it is present
alongside the existing metadata.name so the suite is routed by
mutation-profile-based routing.

Comment on lines +25 to +30
packages:
test-package:
version: "1.0.0"
image: ghcr.io/nvidia/skyhook-packages/shellscript
configMap:
apply.sh: |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Align package spec with Skyhook manifest policy.

spec.packages.test-package is missing explicit uninstall.enabled and uninstall.apply, and the image registry (ghcr.io/nvidia/skyhook-packages/...) is outside the allowed registry prefixes.

As per coding guidelines: Explicitly set uninstall.enabled and uninstall.apply fields to control package uninstall behavior in Skyhook CRs and Container images must be pulled from ghcr.io/nvidia/nodewright/* or ghcr.io/nvidia/skyhook/* registries.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s-tests/chainsaw/cli/update-state-reject/skyhook.yaml` around lines 25 -
30, Update the Skyhook package manifest under spec.packages.test-package to
explicitly set uninstall.enabled and uninstall.apply (e.g., add uninstall: {
enabled: <true|false>, apply: <true|false> } to test-package) and change the
package image (packages.test-package.image) to an allowed registry prefix (use
ghcr.io/nvidia/skyhook/* or ghcr.io/nvidia/nodewright/* instead of
ghcr.io/nvidia/skyhook-packages/...) so the CR conforms to the manifest policy.

Comment thread k8s-tests/chainsaw/cli/update-state/chainsaw-test.yaml
Comment on lines +25 to +30
packages:
test-package:
version: "1.0.0"
image: ghcr.io/nvidia/skyhook-packages/shellscript
configMap:
apply.sh: |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Update package fields to satisfy manifest requirements.

This package entry omits uninstall.enabled/uninstall.apply, and uses an image from ghcr.io/nvidia/skyhook-packages/*, which is not one of the allowed registry prefixes.

As per coding guidelines: Explicitly set uninstall.enabled and uninstall.apply fields to control package uninstall behavior in Skyhook CRs and Container images must be pulled from ghcr.io/nvidia/nodewright/* or ghcr.io/nvidia/skyhook/* registries.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s-tests/chainsaw/cli/update-state/skyhook.yaml` around lines 25 - 30, The
package entry for "test-package" is missing explicit uninstall fields and uses a
disallowed image registry; update the package spec by adding an uninstall block
with uninstall.enabled set to true or false as appropriate and uninstall.apply
set to the command or script to run on uninstall (use the same key names
uninstall.enabled and uninstall.apply under the test-package entry), and replace
the image ghcr.io/nvidia/skyhook-packages/shellscript with an allowed registry
image under either ghcr.io/nvidia/nodewright/* or ghcr.io/nvidia/skyhook/*
(choose the correct repository/name that provides the same shellscript
functionality).

_, _ = fmt.Fprintf(cmd.OutOrStdout(), "Batch state reset for Skyhook %q\n", skyhookName)
}

func splitPackageRef(s string) (name, version string, hasVersion bool, err error) {
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is used in both reset and update this should really be in utils.go

…pe has been stable since then

Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 28, 2026

Actionable comments posted: 0

…: cli labels

- Move checkNodeStateOperatorVersion out of update_state.go into
  utils.CheckNodeStateOperatorVersion so reset.go and update_state.go
  share the same implementation instead of one calling across files.
  Signature switched to (kube, namespace) primitives so utils stays
  free of client.Client / cliContext dependencies.
- Label all k8s-tests/chainsaw/cli/* tests with pool: cli for
  classification consistency; document the new pool in
  docs/ci-test-pools.md (CLI tests are run by cli-e2e-tests, not by
  the pool-routed e2e-tests target, so the label is metadata-only for
  now).
- docs/cli.md: drop redundant --skip-batch-reset from the surgical
  recovery reset --package example (batch reset is already implied by
  --package per the same doc).

Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/cli.md (1)

13-13: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Update compatibility versions to match current command gating.

The version matrix and note still claim reset --package/update-state are supported on v0.7.5+, but this PR’s objective states these commands are gated at v0.15.0+. Please align the docs (including the minimum-version text) to prevent operators from following unsupported flows on older clusters.

Also applies to: 28-30, 36-40

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/cli.md` at line 13, The documentation currently states incompatible
minimum operator versions; update the CLI docs to reflect that the commands
"reset --package" and "update-state" (and the version matrix and minimum-version
note) require operator version v0.15.0 or later instead of v0.7.5+, and adjust
all related lines (including the version note and the matrix entries around the
earlier mentions) so they consistently say v0.15.0+ to prevent guidance for
older clusters.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@k8s-tests/chainsaw/cli/reset-package/chainsaw-test.yaml`:
- Around line 22-23: The YAML uses an unsupported pool label value "pool: cli";
update the label under labels (the `pool` key) to one of the supported
mutation-profile values—core, interrupt, uninstall, or lifecycle—based on which
cluster-state or runtime mutation this test exercises (i.e., replace `pool: cli`
with the correct profile name).

In `@k8s-tests/chainsaw/cli/update-state-reject/chainsaw-test.yaml`:
- Around line 22-23: The manifest sets labels: pool: cli which is unsupported
for Chainsaw e2e tests; update the pool label value under labels (the `pool`
key) to the correct mutation-profile for this test (choose one of: core,
interrupt, uninstall, or lifecycle) — replace `cli` with the appropriate profile
(e.g., `interrupt` if this test exercises runtime interruption, otherwise pick
`core`, `uninstall`, or `lifecycle`) so the test follows the Chainsaw routing
contract.

In `@operator/internal/cli/utils/utils.go`:
- Around line 393-414: Update the minimum supported operator version constant
used by CheckNodeStateOperatorVersion: change the value of
MinNodeStateSupportVersion (currently "v0.7.5") to the documented floor
"v0.15.0" so the compatibility check in CheckNodeStateOperatorVersion and any
callers use the new minimum; locate the MinNodeStateSupportVersion constant
definition and update it accordingly, then run unit/integration tests that cover
CompareVersions/IsValidVersion to ensure no regressions.

---

Outside diff comments:
In `@docs/cli.md`:
- Line 13: The documentation currently states incompatible minimum operator
versions; update the CLI docs to reflect that the commands "reset --package" and
"update-state" (and the version matrix and minimum-version note) require
operator version v0.15.0 or later instead of v0.7.5+, and adjust all related
lines (including the version note and the matrix entries around the earlier
mentions) so they consistently say v0.15.0+ to prevent guidance for older
clusters.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: bf9df2f4-9be1-4f61-99a7-8e4aacbcf2c4

📥 Commits

Reviewing files that changed from the base of the PR and between fb5e862 and 47b40da.

📒 Files selected for processing (13)
  • docs/ci-test-pools.md
  • docs/cli.md
  • k8s-tests/chainsaw/cli/deployment-policy/chainsaw-test.yaml
  • k8s-tests/chainsaw/cli/lifecycle/chainsaw-test.yaml
  • k8s-tests/chainsaw/cli/node/chainsaw-test.yaml
  • k8s-tests/chainsaw/cli/package/chainsaw-test.yaml
  • k8s-tests/chainsaw/cli/reset-package/chainsaw-test.yaml
  • k8s-tests/chainsaw/cli/reset/chainsaw-test.yaml
  • k8s-tests/chainsaw/cli/update-state-reject/chainsaw-test.yaml
  • k8s-tests/chainsaw/cli/update-state/chainsaw-test.yaml
  • operator/cmd/cli/app/reset.go
  • operator/cmd/cli/app/update_state.go
  • operator/internal/cli/utils/utils.go

Comment on lines +22 to +23
labels:
pool: cli
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use a mutation-profile pool value, not cli.

pool: cli is not one of the supported Chainsaw pool values, so this test won't be classified by the expected mutation-profile routing. Please switch it to the correct pool for the behavior this scenario exercises.

As per coding guidelines: In Chainsaw e2e test YAMLs, the pool label (core, interrupt, uninstall, lifecycle) must be assigned based on the cluster-state mutation/runtime mutation profile the test exercises.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s-tests/chainsaw/cli/reset-package/chainsaw-test.yaml` around lines 22 -
23, The YAML uses an unsupported pool label value "pool: cli"; update the label
under labels (the `pool` key) to one of the supported mutation-profile
values—core, interrupt, uninstall, or lifecycle—based on which cluster-state or
runtime mutation this test exercises (i.e., replace `pool: cli` with the correct
profile name).

Comment on lines +22 to +23
labels:
pool: cli
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use a mutation-profile pool value, not cli.

pool: cli is outside the supported pool set, so this manifest won't follow the documented Chainsaw routing contract. Please assign the actual mutation-profile pool for this scenario (core, interrupt, uninstall, or lifecycle).

As per coding guidelines: In Chainsaw e2e test YAMLs, the pool label (core, interrupt, uninstall, lifecycle) must be assigned based on the cluster-state mutation/runtime mutation profile the test exercises.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s-tests/chainsaw/cli/update-state-reject/chainsaw-test.yaml` around lines
22 - 23, The manifest sets labels: pool: cli which is unsupported for Chainsaw
e2e tests; update the pool label value under labels (the `pool` key) to the
correct mutation-profile for this test (choose one of: core, interrupt,
uninstall, or lifecycle) — replace `cli` with the appropriate profile (e.g.,
`interrupt` if this test exercises runtime interruption, otherwise pick `core`,
`uninstall`, or `lifecycle`) so the test follows the Chainsaw routing contract.

Comment on lines +393 to +414
// CheckNodeStateOperatorVersion rejects the call when the running operator is
// older than MinNodeStateSupportVersion. The check first reads the version
// annotation written by the operator onto the Skyhook CR; when that's missing
// or non-semver (e.g. "dev") it falls back to inspecting the operator
// Deployment. If neither source yields a valid version we warn but allow the
// edit to proceed — better than refusing every command in clusters where the
// CLI can't see the operator's namespace.
func CheckNodeStateOperatorVersion(ctx context.Context, cmd *cobra.Command, kube kubernetes.Interface, namespace string, skyhook *v1alpha1.Skyhook) error {
opVersion := GetSkyhookVersion(skyhook)
if opVersion == "" || !IsValidVersion(opVersion) {
deployVersion, derr := DiscoverOperatorVersion(ctx, kube, namespace)
if derr == nil && IsValidVersion(deployVersion) {
opVersion = deployVersion
} else {
_, _ = fmt.Fprintf(cmd.ErrOrStderr(), "Warning: unable to determine operator version; cannot verify compatibility (requires %s+)\n", MinNodeStateSupportVersion)
}
}
if IsValidVersion(opVersion) && CompareVersions(opVersion, MinNodeStateSupportVersion) < 0 {
return fmt.Errorf("operator version %s does not support this command; minimum supported version is %s", opVersion, MinNodeStateSupportVersion)
}
return nil
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Raise the shared operator-version floor to the documented minimum.

This helper now gates both new CLI commands, but it still compares against MinNodeStateSupportVersion = "v0.7.5". That lets operators older than the PR's v0.15.0 floor pass the compatibility check.

Suggested fix
-	// MinNodeStateSupportVersion is the lowest operator version known to use
-	// the current map[string]PackageStatus shape for the
-	// skyhook.nvidia.com/nodeState_<skyhook> annotation. The shape itself has
-	// been stable since this version; what has evolved in later releases is
-	// the set of recognized stage / state values (e.g. uninstall and
-	// uninstall-interrupt were added in v0.16.0). Users targeting an older
-	// operator must pick stage / state values that operator recognises.
-	MinNodeStateSupportVersion = "v0.7.5"
+	// MinNodeStateSupportVersion is the minimum operator version supported by
+	// the CLI node-state editing commands.
+	MinNodeStateSupportVersion = "v0.15.0"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@operator/internal/cli/utils/utils.go` around lines 393 - 414, Update the
minimum supported operator version constant used by
CheckNodeStateOperatorVersion: change the value of MinNodeStateSupportVersion
(currently "v0.7.5") to the documented floor "v0.15.0" so the compatibility
check in CheckNodeStateOperatorVersion and any callers use the new minimum;
locate the MinNodeStateSupportVersion constant definition and update it
accordingly, then run unit/integration tests that cover
CompareVersions/IsValidVersion to ensure no regressions.

Chainsaw runs script steps under /bin/sh, where `-o pipefail` is
invalid. The assertion scripts in cli-update-state and
cli-reset-package failed at startup with "sh: 1: set: Illegal option
-o pipefail" before they could check the annotation.

The update-state command itself ran fine (1 node updated); only the
post-mutation assertion shell failed to start. Switching to `set -eu`
keeps strict-error and undefined-variable behavior without the
bash-only pipefail.

Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
k8s-tests/chainsaw/cli/update-state/chainsaw-test.yaml (1)

22-23: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use a supported mutation-profile pool label instead of cli.

metadata.labels.pool must be one of core, interrupt, uninstall, or lifecycle; cli is not a valid routing value for Chainsaw pool classification.

As per coding guidelines: “In Chainsaw e2e test YAMLs, the pool label (core, interrupt, uninstall, lifecycle) must be assigned based on the cluster-state mutation/runtime mutation profile the test exercises.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s-tests/chainsaw/cli/update-state/chainsaw-test.yaml` around lines 22 - 23,
The metadata.labels.pool currently uses the unsupported value "cli"; change
metadata.labels.pool to one of the allowed mutation-profile values (core,
interrupt, uninstall, or lifecycle) that matches the test's
cluster-state/runtime mutation behavior (e.g., set pool: core if it's a standard
core mutation test, interrupt for preemptive interrupt scenarios, uninstall for
teardown tests, or lifecycle for stateful lifecycle tests) by updating the
labels: pool entry accordingly in the YAML.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@k8s-tests/chainsaw/cli/update-state/chainsaw-test.yaml`:
- Around line 22-23: The metadata.labels.pool currently uses the unsupported
value "cli"; change metadata.labels.pool to one of the allowed mutation-profile
values (core, interrupt, uninstall, or lifecycle) that matches the test's
cluster-state/runtime mutation behavior (e.g., set pool: core if it's a standard
core mutation test, interrupt for preemptive interrupt scenarios, uninstall for
teardown tests, or lifecycle for stateful lifecycle tests) by updating the
labels: pool entry accordingly in the YAML.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 8a3f8aa1-9c3f-4a7b-8aed-24dc0561c208

📥 Commits

Reviewing files that changed from the base of the PR and between 47b40da and ce2dbc8.

📒 Files selected for processing (2)
  • k8s-tests/chainsaw/cli/reset-package/chainsaw-test.yaml
  • k8s-tests/chainsaw/cli/update-state/chainsaw-test.yaml

The cli-reset-package test exercises a two-package Skyhook lifecycle
which doesn't reliably fit in the previous 120s assert budget on
kind CI runners. Bump to 240s to match config-skyhook, which runs
similar multi-step lifecycles.

Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
The `glob update` step in config-skyhook/chainsaw-test.yaml is the
canonical reproducer for issue #245 (Owned ConfigMap can permanently
desync from spec when a stale Status.ConfigUpdates entry drives a
spurious interrupt cycle). The operator gets exactly one good
reconcile window per spec change to satisfy the CM-write gate, and
timing on kind regularly misses it.

The issue itself documents this CI workaround:
"Skip the `glob update` step in config-skyhook/chainsaw-test.yaml
with a # TODO(<this-issue>) comment until the fix lands."

This is unrelated to PR #258's CLI changes; the test flake was
previously masked by other failures in the same file and surfaced
after PR #242 closed the StageInterrupt trap.

Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CLI: fine-grained edits to node-state annotation + targeted reset

1 participant