17 commits
8a3a1ac
refactor(cli/utils): fix SetNodeAnnotation JSON quoting; extract Conf…
ayuskauskas May 27, 2026
47fb655
fix(cli/reset): surface list-nodes errors instead of silently no-op
ayuskauskas May 27, 2026
7fe97f1
docs(cli/utils): document ListNodesWithSkyhookState nil-map contract;…
ayuskauskas May 28, 2026
352a2ac
feat(cli): scaffold update-state command with --add flag
ayuskauskas May 28, 2026
f8028fd
feat(cli): implement update-state with validation, warnings, and --add
ayuskauskas May 28, 2026
0127a2e
fix(cli/update-state): surface list errors and align tests with reset…
ayuskauskas May 28, 2026
2159bdf
feat(cli): add --package flag to reset for targeted package state cle…
ayuskauskas May 28, 2026
78d67ee
test(e2e): chainsaw coverage for update-state
ayuskauskas May 28, 2026
e9fdcd4
test(e2e): chainsaw coverage for reset --package
ayuskauskas May 28, 2026
c6a22e4
docs(cli): document update-state and reset --package; warn about CR p…
ayuskauskas May 28, 2026
8473f6f
chore(cli): changelog for update-state, reset --package, and SetNodeA…
ayuskauskas May 28, 2026
f1efd9c
feat(cli): gate update-state and reset --package on operator version;…
ayuskauskas May 28, 2026
fb5e862
fix(cli): lower MinNodeStateSupportVersion to v0.7.5 — annotation sha…
ayuskauskas May 28, 2026
47b40da
refactor(cli): hoist CheckNodeStateOperatorVersion to utils; add pool…
ayuskauskas May 28, 2026
ce2dbc8
fix(e2e): chainsaw script preamble — drop pipefail (sh, not bash)
ayuskauskas May 28, 2026
5c7f7cb
fix(e2e): bump cli-reset-package assert timeout to 240s
ayuskauskas May 29, 2026
3b75547
test(e2e): skip config-skyhook glob update step per #245
ayuskauskas May 29, 2026
5 changes: 5 additions & 0 deletions docs/ci-test-pools.md
@@ -12,6 +12,11 @@ Every test under `k8s-tests/chainsaw/skyhook/*/chainsaw-test.yaml` has a top-lev
- `uninstall` — uninstall, upgrade, downgrade lifecycle.
- `lifecycle` — pause/disable/delete/finalizer/state.

Tests under `k8s-tests/chainsaw/cli/` are a separate suite driven by the
`cli-e2e-tests` target (not `e2e-tests`). They are labelled
`pool: cli` for classification consistency, but the `cli-e2e-tests` target
currently runs all of them regardless of any pool selector.

## Running locally

```bash
131 changes: 131 additions & 0 deletions docs/cli.md
@@ -25,12 +25,25 @@ The CLI requires **operator version v0.8.0 or later** for full functionality of
| `package rerun` | ✅ Full | ✅ Full |
| `package logs` | ✅ Full | ✅ Full |
| `reset` | ✅ Full | ✅ Full |
| `reset --package` | ✅ Full (v0.7.5+) | ✅ Full |
| `update-state` | ✅ Full (v0.7.5+) — see note | ✅ Full — see note |
| `deployment-policy reset` | ❌ Not supported | ✅ Full |
| `pause` | ❌ Not supported | ✅ Full |
| `resume` | ❌ Not supported | ✅ Full |
| `disable` | ❌ Not supported | ✅ Full |
| `enable` | ❌ Not supported | ✅ Full |

> **Note on `update-state` and `reset --package`:** These commands edit the
> `skyhook.nvidia.com/nodeState_<skyhook>` annotation in-place. The
> annotation's `map[string]PackageStatus` shape has been stable since
> **operator v0.7.5**, and the CLI refuses to run against anything older.
> What *has* evolved across operator releases is the set of recognised
> stage and state values — for example `uninstall` and `uninstall-interrupt`
> were added in v0.16.0. Picking a stage/state your operator doesn't
> recognise will leave the package in a state the operator can't progress
> from. Confirm the stage/state values are valid for your operator before
> running these commands.
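
For scripts that wrap these commands, the version gate described in the note can be approximated with a small shell helper. This is a sketch, not the CLI's own check: `version_parts` and `version_ge` are hypothetical names, the helper ignores pre-release and build suffixes, and how the operator version is obtained is left to the caller.

```bash
# Hypothetical helpers (not part of the Skyhook CLI).

# Print "MAJOR MINOR PATCH" for a version such as "v0.7.5" or "0.8".
# Missing components default to 0; any "-rc1" / "+build" suffix is dropped.
version_parts() {
    v=${1#v}            # drop an optional leading "v"
    v=${v%%[-+]*}       # drop pre-release / build metadata
    major=${v%%.*}
    case $v in *.*) rest=${v#*.} ;; *) rest=0 ;; esac
    minor=${rest%%.*}
    case $rest in *.*) patch=${rest#*.} ;; *) patch=0 ;; esac
    patch=${patch%%.*}
    echo "${major:-0} ${minor:-0} ${patch:-0}"
}

# Succeed when version $1 is greater than or equal to version $2.
# Components are compared numerically, so v0.16.0 sorts after v0.7.5.
version_ge() {
    set -- $(version_parts "$1") $(version_parts "$2")
    [ "$1" -gt "$4" ] && return 0
    [ "$1" -lt "$4" ] && return 1
    [ "$2" -gt "$5" ] && return 0
    [ "$2" -lt "$5" ] && return 1
    [ "$3" -ge "$6" ]
}
```

For example, `version_ge "$operator_version" v0.7.5 || echo "operator too old"` would mirror the documented minimum, where `operator_version` is a variable the calling script populates itself.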

### Breaking Change: Pause/Disable Mechanism

In operator versions **v0.7.x and earlier**, pausing and disabling a Skyhook was done via spec fields:
@@ -142,15 +155,107 @@ kubectl skyhook reset gpu-init --dry-run

# Reset nodes only, preserve deployment policy batch state
kubectl skyhook reset gpu-init --skip-batch-reset --confirm

# Reset only a single package across all tracked nodes
kubectl skyhook reset gpu-init --package pkg1:1.0 --confirm --skip-batch-reset
```

| Flag | Description |
|------|-------------|
| `--confirm, -y` | Skip confirmation prompt |
| `--skip-batch-reset` | Skip resetting deployment policy batch state |
| `--package <name>[:<version>]` | Reset only this package's state on each node |

> **Note:** By default, `reset` also resets the deployment policy batch state so the next rollout starts from batch 1, and clears node ordering state (`NodeOrderOffset` and `NodePriority`) so `SKYHOOK_NODE_ORDER` restarts from `0`. Use `--skip-batch-reset` to preserve the existing batch and ordering state.

#### `--package <name>[:<version>]`

When `--package` is set, `reset` removes only the named package's entry from
each node's `nodeState` annotation instead of removing the whole annotation:

- If `<version>` is supplied (e.g. `pkg1:1.0`), only entries whose recorded
version matches are removed; nodes with the package at a different version
are left untouched and reported in the command output.
- If the package was the last entry in the annotation, the entire annotation
is removed (matching the behavior of a full reset).
- Batch state is **deliberately not reset** on this path regardless of
`--skip-batch-reset`: restarting the full rollout from batch 1 to recover a
single package would be disproportionate. Pair with
`kubectl skyhook deployment-policy reset` explicitly if you also want to
restart batches.
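
The matching rule in the first bullet can be illustrated in shell. This is a sketch under two assumptions: that annotation keys have the form `<name>|<version>` (as in the e2e test's `pkg1|1.0.0`), and that a supplied version is compared by exact string equality. `split_package_arg` and `key_matches` are hypothetical helpers, not part of the CLI.

```bash
# Hypothetical helpers illustrating --package <name>[:<version>] matching.

# Split "<name>[:<version>]" into $pkg_name and $pkg_version
# ($pkg_version is empty when no version was supplied).
split_package_arg() {
    pkg_name=${1%%:*}
    case $1 in
        *:*) pkg_version=${1#*:} ;;
        *)   pkg_version= ;;
    esac
}

# Succeed when annotation key $1 ("<name>|<version>", an assumed format)
# is selected by the --package argument $2.
key_matches() {
    split_package_arg "$2"
    key_name=${1%%"|"*}
    key_version=${1#*"|"}
    [ "$key_name" = "$pkg_name" ] || return 1
    # No version given: any version of the package matches.
    [ -z "$pkg_version" ] || [ "$key_version" = "$pkg_version" ]
}
```

With these definitions, `key_matches "pkg1|1.0.0" pkg1` and `key_matches "pkg1|1.0.0" pkg1:1.0.0` succeed, while `key_matches "pkg1|1.0.0" pkg1:2.0.0` fails, which corresponds to a node being left untouched and reported.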

### Update-State Command

Edit the recorded state of a single package on Skyhook-managed nodes.
`update-state` performs a surgical edit to the per-node
`skyhook.nvidia.com/nodeState_<skyhook>` annotation — replacing (or, with
`--add`, inserting) one entry in the `map[string]PackageStatus` value. It is
an **administrator escape hatch** for recovering from stuck rollouts and
deliberately does *not* validate that the requested `(stage, state)`
combination is one the operator could legally produce, nor does it gate
destructive stages (`uninstall`, `uninstall-interrupt`) behind extra
prompts.

```bash
kubectl skyhook update-state <skyhook-name> <package> <version> <stage> <state>
```

> **⚠️ Pause the Skyhook before running `update-state`.**
>
> `update-state` performs a read-modify-write on the node-state annotation and
> uses a merge patch with no resource-version check. If the operator
> reconciles the node between the CLI's read and write, the CLI's patch can
> clobber the operator's update, and a later reconcile can overwrite the
> manual edit: at best the operation is wasted, at worst the node is left in
> an inconsistent state. Always pause the Skyhook
> (`kubectl skyhook pause <name> --confirm`; on operators older than v0.8.0,
> set `spec.pause: true` instead) before running this command, and resume
> only when finished.

```bash
# Mark pkg1@1.0 as complete on every node that already tracks this Skyhook
kubectl skyhook update-state gpu-init pkg1 1.0 config complete --confirm

# Same, but only on one node
kubectl skyhook update-state gpu-init pkg1 1.0 config complete --node worker-1 --confirm

# Narrow to a label-selected set of nodes
kubectl skyhook update-state gpu-init pkg1 1.0 interrupt in_progress -l role=gpu --confirm

# Preview the changes without writing them
kubectl skyhook update-state gpu-init pkg1 1.0 config erroring --dry-run
```

| Flag | Description |
|------|-------------|
| `--node` | Limit the update to specific node(s); repeat for multiple |
| `--selector, -l` | Limit the update to nodes matching a label selector |
| `--confirm, -y` | Skip confirmation prompt |
| `--add` | Create a fresh `nodeState` entry on nodes that do not yet have one for this package; requires `--node` or `--selector` |

The global `--dry-run` flag is honored: the command prints the set of nodes
it would patch and exits without writing.

By default `update-state` targets only nodes that already have a `nodeState`
entry for the named Skyhook. If `--node` names a node that does not exist
or has no state for the Skyhook, the command warns and skips that node
rather than failing.
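
To check what a node currently records before or after an edit, the annotation can be read with `kubectl` directly. The sketch below builds the jsonpath expression (dots in the annotation key have to be escaped) and wraps the lookup. `nodestate_jsonpath` and `show_node_state` are hypothetical helper names, and `show_node_state` needs `kubectl` with access to the cluster.

```bash
# Hypothetical helpers (not part of the Skyhook CLI).

# Print the kubectl jsonpath expression for a Skyhook's nodeState annotation.
#   $1 = Skyhook name
nodestate_jsonpath() {
    printf '{.metadata.annotations.skyhook\\.nvidia\\.com/nodeState_%s}\n' "$1"
}

# Print the raw nodeState JSON recorded on one node.
#   $1 = node name, $2 = Skyhook name
show_node_state() {
    kubectl get node "$1" -o jsonpath="$(nodestate_jsonpath "$2")"
}
```

`show_node_state worker-1 gpu-init` prints the annotation value as a JSON string (empty if the node has no state for that Skyhook); piping it through `jq .` makes it easier to read when `jq` is installed.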

#### `--add`

`--add` creates a fresh `nodeState` entry on nodes that do not yet have one
for the given `<package>@<version>` — useful for bootstrapping state on a
node the operator has not visited yet, or for re-creating an entry that was
manually deleted.

`--add` **requires** either `--node` or `--selector` so the scope of the
creation is explicit. Without one of these flags, `--add` would apply to
every node in the cluster that matches the Skyhook's selector, which is
far too broad for a creation operation, so the CLI rejects `--add` on its
own with an error.

If a targeted node already has an entry for `<package>@<version>`, `--add`
warns and skips that node (use `update-state` without `--add` if you intend
to overwrite the existing entry).
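
The `--add` rules in this section can be summarised as a small decision function. It is only an illustration of the documented behavior, not the CLI's code; `add_action` and its `yes`/`no` arguments are hypothetical.

```bash
# Hypothetical summary of what --add does for one targeted node.
#   $1 = "yes" if --node or --selector was given, otherwise "no"
#   $2 = "yes" if the node already has an entry for <package>@<version>
add_action() {
    if [ "$1" != "yes" ]; then
        echo "error"    # --add without --node/--selector is rejected
    elif [ "$2" = "yes" ]; then
        echo "skip"     # existing entry: warn and leave it untouched
    else
        echo "create"   # no entry yet: insert a fresh one
    fi
}
```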

### Deployment Policy Commands

Manage deployment policy batch state.
@@ -332,6 +437,32 @@ kubectl skyhook deployment-policy reset my-skyhook --confirm
kubectl skyhook reset my-skyhook --skip-batch-reset --confirm
```

### Surgical Recovery

When a single package on a single node is wedged and a full reset is too
disruptive, pause the Skyhook, edit the recorded state directly, then
resume:

```bash
# 1. Pause so the operator doesn't clobber the manual edit
kubectl skyhook pause my-skyhook --confirm

# 2. Mark the wedged package as complete (or set whatever stage/state
# you need to unblock the rollout)
kubectl skyhook update-state my-skyhook my-package 1.0 config complete \
--node worker-1 --confirm

# 3. Resume processing
kubectl skyhook resume my-skyhook --confirm
```
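
If this sequence is run often, it can be wrapped so that the Skyhook is resumed even when the edit fails. The function below is a sketch: `surgical_update` is a hypothetical name, it only uses the `pause`, `update-state`, and `resume` invocations shown above, and resuming after a failed edit is a design choice you may not want in every situation.

```bash
# Hypothetical wrapper: pause, apply one update-state edit, then resume.
# Usage: surgical_update <skyhook> <package> <version> <stage> <state> <node>
surgical_update() {
    skyhook=$1
    shift
    # Stop early if the Skyhook cannot be paused; editing unpaused is racy.
    kubectl skyhook pause "$skyhook" --confirm || return 1
    status=0
    kubectl skyhook update-state "$skyhook" "$1" "$2" "$3" "$4" \
        --node "$5" --confirm || status=$?
    # Resume even when the edit failed, so the Skyhook is not left paused.
    kubectl skyhook resume "$skyhook" --confirm || return 1
    # Report the exit status of the update-state step.
    return "$status"
}
```

`surgical_update my-skyhook my-package 1.0 config complete worker-1` performs the three steps above and returns the exit status of the `update-state` call.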

For a single-package reset across all nodes (without disturbing the
deployment policy batch state), prefer `reset --package`:

```bash
kubectl skyhook reset my-skyhook --package my-package:1.0 --confirm
```

### Emergency Stop

> **Note:** Requires operator v0.8.0+. For older operators, use `kubectl edit skyhook my-skyhook` and set `spec.pause: true`.
2 changes: 2 additions & 0 deletions k8s-tests/chainsaw/cli/deployment-policy/chainsaw-test.yaml
@@ -19,6 +19,8 @@ apiVersion: chainsaw.kyverno.io/v1alpha1
kind: Test
metadata:
  name: cli-deployment-policy-reset
  labels:
    pool: cli
spec:
  timeouts:
    assert: 150s
2 changes: 2 additions & 0 deletions k8s-tests/chainsaw/cli/lifecycle/chainsaw-test.yaml
@@ -19,6 +19,8 @@ apiVersion: chainsaw.kyverno.io/v1alpha1
kind: Test
metadata:
  name: cli-lifecycle
  labels:
    pool: cli
spec:
  timeouts:
    assert: 60s
2 changes: 2 additions & 0 deletions k8s-tests/chainsaw/cli/node/chainsaw-test.yaml
@@ -19,6 +19,8 @@ apiVersion: chainsaw.kyverno.io/v1alpha1
kind: Test
metadata:
  name: cli-node
  labels:
    pool: cli
spec:
  timeouts:
    assert: 120s
2 changes: 2 additions & 0 deletions k8s-tests/chainsaw/cli/package/chainsaw-test.yaml
@@ -19,6 +19,8 @@ apiVersion: chainsaw.kyverno.io/v1alpha1
kind: Test
metadata:
  name: cli-package
  labels:
    pool: cli
spec:
  timeouts:
    assert: 120s
@@ -0,0 +1,22 @@
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: skyhook.nvidia.com/v1alpha1
kind: Skyhook
metadata:
  name: cli-reset-package-test
status:
  status: complete
@@ -0,0 +1,22 @@
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: skyhook.nvidia.com/v1alpha1
kind: Skyhook
metadata:
  name: cli-reset-package-test
  annotations:
    skyhook.nvidia.com/disable: "true"
105 changes: 105 additions & 0 deletions k8s-tests/chainsaw/cli/reset-package/chainsaw-test.yaml
@@ -0,0 +1,105 @@
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# yaml-language-server: $schema=https://raw.githubusercontent.com/kyverno/chainsaw/main/.schemas/json/test-chainsaw-v1alpha1.json
apiVersion: chainsaw.kyverno.io/v1alpha1
kind: Test
metadata:
  name: cli-reset-package
  labels:
    pool: cli
Comment on lines +22 to +23

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use a mutation-profile pool value, not `cli`.

`pool: cli` is not one of the supported Chainsaw pool values, so this test won't be classified by the expected mutation-profile routing. Please switch it to the correct pool for the behavior this scenario exercises.

As per coding guidelines: In Chainsaw e2e test YAMLs, the `pool` label (`core`, `interrupt`, `uninstall`, `lifecycle`) must be assigned based on the cluster-state mutation/runtime mutation profile the test exercises.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s-tests/chainsaw/cli/reset-package/chainsaw-test.yaml` around lines 22 -
23, The YAML uses an unsupported pool label value "pool: cli"; update the label
under labels (the `pool` key) to one of the supported mutation-profile
values—core, interrupt, uninstall, or lifecycle—based on which cluster-state or
runtime mutation this test exercises (i.e., replace `pool: cli` with the correct
profile name).

spec:
  timeouts:
    assert: 240s
    exec: 90s
  steps:
    # Step 0: Reset state from previous runs
    - name: reset-state
      try:
        - script:
            timeout: 30s
            content: |
              ../skyhook-cli reset cli-reset-package-test --confirm 2>/dev/null || true
              echo "State reset complete"

    # Step 1: Create a two-package Skyhook and wait for both packages to complete
    - name: setup-skyhook
      try:
        - apply:
            file: skyhook.yaml
        - assert:
            file: assert-skyhook-complete.yaml

    # Step 2: Disable the Skyhook so the operator won't immediately re-apply pkg1
    # after we remove its state.
    - name: disable-skyhook
      try:
        - script:
            timeout: 30s
            content: |
              ../skyhook-cli disable cli-reset-package-test --confirm
        - assert:
            file: assert-skyhook-disabled.yaml

    # Step 3: Selectively reset only pkg1. --skip-batch-reset matches the
    # implementation contract for --package (batch state is not reset for a
    # one-package recovery).
    - name: test-reset-package
      try:
        - script:
            timeout: 30s
            content: |
              ../skyhook-cli reset cli-reset-package-test --package pkg1 --confirm --skip-batch-reset

    # Step 4: Assert the per-node annotation no longer contains pkg1|1.0.0 but
    # still contains pkg2|2.0.0. The nodeState annotation is JSON-as-string, so
    # we use jq rather than a native chainsaw assert.
    - name: assert-only-pkg1-removed
      try:
        - script:
            timeout: 30s
            content: |
              set -eu
              nodes=$(kubectl get nodes -l skyhook.nvidia.com/test-node=skyhooke2e -o jsonpath='{.items[*].metadata.name}')
              if [ -z "$nodes" ]; then
                echo "no test nodes found"
                exit 1
              fi
              fail=0
              for node in $nodes; do
                raw=$(kubectl get node "$node" -o jsonpath='{.metadata.annotations.skyhook\.nvidia\.com/nodeState_cli-reset-package-test}')
                if [ -z "$raw" ]; then
                  echo "node $node: nodeState annotation missing (expected pkg2 to remain)"
                  fail=1
                  continue
                fi
                has_pkg1=$(printf '%s' "$raw" | jq 'has("pkg1|1.0.0")')
                has_pkg2=$(printf '%s' "$raw" | jq 'has("pkg2|2.0.0")')
                if [ "$has_pkg1" != "false" ]; then
                  echo "node $node: expected pkg1|1.0.0 to be removed, but it is still present"
                  echo "annotation: $raw"
                  fail=1
                  continue
                fi
                if [ "$has_pkg2" != "true" ]; then
                  echo "node $node: expected pkg2|2.0.0 to remain, but it is missing"
                  echo "annotation: $raw"
                  fail=1
                  continue
                fi
                echo "node $node: pkg1 removed, pkg2 retained OK"
              done
              exit $fail