Changes from 12 commits
17 commits
8a3a1ac
refactor(cli/utils): fix SetNodeAnnotation JSON quoting; extract Conf…
ayuskauskas May 27, 2026
47fb655
fix(cli/reset): surface list-nodes errors instead of silently no-op
ayuskauskas May 27, 2026
7fe97f1
docs(cli/utils): document ListNodesWithSkyhookState nil-map contract;…
ayuskauskas May 28, 2026
352a2ac
feat(cli): scaffold update-state command with --add flag
ayuskauskas May 28, 2026
f8028fd
feat(cli): implement update-state with validation, warnings, and --add
ayuskauskas May 28, 2026
0127a2e
fix(cli/update-state): surface list errors and align tests with reset…
ayuskauskas May 28, 2026
2159bdf
feat(cli): add --package flag to reset for targeted package state cle…
ayuskauskas May 28, 2026
78d67ee
test(e2e): chainsaw coverage for update-state
ayuskauskas May 28, 2026
e9fdcd4
test(e2e): chainsaw coverage for reset --package
ayuskauskas May 28, 2026
c6a22e4
docs(cli): document update-state and reset --package; warn about CR p…
ayuskauskas May 28, 2026
8473f6f
chore(cli): changelog for update-state, reset --package, and SetNodeA…
ayuskauskas May 28, 2026
f1efd9c
feat(cli): gate update-state and reset --package on operator version;…
ayuskauskas May 28, 2026
fb5e862
fix(cli): lower MinNodeStateSupportVersion to v0.7.5 — annotation sha…
ayuskauskas May 28, 2026
47b40da
refactor(cli): hoist CheckNodeStateOperatorVersion to utils; add pool…
ayuskauskas May 28, 2026
ce2dbc8
fix(e2e): chainsaw script preamble — drop pipefail (sh, not bash)
ayuskauskas May 28, 2026
5c7f7cb
fix(e2e): bump cli-reset-package assert timeout to 240s
ayuskauskas May 29, 2026
3b75547
test(e2e): skip config-skyhook glob update step per #245
ayuskauskas May 29, 2026
129 changes: 129 additions & 0 deletions docs/cli.md
@@ -25,12 +25,23 @@ The CLI requires **operator version v0.8.0 or later** for full functionality of
| `package rerun` | ✅ Full | ✅ Full |
| `package logs` | ✅ Full | ✅ Full |
| `reset` | ✅ Full | ✅ Full |
| `reset --package` | ⚠️ See note | ⚠️ See note |
| `update-state` | ⚠️ See note | ⚠️ See note |
| `deployment-policy reset` | ❌ Not supported | ✅ Full |
| `pause` | ❌ Not supported | ✅ Full |
| `resume` | ❌ Not supported | ✅ Full |
| `disable` | ❌ Not supported | ✅ Full |
| `enable` | ❌ Not supported | ✅ Full |

> **Note on `update-state` and `reset --package`:** These commands edit the
> `skyhook.nvidia.com/nodeState_<skyhook>` annotation in-place and rely on
> its current `map[string]PackageStatus` shape. They are safe with
> **operator v0.15.0 and later** (the version this CLI was developed
> against). Older operators may use a different annotation shape — verify
> the annotation format on your operator version before running these
> commands. If the CLI cannot parse the annotation on a node it surfaces
> a clear per-node error and skips the node.
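
One way to verify the annotation shape before running these commands is to print the entry keys recorded on a node. This is a minimal sketch, assuming `kubectl` and `jq` are installed; both function names are hypothetical and not part of the CLI, and the jsonpath expression mirrors the one used in this PR's e2e test.

```bash
# Hypothetical helpers for inspecting the nodeState annotation; not part of the CLI.

# Print the jsonpath expression for a Skyhook's per-node nodeState annotation.
node_state_jsonpath() {
  printf '{.metadata.annotations.skyhook\\.nvidia\\.com/nodeState_%s}' "$1"
}

# Print the entry keys of the annotation on one node (for example "pkg1|1.0.0").
# Prints nothing if the annotation is absent; jq fails if the value is not a JSON object.
show_node_state_keys() {
  node="$1"
  skyhook="$2"
  raw=$(kubectl get node "$node" -o jsonpath="$(node_state_jsonpath "$skyhook")") || return 1
  [ -n "$raw" ] || return 0
  printf '%s' "$raw" | jq -r 'keys[]'
}

# Usage (requires cluster access):
#   show_node_state_keys worker-1 gpu-init
```

If `jq` reports an error here, the annotation is probably not the JSON object these commands expect, and it is worth checking the operator version before going further.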

### Breaking Change: Pause/Disable Mechanism

In operator versions **v0.7.x and earlier**, pausing and disabling a Skyhook was done via spec fields:
@@ -142,15 +153,107 @@ kubectl skyhook reset gpu-init --dry-run

# Reset nodes only, preserve deployment policy batch state
kubectl skyhook reset gpu-init --skip-batch-reset --confirm

# Reset only a single package across all tracked nodes
kubectl skyhook reset gpu-init --package pkg1:1.0 --confirm --skip-batch-reset
```

| Flag | Description |
|------|-------------|
| `--confirm, -y` | Skip confirmation prompt |
| `--skip-batch-reset` | Skip resetting deployment policy batch state |
| `--package <name>[:<version>]` | Reset only this package's state on each node |

> **Note:** By default, `reset` also resets the deployment policy batch state so the next rollout starts from batch 1, and clears node ordering state (`NodeOrderOffset` and `NodePriority`) so `SKYHOOK_NODE_ORDER` restarts from `0`. Use `--skip-batch-reset` to preserve the existing batch and ordering state.

#### `--package <name>[:<version>]`

When `--package` is set, `reset` removes only the named package's entry from
each node's `nodeState` annotation instead of removing the whole annotation:

- If `<version>` is supplied (e.g. `pkg1:1.0`), only entries whose recorded
  version matches are removed; nodes with the package at a different version
  are left untouched and reported in the command output.
- If the package was the last entry in the annotation, the entire annotation
  is removed (matching the behavior of a full reset).
- Batch state is **deliberately not reset** on this path regardless of
  `--skip-batch-reset`: restarting the full rollout from batch 1 to recover a
  single package would be disproportionate. Pair with
  `kubectl skyhook deployment-policy reset` explicitly if you also want to
  restart batches.
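
The matching rule above can be sketched in plain shell. This is an illustration only, not the CLI's implementation: the function names are hypothetical, the `name|version` key format is taken from the e2e test in this PR, and exact string comparison of versions is an assumption.

```bash
# Hypothetical sketch of --package matching; the CLI's real parsing may differ.

# Split "<name>[:<version>]" and print "<name>|<version>" (version may be empty).
split_package_arg() {
  arg="$1"
  case "$arg" in
    *:*) printf '%s|%s\n' "${arg%%:*}" "${arg#*:}" ;;
    *)   printf '%s|\n' "$arg" ;;
  esac
}

# Succeed if an annotation key such as "pkg1|1.0.0" is selected by the argument.
key_matches_package() {
  key="$1"
  arg="$2"
  want=$(split_package_arg "$arg")
  want_name="${want%%|*}"
  want_version="${want#*|}"
  key_name="${key%%|*}"
  key_version="${key#*|}"
  [ "$key_name" = "$want_name" ] || return 1
  # An omitted version selects the package at any recorded version.
  [ -z "$want_version" ] || [ "$key_version" = "$want_version" ]
}

# Usage:
#   key_matches_package "pkg1|1.0.0" "pkg1:1.0.0" && echo "would be removed"
```

Under the exact-comparison assumption, `pkg1:1.0` would not select an entry recorded as `pkg1|1.0.0`, so it may be safest to pass the version exactly as it appears in the annotation.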

### Update-State Command

Edit the recorded state of a single package on Skyhook-managed nodes.
`update-state` performs a surgical edit to the per-node
`skyhook.nvidia.com/nodeState_<skyhook>` annotation — replacing (or, with
`--add`, inserting) one entry in the `map[string]PackageStatus` value. It is
an **administrator escape hatch** for recovering from stuck rollouts and
deliberately does *not* validate that the requested `(stage, state)`
combination is one the operator could legally produce, nor does it gate
destructive stages (`uninstall`, `uninstall-interrupt`) behind extra
prompts.

```bash
kubectl skyhook update-state <skyhook-name> <package> <version> <stage> <state>
```

> **⚠️ Pause the Skyhook before running `update-state`.**
>
> `update-state` performs a read-modify-write on the node-state annotation and
> uses a merge patch with no resource-version check. If the operator
> reconciles the node between the CLI's read and write, the operator can
> immediately overwrite the manual edit — at best wasting the operation, at
> worst racing the operator into an inconsistent state. Always pause the
> Skyhook (`kubectl skyhook pause <name> --confirm`) before running this
> command, and resume only when finished.
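
The pause, edit, resume sequence described above can be wrapped so that the Skyhook is resumed only when the edit succeeds. This is a minimal sketch: `with_paused_skyhook` is a hypothetical helper, not part of the CLI, it uses only the `pause` and `resume` commands documented here, and leaving the Skyhook paused after a failure is a design assumption.

```bash
# Hypothetical wrapper; not part of the CLI.
# Pauses the Skyhook, runs the given command, and resumes only on success.
with_paused_skyhook() {
  skyhook="$1"
  shift
  kubectl skyhook pause "$skyhook" --confirm || return 1
  if "$@"; then
    kubectl skyhook resume "$skyhook" --confirm
  else
    # Leave the Skyhook paused so the failed edit can be inspected first.
    echo "command failed; $skyhook left paused" >&2
    return 1
  fi
}

# Usage (requires cluster access):
#   with_paused_skyhook gpu-init \
#     kubectl skyhook update-state gpu-init pkg1 1.0 config complete --node worker-1 --confirm
```

Resuming automatically after a failed edit would let the operator reconcile against a half-applied state, which is why this sketch stops and reports instead.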

```bash
# Mark pkg1@1.0 as complete on every node that already tracks this Skyhook
kubectl skyhook update-state gpu-init pkg1 1.0 config complete --confirm

# Same, but only on one node
kubectl skyhook update-state gpu-init pkg1 1.0 config complete --node worker-1 --confirm

# Narrow to a label-selected set of nodes
kubectl skyhook update-state gpu-init pkg1 1.0 interrupt in_progress -l role=gpu --confirm

# Preview the changes without writing them
kubectl skyhook update-state gpu-init pkg1 1.0 config erroring --dry-run
```

| Flag | Description |
|------|-------------|
| `--node` | Limit the update to specific node(s); repeat for multiple |
| `--selector, -l` | Limit the update to nodes matching a label selector |
| `--confirm, -y` | Skip confirmation prompt |
| `--add` | Create a fresh `nodeState` entry on nodes that do not yet have one for this package; requires `--node` or `--selector` |

The global `--dry-run` flag is honored: the command prints the set of nodes
it would patch and exits without writing.

By default `update-state` targets only nodes that already have a `nodeState`
entry for the named Skyhook. If `--node` names a node that does not exist
or has no state for the Skyhook, the command warns and skips that node
rather than failing.

#### `--add`

`--add` creates a fresh `nodeState` entry on nodes that do not yet have one
for the given `<package>@<version>` — useful for bootstrapping state on a
node the operator has not visited yet, or for re-creating an entry that was
manually deleted.

`--add` **requires** either `--node` or `--selector` so the scope of the
creation is explicit. Without one of these flags, `--add` would apply to
every node in the cluster that matches the Skyhook's selector, which is
far too broad for a creation operation, so the CLI rejects `--add` on its
own with an error.

If a targeted node already has an entry for `<package>@<version>`, `--add`
warns and skips that node (use `update-state` without `--add` if you intend
to overwrite the existing entry).

### Deployment Policy Commands

Manage deployment policy batch state.
@@ -332,6 +435,32 @@ kubectl skyhook deployment-policy reset my-skyhook --confirm
kubectl skyhook reset my-skyhook --skip-batch-reset --confirm
```

### Surgical Recovery

When a single package on a single node is wedged and a full reset is too
disruptive, pause the Skyhook, edit the recorded state directly, then
resume:

```bash
# 1. Pause so the operator doesn't clobber the manual edit
kubectl skyhook pause my-skyhook --confirm

# 2. Mark the wedged package as complete (or set whatever stage/state
#    you need to unblock the rollout)
kubectl skyhook update-state my-skyhook my-package 1.0 config complete \
  --node worker-1 --confirm

# 3. Resume processing
kubectl skyhook resume my-skyhook --confirm
```

For a single-package reset across all nodes (without disturbing the
deployment policy batch state), prefer `reset --package`:

```bash
kubectl skyhook reset my-skyhook --package my-package:1.0 --confirm --skip-batch-reset
```

### Emergency Stop

> **Note:** Requires operator v0.8.0+. For older operators, use `kubectl edit skyhook my-skyhook` and set `spec.pause: true`.
@@ -0,0 +1,22 @@
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: skyhook.nvidia.com/v1alpha1
kind: Skyhook
metadata:
  name: cli-reset-package-test
status:
  status: complete
@@ -0,0 +1,22 @@
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: skyhook.nvidia.com/v1alpha1
kind: Skyhook
metadata:
  name: cli-reset-package-test
  annotations:
    skyhook.nvidia.com/disable: "true"
103 changes: 103 additions & 0 deletions k8s-tests/chainsaw/cli/reset-package/chainsaw-test.yaml
@@ -0,0 +1,103 @@
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# yaml-language-server: $schema=https://raw.githubusercontent.com/kyverno/chainsaw/main/.schemas/json/test-chainsaw-v1alpha1.json
apiVersion: chainsaw.kyverno.io/v1alpha1
kind: Test
metadata:
  name: cli-reset-package
spec:
  timeouts:
    assert: 120s
    exec: 90s
  steps:
    # Step 0: Reset state from previous runs
    - name: reset-state
      try:
        - script:
            timeout: 30s
            content: |
              ../skyhook-cli reset cli-reset-package-test --confirm 2>/dev/null || true
              echo "State reset complete"

    # Step 1: Create a two-package Skyhook and wait for both packages to complete
    - name: setup-skyhook
      try:
        - apply:
            file: skyhook.yaml
        - assert:
            file: assert-skyhook-complete.yaml

    # Step 2: Disable the Skyhook so the operator won't immediately re-apply pkg1
    # after we remove its state.
    - name: disable-skyhook
      try:
        - script:
            timeout: 30s
            content: |
              ../skyhook-cli disable cli-reset-package-test --confirm
        - assert:
            file: assert-skyhook-disabled.yaml

    # Step 3: Selectively reset only pkg1. --skip-batch-reset matches the
    # implementation contract for --package (batch state is not reset for a
    # one-package recovery).
    - name: test-reset-package
      try:
        - script:
            timeout: 30s
            content: |
              ../skyhook-cli reset cli-reset-package-test --package pkg1 --confirm --skip-batch-reset

    # Step 4: Assert the per-node annotation no longer contains pkg1|1.0.0 but
    # still contains pkg2|2.0.0. The nodeState annotation is JSON-as-string, so
    # we use jq rather than a native chainsaw assert.
    - name: assert-only-pkg1-removed
      try:
        - script:
            timeout: 30s
            content: |
              # No pipefail: the script runs under sh, not bash (see commit ce2dbc8).
              set -eu
              nodes=$(kubectl get nodes -l skyhook.nvidia.com/test-node=skyhooke2e -o jsonpath='{.items[*].metadata.name}')
              if [ -z "$nodes" ]; then
                echo "no test nodes found"
                exit 1
              fi
              fail=0
              for node in $nodes; do
                raw=$(kubectl get node "$node" -o jsonpath='{.metadata.annotations.skyhook\.nvidia\.com/nodeState_cli-reset-package-test}')
                if [ -z "$raw" ]; then
                  echo "node $node: nodeState annotation missing (expected pkg2 to remain)"
                  fail=1
                  continue
                fi
                has_pkg1=$(printf '%s' "$raw" | jq 'has("pkg1|1.0.0")')
                has_pkg2=$(printf '%s' "$raw" | jq 'has("pkg2|2.0.0")')
                if [ "$has_pkg1" != "false" ]; then
                  echo "node $node: expected pkg1|1.0.0 to be removed, but it is still present"
                  echo "annotation: $raw"
                  fail=1
                  continue
                fi
                if [ "$has_pkg2" != "true" ]; then
                  echo "node $node: expected pkg2|2.0.0 to remain, but it is missing"
                  echo "annotation: $raw"
                  fail=1
                  continue
                fi
                echo "node $node: pkg1 removed, pkg2 retained OK"
              done
              exit $fail
41 changes: 41 additions & 0 deletions k8s-tests/chainsaw/cli/reset-package/skyhook.yaml
@@ -0,0 +1,41 @@
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: skyhook.nvidia.com/v1alpha1
kind: Skyhook
metadata:
  name: cli-reset-package-test
spec:
  nodeSelectors:
    matchLabels:
      skyhook.nvidia.com/test-node: skyhooke2e
  packages:
    pkg1:
      version: "1.0.0"
      image: ghcr.io/nvidia/skyhook-packages/shellscript
⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use an allowed package image registry path.

ghcr.io/nvidia/skyhook-packages/shellscript does not match the allowed registry prefixes for Skyhook manifests.

As per coding guidelines, "Container images must be pulled from ghcr.io/nvidia/nodewright/* or ghcr.io/nvidia/skyhook/* registries as per current distribution".

Also applies to: 36-36

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@k8s-tests/chainsaw/cli/reset-package/skyhook.yaml` at line 28, The image
reference "ghcr.io/nvidia/skyhook-packages/shellscript" in the skyhook.yaml
manifest is not an allowed registry prefix; update the image fields (the
occurrences at the image lines, currently set to
ghcr.io/nvidia/skyhook-packages/shellscript) to use an approved prefix such as
ghcr.io/nvidia/skyhook/ or ghcr.io/nvidia/nodewright/ (e.g., move the package
under ghcr.io/nvidia/skyhook/<repo> or ghcr.io/nvidia/nodewright/<repo>),
ensuring all occurrences (including the one noted also at the other image line)
are changed to match the allowed registry prefixes.

      configMap:
        apply.sh: |
          #!/bin/bash
          echo "reset-package test pkg1 apply"
          sleep 2
    pkg2:
      version: "2.0.0"
      image: ghcr.io/nvidia/skyhook-packages/shellscript
      configMap:
        apply.sh: |
          #!/bin/bash
          echo "reset-package test pkg2 apply"
          sleep 2
@@ -0,0 +1,22 @@
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: skyhook.nvidia.com/v1alpha1
kind: Skyhook
metadata:
  name: cli-update-state-reject-test
status:
  status: complete