kubectl plugin for managing Skyhook deployments, packages, and nodes.
The Skyhook CLI (kubectl skyhook) provides SRE tooling for managing Skyhook operators and their packages across Kubernetes cluster nodes. It supports inspecting node/package state, forcing re-runs, managing node lifecycle, and retrieving logs.
The CLI requires operator version v0.8.0 or later for full functionality of all commands.
| Command | v0.7.x and earlier | v0.8.0+ |
|---|---|---|
| version | ✅ Full | ✅ Full |
| node status | ✅ Full | ✅ Full |
| node list | ✅ Full | ✅ Full |
| node reset | ✅ Full | ✅ Full |
| node ignore/unignore | ✅ Full | ✅ Full |
| package status | ✅ Full | ✅ Full |
| package rerun | ✅ Full | ✅ Full |
| package logs | ✅ Full | ✅ Full |
| reset | ✅ Full | ✅ Full |
| reset --package | ✅ Full (v0.7.5+) | ✅ Full |
| update-state | ✅ Full (v0.7.5+; see note) | ✅ Full (see note) |
| deployment-policy reset | ❌ Not supported | ✅ Full |
| pause | ❌ Not supported | ✅ Full |
| resume | ❌ Not supported | ✅ Full |
| disable | ❌ Not supported | ✅ Full |
| enable | ❌ Not supported | ✅ Full |
Note on update-state and reset --package: These commands edit the skyhook.nvidia.com/nodeState_<skyhook> annotation in place. The annotation's map[string]PackageStatus shape has been stable since operator v0.7.5, and the CLI refuses to run against anything older. What has evolved across operator releases is the set of recognised stage and state values: for example, uninstall and uninstall-interrupt were added in v0.16.0. Picking a stage/state your operator doesn't recognise will leave the package in a state the operator can't progress from. Confirm the stage/state values are valid for your operator before running these commands.
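The version thresholds in the compatibility table can be checked in a script before a command is run. The following is a minimal sketch, not part of the CLI: `version_ge`, `min_operator_version`, and `supported` are hypothetical helper names, the `reset-package` key is an illustrative stand-in for `reset --package`, and the operator version is assumed to be supplied by the caller as a plain `vMAJOR.MINOR.PATCH` string.

```shell
# version_ge A B: succeed when version A is at least version B.
# Handles plain vMAJOR.MINOR.PATCH strings only; pre-release suffixes such as
# "-rc1" are out of scope for this sketch. Uses global variables (POSIX sh).
version_ge() {
  a=${1#v}; b=${2#v}
  for i in 1 2 3; do
    x=${a%%.*}; y=${b%%.*}
    [ "$x" -gt "$y" ] && return 0
    [ "$x" -lt "$y" ] && return 1
    # Drop the leading component; once none is left, compare as 0.
    case $a in *.*) a=${a#*.} ;; *) a=0 ;; esac
    case $b in *.*) b=${b#*.} ;; *) b=0 ;; esac
  done
  return 0
}

# min_operator_version COMMAND: print the oldest operator version the
# compatibility table lists for COMMAND (v0.0.0 when no minimum is listed).
min_operator_version() {
  case $1 in
    pause|resume|disable|enable|deployment-policy) echo v0.8.0 ;;
    update-state|reset-package)                    echo v0.7.5 ;;
    *)                                             echo v0.0.0 ;;
  esac
}

# supported COMMAND OPERATOR_VERSION: succeed when the table marks COMMAND
# as fully supported on OPERATOR_VERSION.
supported() {
  version_ge "$2" "$(min_operator_version "$1")"
}

# Example:
#   supported pause v0.7.9 || echo "pause needs operator v0.8.0 or later"
```

The comparison is numeric per component, so v0.16.0 correctly sorts after v0.8.0, which a plain string comparison would get wrong.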
In operator versions v0.7.x and earlier, pausing and disabling a Skyhook was done via spec fields:
spec:
  pause: true # Old method - no longer used by operator

Starting with v0.8.0, the operator uses annotations instead:

metadata:
  annotations:
    skyhook.nvidia.com/pause: "true"
    skyhook.nvidia.com/disable: "true"

The CLI's pause, resume, disable, and enable commands set these annotations. If you're running an older operator (v0.7.x or earlier), these commands will appear to succeed but the operator won't recognize the annotations; you'll need to edit the Skyhook spec directly using kubectl edit.
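For operators older than v0.8.0, the manual fallback can be scripted instead of done in an editor. This is a minimal sketch under two assumptions: that the resource is addressable as `skyhook` (as in `kubectl edit skyhook` elsewhere in this document), and that the API server accepts a JSON merge patch on `spec.pause`. Both function names are hypothetical. The annotation variant only mirrors what the CLI's pause command is described as doing and is not taken from its source.

```shell
# pause_legacy NAME: pause a Skyhook on operator v0.7.x or earlier by setting
# spec.pause directly, because those operators ignore the pause annotation.
pause_legacy() {
  kubectl patch skyhook "$1" --type merge -p '{"spec":{"pause":true}}'
}

# pause_annotation NAME: set the annotation that operator v0.8.0+ reads.
# Shown for comparison; on v0.8.0+ prefer `kubectl skyhook pause NAME`.
pause_annotation() {
  kubectl annotate skyhook "$1" skyhook.nvidia.com/pause=true --overwrite
}

# Example:
#   pause_legacy my-skyhook
```

If the Skyhook resource is namespaced in your installation, add the appropriate `-n` flag to both calls.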
# Build from source
make build-cli
# Install as kubectl plugin
cp bin/kubectl-skyhook /usr/local/bin/
# Verify installation
kubectl skyhook version

kubectl skyhook [global-flags] <command> [subcommand] [flags] [arguments]

- -h, --help - Show help for any command
- --version - Show version information
- -n, --namespace - Kubernetes namespace (default: "skyhook")
- -o, --output - Output format: table|json|yaml|wide
- -v, --verbose - Enable verbose output
- --dry-run - Preview changes without applying them
- --kubeconfig - Path to kubeconfig file
Show plugin and operator versions.
# Show both plugin and operator versions
kubectl skyhook version
# Show only plugin version (no cluster query)
kubectl skyhook version --client-only
# With custom timeout
kubectl skyhook version --timeout 10s

Control Skyhook processing state.
Note: Requires operator v0.8.0+. See Compatibility for details.
# Pause a Skyhook (stops processing new nodes)
kubectl skyhook pause my-skyhook
kubectl skyhook pause my-skyhook --confirm # Skip confirmation
# Resume a paused Skyhook
kubectl skyhook resume my-skyhook

Completely disable or re-enable a Skyhook.
Note: Requires operator v0.8.0+. See Compatibility for details.
# Disable a Skyhook completely
kubectl skyhook disable my-skyhook
kubectl skyhook disable my-skyhook --confirm
# Re-enable a disabled Skyhook
kubectl skyhook enable my-skyhook

Reset all package state for a Skyhook, causing re-execution from the beginning.
# Reset all nodes for a Skyhook (also resets batch state by default)
kubectl skyhook reset gpu-init --confirm
# Preview changes without applying (dry-run)
kubectl skyhook reset gpu-init --dry-run
# Reset nodes only, preserve deployment policy batch state
kubectl skyhook reset gpu-init --skip-batch-reset --confirm
# Reset only a single package across all tracked nodes
kubectl skyhook reset gpu-init --package pkg1:1.0 --confirm --skip-batch-reset

| Flag | Description |
|---|---|
| --confirm, -y | Skip confirmation prompt |
| --skip-batch-reset | Skip resetting deployment policy batch state |
| --package <name>[:<version>] | Reset only this package's state on each node |
Note: By default, reset also resets the deployment policy batch state so the next rollout starts from batch 1, and clears node ordering state (NodeOrderOffset and NodePriority) so SKYHOOK_NODE_ORDER restarts from 0. Use --skip-batch-reset to preserve the existing batch and ordering state.

When --package is set, reset removes only the named package's entry from each node's nodeState annotation instead of removing the whole annotation:
- If <version> is supplied (e.g. pkg1:1.0), only entries whose recorded version matches are removed; nodes with the package at a different version are left untouched and reported in the command output.
- If the package was the last entry in the annotation, the entire annotation is removed (matching the behavior of a full reset).
- Batch state is deliberately not reset on this path regardless of --skip-batch-reset: restarting the full rollout from batch 1 to recover a single package would be disproportionate. Pair with kubectl skyhook deployment-policy reset explicitly if you also want to restart batches.
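The `<name>[:<version>]` matching rule can be illustrated with a small sketch. It is not the CLI's implementation: `split_package_ref` and `entry_matches` are hypothetical helpers, and the sketch covers only the decision of whether a node's recorded entry would be removed, not the annotation edit itself.

```shell
# split_package_ref REF: split "<name>[:<version>]" into the global variables
# pkg_name and pkg_version (pkg_version is empty when no version was given).
split_package_ref() {
  pkg_name=${1%%:*}
  case $1 in
    *:*) pkg_version=${1#*:} ;;
    *)   pkg_version= ;;
  esac
}

# entry_matches RECORDED_VERSION: succeed when an entry recorded at
# RECORDED_VERSION falls within the requested reset. With no requested
# version every entry for the package matches; otherwise versions must agree.
entry_matches() {
  [ -z "$pkg_version" ] || [ "$1" = "$pkg_version" ]
}

# Example:
#   split_package_ref pkg1:1.0
#   entry_matches 2.0 || echo "node left untouched (different version)"
```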
Edit the recorded state of a single package on Skyhook-managed nodes.
update-state performs a surgical edit to the per-node
skyhook.nvidia.com/nodeState_<skyhook> annotation — replacing (or, with
--add, inserting) one entry in the map[string]PackageStatus value. It is
an administrator escape hatch for recovering from stuck rollouts and
deliberately does not validate that the requested (stage, state)
combination is one the operator could legally produce, nor does it gate
destructive stages (uninstall, uninstall-interrupt) behind extra
prompts.
kubectl skyhook update-state <skyhook-name> <package> <version> <stage> <state>
⚠️ Pause the Skyhook before running update-state.
update-state performs a read-modify-write on the node-state annotation and uses a merge patch with no resource-version check. If the operator reconciles the node between the CLI's read and write, the operator can immediately overwrite the manual edit: at best this wastes the operation, at worst it races the operator into an inconsistent state. Always pause the Skyhook (kubectl skyhook pause <name> --confirm) before running this command, and resume only when finished.
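The pause, edit, resume sequence can be wrapped so the Skyhook is not left paused when the edit fails. This is a minimal sketch: `with_paused_skyhook` is a hypothetical helper, and it uses only the pause, resume, and update-state commands documented here.

```shell
# with_paused_skyhook NAME COMMAND...: pause the Skyhook, run COMMAND, then
# resume. Returns COMMAND's exit status so callers can detect a failed edit.
with_paused_skyhook() {
  name=$1; shift
  kubectl skyhook pause "$name" --confirm || return 1
  status=0
  "$@" || status=$?
  # Resume even when the edit failed, so the Skyhook is not left paused.
  kubectl skyhook resume "$name" --confirm || return 1
  return "$status"
}

# Example:
#   with_paused_skyhook gpu-init \
#     kubectl skyhook update-state gpu-init pkg1 1.0 config complete \
#     --node worker-1 --confirm
```

Resuming after a failed edit is a design choice. If you would rather inspect the node state while the Skyhook is still paused, return early instead of resuming.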
# Mark pkg1@1.0 as complete on every node that already tracks this Skyhook
kubectl skyhook update-state gpu-init pkg1 1.0 config complete --confirm
# Same, but only on one node
kubectl skyhook update-state gpu-init pkg1 1.0 config complete --node worker-1 --confirm
# Narrow to a label-selected set of nodes
kubectl skyhook update-state gpu-init pkg1 1.0 interrupt in_progress -l role=gpu --confirm
# Preview the changes without writing them
kubectl skyhook update-state gpu-init pkg1 1.0 config erroring --dry-run

| Flag | Description |
|---|---|
| --node | Limit the update to specific node(s); repeat for multiple |
| --selector, -l | Limit the update to nodes matching a label selector |
| --confirm, -y | Skip confirmation prompt |
| --add | Create a fresh nodeState entry on nodes that do not yet have one for this package; requires --node or --selector |
The global --dry-run flag is honored: the command prints the set of nodes
it would patch and exits without writing.
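Because `--dry-run` prints the affected nodes without writing, it pairs naturally with a preview-then-apply step. This is a minimal sketch with a hypothetical helper name, `preview_then_apply`. It assumes the wrapped command accepts both `--dry-run` and `--confirm` as trailing flags, as the examples in this document show for reset and update-state.

```shell
# preview_then_apply COMMAND...: run COMMAND with --dry-run, ask for
# confirmation on stdin, then run it again with --confirm.
preview_then_apply() {
  "$@" --dry-run || return 1
  printf 'Apply these changes? [y/N] '
  read -r answer || return 1
  case $answer in
    y|Y) "$@" --confirm ;;
    *)   echo "aborted" >&2; return 1 ;;
  esac
}

# Example:
#   preview_then_apply kubectl skyhook update-state gpu-init pkg1 1.0 \
#     config complete --node worker-1
```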
By default update-state targets only nodes that already have a nodeState
entry for the named Skyhook. If --node names a node that does not exist
or has no state for the Skyhook, the command warns and skips that node
rather than failing.
--add creates a fresh nodeState entry on nodes that do not yet have one
for the given <package>@<version> — useful for bootstrapping state on a
node the operator has not visited yet, or for re-creating an entry that was
manually deleted.
--add requires either --node or --selector so the scope of the
creation is explicit. Without one of these flags, --add would apply to
every node in the cluster that matches the Skyhook's selector, which is
far too broad for a creation operation — the CLI rejects this combination
with an error.
If a targeted node already has an entry for <package>@<version>, --add
warns and skips that node (use update-state without --add if you intend
to overwrite the existing entry).
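The scoping rule for `--add` amounts to a simple precondition. The sketch below restates it as a standalone check and is not the CLI's code: `check_add_scope` is a hypothetical helper, and the error text is illustrative rather than the CLI's actual message.

```shell
# check_add_scope ADD NODES SELECTOR: fail when ADD is "true" but neither a
# node list nor a label selector narrows the set of nodes to create entries on.
check_add_scope() {
  if [ "$1" = true ] && [ -z "$2" ] && [ -z "$3" ]; then
    echo "error: --add requires --node or --selector" >&2
    return 1
  fi
  return 0
}

# Example:
#   check_add_scope true "" "role=gpu" && echo "scope is explicit"
```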
Manage deployment policy batch state.
Note: Requires operator v0.8.0+.
# Reset batch state for a Skyhook (starts rollout from batch 1)
kubectl skyhook deployment-policy reset gpu-init --confirm
# Preview what would be reset (dry-run)
kubectl skyhook deployment-policy reset gpu-init --dry-run
# Using the short alias
kubectl skyhook dp reset gpu-init --confirm

The deployment-policy reset command resets the batch processing state for all compartments in the specified Skyhook, including:
- Current batch number (reset to 1)
- Consecutive failure count
- Completed and failed node counts
- Stop flag
- Node ordering state (NodeOrderOffset and NodePriority), so SKYHOOK_NODE_ORDER restarts from 0

| Flag | Description |
|---|---|
| --confirm, -y | Skip confirmation prompt |
When to use:
- After a rollout completes and you want to start a new rollout fresh
- When batch processing is stuck and needs to be reset
- Before re-running a rollout with the same deployment policy
See Deployment Policy documentation for details on auto-reset configuration.
Manage Skyhook nodes across the cluster.
# List all nodes targeted by a Skyhook
kubectl skyhook node list --skyhook my-skyhook
kubectl skyhook node list --skyhook my-skyhook -o json
# Show all Skyhook activity on specific node(s)
kubectl skyhook node status worker-1
kubectl skyhook node status worker-1 worker-2
kubectl skyhook node status "worker-.*" # Regex pattern
kubectl skyhook node status worker-1 --skyhook my-skyhook # Filter by Skyhook
# Reset all package state on node(s)
kubectl skyhook node reset worker-1 --skyhook my-skyhook --confirm
kubectl skyhook node reset "node-.*" --skyhook my-skyhook --dry-run
# Ignore/unignore nodes from processing
kubectl skyhook node ignore worker-1
kubectl skyhook node ignore "test-node-.*"
kubectl skyhook node unignore worker-1

| Command | Flag | Description |
|---|---|---|
| list | --skyhook | Skyhook name (required) |
| status | --skyhook | Filter by Skyhook name |
| reset | --skyhook | Skyhook name (required) |
| reset | --confirm, -y | Skip confirmation prompt |
Manage Skyhook packages.
# Query package status across nodes
kubectl skyhook package status my-package --skyhook my-skyhook
kubectl skyhook package status my-package --skyhook my-skyhook --node worker-1
kubectl skyhook package status my-package --skyhook my-skyhook -o wide
# Force package re-run on specific nodes
kubectl skyhook package rerun my-package --skyhook my-skyhook --node worker-1
kubectl skyhook package rerun my-package --skyhook my-skyhook --node "worker-.*" --confirm
kubectl skyhook package rerun my-package --skyhook my-skyhook --node worker-1 --stage config
# Get package logs
kubectl skyhook package logs my-package --skyhook my-skyhook
kubectl skyhook package logs my-package --skyhook my-skyhook --node worker-1
kubectl skyhook package logs my-package --skyhook my-skyhook --stage apply
kubectl skyhook package logs my-package --skyhook my-skyhook -f # Follow
kubectl skyhook package logs my-package --skyhook my-skyhook --tail 100

| Command | Flag | Description |
|---|---|---|
| status | --skyhook | Skyhook name (required) |
| status | --node | Filter by node pattern (repeatable) |
| rerun | --skyhook | Skyhook name (required) |
| rerun | --node | Node pattern (required, repeatable) |
| rerun | --stage | Re-run from stage: apply, config, interrupt, post-interrupt |
| rerun | --confirm, -y | Skip confirmation prompt |
| logs | --skyhook | Skyhook name (required) |
| logs | --node | Filter by node name |
| logs | --stage | Filter by stage |
| logs | -f, --follow | Follow log output |
| logs | --tail | Lines from end (-1 for all) |
# General help
kubectl skyhook --help
# Command group help
kubectl skyhook node --help
kubectl skyhook package --help
kubectl skyhook deployment-policy --help
# Specific command help
kubectl skyhook node reset --help
kubectl skyhook package rerun --help
kubectl skyhook deployment-policy reset --help

# 1. Check package status
kubectl skyhook package status my-package --skyhook my-skyhook -o wide
# 2. View logs for the failed package
kubectl skyhook package logs my-package --skyhook my-skyhook --node worker-1
# 3. Fix the issue, then force re-run
kubectl skyhook package rerun my-package --skyhook my-skyhook --node worker-1 --confirm

# 1. Ignore node before maintenance
kubectl skyhook node ignore worker-1
# 2. Perform maintenance...
# 3. Unignore and reset to re-run all packages
kubectl skyhook node unignore worker-1
kubectl skyhook node reset worker-1 --skyhook my-skyhook --confirm

# List all nodes for a Skyhook
kubectl skyhook node list --skyhook my-skyhook
# Check status of all nodes
kubectl skyhook node status
# Check specific Skyhook across all nodes
kubectl skyhook node status --skyhook my-skyhook -o json

# 1. Full reset: nodes + batch state (starts everything fresh)
kubectl skyhook reset my-skyhook --confirm
# 2. Or reset only batch state (keep node state, restart batch progression)
kubectl skyhook deployment-policy reset my-skyhook --confirm
# 3. Or reset only nodes (keep batch progression)
kubectl skyhook reset my-skyhook --skip-batch-reset --confirm

When a single package on a single node is wedged and a full reset is too disruptive, pause the Skyhook, edit the recorded state directly, then resume:
# 1. Pause so the operator doesn't clobber the manual edit
kubectl skyhook pause my-skyhook --confirm
# 2. Mark the wedged package as complete (or set whatever stage/state
# you need to unblock the rollout)
kubectl skyhook update-state my-skyhook my-package 1.0 config complete \
--node worker-1 --confirm
# 3. Resume processing
kubectl skyhook resume my-skyhook --confirm

For a single-package reset across all nodes (without disturbing the deployment policy batch state), prefer reset --package:

kubectl skyhook reset my-skyhook --package my-package:1.0 --confirm

Note: Requires operator v0.8.0+. For older operators, use kubectl edit skyhook my-skyhook and set spec.pause: true.
# Pause all processing
kubectl skyhook pause my-skyhook --confirm
# Or disable completely
kubectl skyhook disable my-skyhook --confirm

All status commands support multiple output formats:
# Table (default) - human-readable
kubectl skyhook node list --skyhook my-skyhook
# Wide - table with additional columns
kubectl skyhook node list --skyhook my-skyhook -o wide
# JSON - machine-readable
kubectl skyhook node list --skyhook my-skyhook -o json
# YAML - machine-readable
kubectl skyhook node list --skyhook my-skyhook -o yaml

operator/cmd/cli/app/ # CLI commands
├── cli.go # Root command (NewSkyhookCommand)
├── version.go # Version command
├── reset.go # Reset command (nodes + batch state)
├── lifecycle.go # Pause, resume, disable, enable commands
├── deploymentpolicy/ # Deployment policy subcommands
│ ├── deploymentpolicy.go # Parent command
│ └── deploymentpolicy_reset.go # Batch state reset
├── node/ # Node subcommands
│ ├── node.go # Parent command
│ ├── node_list.go
│ ├── node_status.go
│ ├── node_reset.go
│ └── node_ignore.go # Ignore and unignore commands
└── package/ # Package subcommands
    ├── package.go # Parent command
    ├── package_status.go
    ├── package_rerun.go
    └── package_logs.go
operator/internal/cli/ # Shared CLI utilities
├── client/ # Kubernetes client wrapper
├── context/ # CLI context and global flags
└── utils/ # Shared utilities
main()
└── cli.Execute()
    └── NewSkyhookCommand(ctx)
        ├── NewVersionCmd(ctx)
        ├── NewResetCmd(ctx)
        ├── NewPauseCmd(ctx)
        ├── NewResumeCmd(ctx)
        ├── NewDisableCmd(ctx)
        ├── NewEnableCmd(ctx)
        ├── deploymentpolicy.NewDeploymentPolicyCmd(ctx)
        │   └── NewResetCmd(ctx)
        ├── node.NewNodeCmd(ctx)
        │   ├── NewListCmd(ctx)
        │   ├── NewStatusCmd(ctx)
        │   ├── NewResetCmd(ctx)
        │   ├── NewIgnoreCmd(ctx)
        │   └── NewUnignoreCmd(ctx)
        └── pkg.NewPackageCmd(ctx)
            ├── NewStatusCmd(ctx)
            ├── NewRerunCmd(ctx)
            └── NewLogsCmd(ctx)
# Run CLI tests
make test-cli
# Run all tests
make test