Commit 62f5e9a

docs: add batch stickiness, SKYHOOK_NODE_ORDER, and reset changes for PR #183
1 parent 9f439c6 commit 62f5e9a

5 files changed: 59 additions, 3 deletions

agent/README.md

Lines changed: 2 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -31,4 +31,5 @@ There are a number of environment variables that can be used to control how the
3131

3232
The following are environment variables expected to be set by either the build system or skyhook-operator. It is not recommended they be changed manually.
3333
1. `OVERLAY_FRAMEWORK_VERSION` this is the version of the current overlay. It is expected that this gets set by the docker build system. It is required to be able to manage the history file. It must be in the format of `{package name}-{version}`
34-
1. `SKYHOOK_RESOURCE_ID` this is used to determine if an interrupt should be rerun. Interrupts are only run once per `SKYHOOK_RESOURCE_ID`. Skyhook operator should make this unique per configuration of the package.
34+
1. `SKYHOOK_RESOURCE_ID` this is used to determine if an interrupt should be rerun. Interrupts are only run once per `SKYHOOK_RESOURCE_ID`. Skyhook operator should make this unique per configuration of the package.
35+
2. `SKYHOOK_NODE_ORDER` zero-indexed monotonic position of this node in the rollout. The first batch's nodes get `0, 1, 2, ...` and subsequent batches continue from where the previous batch left off. Useful for kubeadm upgrade workflows where the first node (`SKYHOOK_NODE_ORDER=0`) runs a different command than subsequent nodes. See [Node Order Within a Rollout](../docs/ordering_of_skyhooks.md#node-order-within-a-rollout) for details.

docs/cli.md

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -147,7 +147,7 @@ kubectl skyhook reset gpu-init --skip-batch-reset --confirm
147147
| `--confirm, -y` | Skip confirmation prompt |
148148
| `--skip-batch-reset` | Skip resetting deployment policy batch state |
149149

150-
> **Note:** By default, `reset` also resets the deployment policy batch state so the next rollout starts from batch 1. Use `--skip-batch-reset` to preserve the existing batch state.
150+
> **Note:** By default, `reset` also resets the deployment policy batch state so the next rollout starts from batch 1, and clears node ordering state (`NodeOrderOffset` and `NodePriority`) so `SKYHOOK_NODE_ORDER` restarts from `0`. Use `--skip-batch-reset` to preserve the existing batch and ordering state.
151151
152152
### Deployment Policy Commands
153153

@@ -171,6 +171,7 @@ The `deployment-policy reset` command resets the batch processing state for all
171171
- Consecutive failure count
172172
- Completed and failed node counts
173173
- Stop flag
174+
- Node ordering state (`NodeOrderOffset` and `NodePriority`) — `SKYHOOK_NODE_ORDER` restarts from `0`
174175

175176
| Flag | Description |
176177
|------|-------------|

docs/deployment_policy.md

Lines changed: 12 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -171,6 +171,16 @@ When strategy parameters are not specified, the operator applies these defaults:
171171

172172
---
173173

174+
## Batch Stickiness
175+
176+
Nodes selected for a batch remain in that batch until every node has reached a definitive outcome — all packages complete, erroring, or blocked. The controller will not select new nodes for the next batch while the current batch has nodes still running between packages.
177+
178+
Batch membership is tracked via `NodePriority` in the Skyhook status. A node stays in `NodePriority` from the time it is picked for a batch until it completes all packages. This state is persisted in the CRD, so it survives controller restarts.
179+
180+
Each package pod also receives a `SKYHOOK_NODE_ORDER` environment variable reflecting the node's monotonic position in the rollout. See [Node Order Within a Rollout](ordering_of_skyhooks.md#node-order-within-a-rollout) for details.
181+
182+
---
183+
174184
## Selectors and Node Matching
175185

176186
Compartments use standard Kubernetes label selectors:
@@ -326,6 +336,8 @@ kubectl skyhook reset my-skyhook --confirm
326336
kubectl skyhook reset my-skyhook --skip-batch-reset --confirm
327337
```
328338

339+
Both `reset` and `deployment-policy reset` also clear `NodeOrderOffset` and `NodePriority`, so the next rollout starts with fresh node ordering (`SKYHOOK_NODE_ORDER` begins at `0`).
340+
329341
See [CLI documentation](cli.md) for full command details.
330342

331343
---

docs/operator-status-definitions.md

Lines changed: 12 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -81,4 +81,15 @@ uninstall (if downgrading) → cordon → wait → drain → apply → config
8181
cordon → wait → drain → upgrade (if upgrading) → config → interrupt → post-interrupt
8282
```
8383

84-
**Note**: The cordon, wait, and drain phases ensure that workloads are safely removed from the node before any package operations that require interrupts (such as reboots or kernel module changes) are executed.
84+
**Note**: The cordon, wait, and drain phases ensure that workloads are safely removed from the node before any package operations that require interrupts (such as reboots or kernel module changes) are executed.
85+
86+
## Skyhook Status Fields
87+
88+
The Skyhook resource's `.status` object includes fields that track batch rollout state. Two fields are particularly relevant for [batch stickiness](deployment_policy.md#batch-stickiness) and [node ordering](ordering_of_skyhooks.md#node-order-within-a-rollout):
89+
90+
| Field | Definition |
91+
|-------|------------|
92+
| `NodePriority` | Tracks which nodes are in the current active batch. A node stays in `NodePriority` from the time it is selected for a batch until it completes all packages. Prevents the controller from selecting new nodes while current batch nodes are between packages. |
93+
| `NodeOrderOffset` | Cumulative count of nodes removed from `NodePriority`. Combined with a node's position in the sorted `NodePriority` map, this produces the monotonic `SKYHOOK_NODE_ORDER` value injected into package pods. |
94+
95+
Both fields are persisted in the CRD and survive controller restarts. They are cleared by `kubectl skyhook reset` and `kubectl skyhook deployment-policy reset`.
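
The arithmetic described in the table can be sketched in a few lines of bash. This is an illustration of the documented behavior only, not the operator's implementation: `node_order` is a hypothetical helper, and it assumes `NodePriority` is sorted by node name, that positions within it are zero-indexed, and that node names contain no whitespace.

```bash
#!/usr/bin/env bash

# Illustrative only (not Skyhook code).
# Usage: node_order OFFSET TARGET NODE...
#   OFFSET  stands in for NodeOrderOffset
#   NODE... stands in for the node names currently in NodePriority
node_order() {
    local offset="$1" target="$2"
    shift 2
    local index=0 name

    # Walk the active batch in name order and stop at the target node.
    for name in $(printf '%s\n' "$@" | LC_ALL=C sort); do
        if [ "$name" = "$target" ]; then
            echo $((offset + index))
            return 0
        fi
        index=$((index + 1))
    done
    return 1  # target is not in the active batch
}

# Three nodes already left NodePriority; node-b is second in the sorted batch.
node_order 3 node-b node-c node-a node-b
```

Under these assumptions the call prints `4`: an offset of 3 plus position 1 in the sorted batch.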

docs/ordering_of_skyhooks.md

Lines changed: 31 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -70,6 +70,37 @@ In this example, fast nodes can install drivers independently, but all nodes mus
7070

7171
---
7272

73+
## Node Order Within a Rollout
74+
75+
The sections above cover ordering of Skyhooks relative to each other. This section covers ordering of **nodes** within a single Skyhook's rollout.
76+
77+
When a [DeploymentPolicy](deployment_policy.md) controls the batch rollout, each package pod receives a `SKYHOOK_NODE_ORDER` environment variable — a zero-indexed integer reflecting the node's position in the overall rollout order.
78+
79+
- The first batch's nodes are assigned `0, 1, 2, ...`
80+
- The second batch continues from where the first left off (e.g., `3, 4, 5, ...`)
81+
- Values are monotonically increasing across batches and never reused within a rollout
82+
- Within a batch, nodes are sorted by name for deterministic tiebreaking
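
As a rough illustration of the rules above, the following bash sketch assigns order values across two batches. It is not Skyhook code: `assign_node_order` is a hypothetical helper, each argument stands for one batch given as a space-separated list of node names, and the node names themselves are made up.

```bash
#!/usr/bin/env bash

# Illustrative only: print "name=order" for every node, batch by batch.
assign_node_order() {
    local next=0 batch name

    for batch in "$@"; do
        # Within a batch, nodes are taken in name order.
        # $batch is intentionally unquoted so it splits into node names.
        for name in $(printf '%s\n' $batch | LC_ALL=C sort); do
            echo "$name=$next"
            next=$((next + 1))  # numbering continues into the next batch
        done
    done
}

assign_node_order "node-b node-a node-c" "node-e node-d"
```

Here the first batch receives `0`, `1`, `2` (for `node-a`, `node-b`, `node-c`) and the second batch continues with `3` and `4` (for `node-d`, `node-e`).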
83+
84+
### Use case: kubeadm upgrades
85+
86+
The primary motivation is kubeadm-style Kubernetes upgrades where the first control-plane node must run `kubeadm upgrade apply` and all subsequent nodes run `kubeadm upgrade node`:
87+
88+
```bash
if [ "$SKYHOOK_NODE_ORDER" -eq 0 ]; then
    kubeadm upgrade apply v1.35.0
else
    kubeadm upgrade node
fi
```
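
One caveat with the snippet above: if `SKYHOOK_NODE_ORDER` is unset or not a number, the `-eq` test reports an error and the script falls through to the `else` branch. A more defensive variant might validate the value first. The sketch below is one possible approach, not part of Skyhook; `choose_kubeadm_action` is a hypothetical helper that only prints which subcommand to use.

```bash
#!/usr/bin/env bash

# Hypothetical helper (not part of Skyhook): prints "apply" for the first node
# in the rollout and "node" for every other node.
choose_kubeadm_action() {
    local order="${1-}"

    # Reject empty or non-numeric input so a missing variable fails loudly.
    case "$order" in
        '' | *[!0-9]*)
            echo "SKYHOOK_NODE_ORDER must be a non-negative integer, got '${order}'" >&2
            return 1
            ;;
    esac

    if [ "$order" -eq 0 ]; then
        echo "apply"
    else
        echo "node"
    fi
}
```

A package script could call `choose_kubeadm_action "${SKYHOOK_NODE_ORDER-}"`, exit on a non-zero status, and then run `kubeadm upgrade apply v1.35.0` or `kubeadm upgrade node` depending on the printed value.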
95+
96+
### Scope
97+
98+
`SKYHOOK_NODE_ORDER` reflects rollout order within a single Skyhook only. Cross-Skyhook ordering is controlled by `priority` and `sequencing` (documented above). If a Skyhook is reset via `kubectl skyhook reset`, the node order restarts from `0`.
99+
100+
See [Batch Stickiness](deployment_policy.md#batch-stickiness) for details on how batches are kept intact during rollout.
101+
102+
---
103+
73104
## Flow Control Annotations
74105

75106
Two flow control features can be set in the annotations of each skyhook:
