ROADMAP.md: 8 additions & 8 deletions
@@ -21,7 +21,7 @@ v1 is the first release we declare stable and production-ready. It is a promise,
**Graduate the CRDs to `v1`.** Lock the Skyhook and DeploymentPolicy schemas and ship a conversion webhook so existing clusters migrate in place.
-**Freeze the package contract.** What the operator and CLI can know about a package changes fundamentally once packages are OCI artifacts, so the package contract is part of the surface we are freezing. The schema-affecting work that must land before the freeze: configurable node-drain options ([#259](https://github.com/NVIDIA/skyhook/issues/259)), admission-webhook validation of the `dependsOn` DAG (acyclic, every target present, versions well-formed), package-declared interrupts, and the new preflight stage. Dependency handling is validation only: there is no resolver and no auto-install of dependencies in v1.
+**Freeze the package contract.** What the operator and CLI can know about a package changes fundamentally once packages are OCI artifacts, so the package contract is part of the surface we are freezing. The schema-affecting work that must land before the freeze: configurable node-drain options ([#259](https://github.com/NVIDIA/nodewright/issues/259)), admission-webhook validation of the `dependsOn` DAG (acyclic, every target present, versions well-formed), package-declared interrupts, and the new preflight stage. Dependency handling is validation only: there is no resolver and no auto-install of dependencies in v1.
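The `dependsOn` checks named in this paragraph (acyclic, every target present, versions well-formed) could be sketched roughly as follows. This is a minimal illustration, not the project's webhook: the `Package` type, the `ValidateDependsOn` function, the name-to-version shape of `DependsOn`, and the semver-like version pattern are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// Package is a hypothetical, simplified view of one package entry.
// DependsOn maps a dependency's name to the version it requires (assumed shape).
type Package struct {
	Version   string
	DependsOn map[string]string
}

// versionRE is an assumed "well-formed" rule: MAJOR.MINOR.PATCH with an optional "v".
var versionRE = regexp.MustCompile(`^v?\d+\.\d+\.\d+$`)

// sortedKeys returns map keys in sorted order so errors are deterministic.
func sortedKeys[V any](m map[string]V) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

// ValidateDependsOn only validates; it never resolves or installs anything.
func ValidateDependsOn(pkgs map[string]Package) error {
	names := sortedKeys(pkgs)

	// Pass 1: every target is present and every version is well-formed.
	for _, name := range names {
		p := pkgs[name]
		if !versionRE.MatchString(p.Version) {
			return fmt.Errorf("package %q: malformed version %q", name, p.Version)
		}
		for _, dep := range sortedKeys(p.DependsOn) {
			if _, ok := pkgs[dep]; !ok {
				return fmt.Errorf("package %q: dependsOn target %q is not present", name, dep)
			}
			if ver := p.DependsOn[dep]; !versionRE.MatchString(ver) {
				return fmt.Errorf("package %q: malformed version %q for %q", name, ver, dep)
			}
		}
	}

	// Pass 2: depth-first search; meeting an in-progress node means a cycle.
	const (
		unvisited = iota
		inProgress
		done
	)
	state := make(map[string]int, len(pkgs))
	var visit func(name string) error
	visit = func(name string) error {
		switch state[name] {
		case inProgress:
			return fmt.Errorf("dependsOn cycle detected at package %q", name)
		case done:
			return nil
		}
		state[name] = inProgress
		for _, dep := range sortedKeys(pkgs[name].DependsOn) {
			if err := visit(dep); err != nil {
				return err
			}
		}
		state[name] = done
		return nil
	}
	for _, name := range names {
		if err := visit(name); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	pkgs := map[string]Package{
		"driver":  {Version: "1.2.0", DependsOn: map[string]string{"tuning": "0.9.1"}},
		"tuning":  {Version: "0.9.1", DependsOn: map[string]string{"driver": "1.2.0"}},
	}
	fmt.Println(ValidateDependsOn(pkgs)) // reports a cycle
}
```

Running the structural checks before the cycle search keeps the search simple, since every edge is then known to point at an existing package.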
**Feature-release process.** A defined way to introduce and graduate features after v1 so new work lands without destabilizing the frozen surface. The likely shape is a maturity path every feature flows through (off-by-default or opt-in before it becomes default), with feature flags as one mechanism for opting in or out. The process is the deliverable, not just the flags; because the flags surface is most likely a CRD field or operator config, its shape is part of what we freeze. This is the "how do we add things safely after the freeze" counterpart to the versioning policy below.
@@ -31,29 +31,29 @@ v1 is the first release we declare stable and production-ready. It is a promise,
### 2. Feature Completeness
-**The shared core: Go agent and OCI/ORAS packages.** Completing the Python-to-Go agent rewrite and moving packages to OCI artifacts ([#194](https://github.com/NVIDIA/skyhook/issues/194), [#214](https://github.com/NVIDIA/skyhook/issues/214)–[#222](https://github.com/NVIDIA/skyhook/issues/222)) is the keystone of v1. Today the Python agent is a black box the operator can only run and watch. Once the agent is Go and packages are OCI artifacts, package-handling becomes one shared Go library that all three binaries use, each in its own role: the operator reads a package to drive orchestration (its steps, interrupts, uninstall support, schema, dependencies) without executing it; the CLI uses the same library to validate a package locally before it is applied; the agent streams and applies the layers.
+**The shared core: Go agent and OCI/ORAS packages.** Completing the Python-to-Go agent rewrite and moving packages to OCI artifacts ([#194](https://github.com/NVIDIA/nodewright/issues/194), [#214](https://github.com/NVIDIA/nodewright/issues/214)–[#222](https://github.com/NVIDIA/nodewright/issues/222)) is the keystone of v1. Today the Python agent is a black box the operator can only run and watch. Once the agent is Go and packages are OCI artifacts, package-handling becomes one shared Go library that all three binaries use, each in its own role: the operator reads a package to drive orchestration (its steps, interrupts, uninstall support, schema, dependencies) without executing it; the CLI uses the same library to validate a package locally before it is applied; the agent streams and applies the layers.
This cutover must complete before GA. Shipping v1 on the Python agent and swapping the runtime to Go afterward would be too disruptive for users, so the cutover is a v1 blocker rather than a 1.x follow-up.
**Bring-your-own, discovered interrupts.** Interrupts stop being a fixed enum the operator and agent both hardcode (NoOp, NodeRestart, ServiceRestart). A package declares its own interrupt, and the operator discovers what a package needs by reading it. The package becomes the source of truth; the operator orchestrates whatever it finds. Interrupts configured in the spec do not go away: they become an **override** of the discovered behavior, so an operator can force or suppress an interrupt when they need to deviate from what the package reports.
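One way to read the override rule in this paragraph is "the spec wins when it is set, otherwise use what the package declares". The sketch below assumes that reading. The `Interrupt` type and `EffectiveInterrupt` function are hypothetical names, and treating a spec-level `NoOp` as the way to suppress an interrupt is also an assumption, not something the roadmap specifies.

```go
package main

import "fmt"

// Interrupt is a hypothetical, simplified interrupt description.
// Type is a free-form string because packages declare their own interrupts.
type Interrupt struct {
	Type     string
	Services []string
}

// EffectiveInterrupt returns the spec override when one is configured,
// and otherwise falls back to what was discovered from the package.
func EffectiveInterrupt(discovered, specOverride *Interrupt) *Interrupt {
	if specOverride != nil {
		return specOverride
	}
	return discovered
}

func main() {
	discovered := &Interrupt{Type: "ServiceRestart", Services: []string{"containerd"}}

	// No override: the package stays the source of truth.
	fmt.Println(EffectiveInterrupt(discovered, nil).Type) // ServiceRestart

	// Override (assumed form): force a no-op to suppress the discovered interrupt.
	fmt.Println(EffectiveInterrupt(discovered, &Interrupt{Type: "NoOp"}).Type) // NoOp
}
```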
**Preflight stage and dynamic cordon/drain.** A new `preflight` lifecycle stage runs before cordon/drain, where each package reports whether it actually needs to interrupt this time. Cordon and drain become dynamic: the operator skips them when nothing requires an interrupt, rather than cordoning and draining because a package might. Because preflight evaluates what would happen without doing it, running it in report-only mode is the natural substrate for dry-run.
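The dynamic cordon/drain decision described here can be sketched as a pure function over per-package preflight results. `PreflightReport`, `NodePlan`, and `PlanNode` are hypothetical names for illustration. Because the function only computes a plan and changes nothing, printing its result is a rough stand-in for the report-only (dry-run) mode.

```go
package main

import "fmt"

// PreflightReport is a hypothetical result of one package's preflight stage.
type PreflightReport struct {
	Package        string
	NeedsInterrupt bool
}

// NodePlan is what the operator would do next for a node; computing it
// has no side effects, so it can also be shown as a dry-run report.
type NodePlan struct {
	CordonAndDrain bool
	Interrupting   []string // packages that asked for an interrupt this time
}

// PlanNode cordons and drains only if at least one package reports
// that it needs an interrupt on this run.
func PlanNode(reports []PreflightReport) NodePlan {
	var plan NodePlan
	for _, r := range reports {
		if r.NeedsInterrupt {
			plan.CordonAndDrain = true
			plan.Interrupting = append(plan.Interrupting, r.Package)
		}
	}
	return plan
}

func main() {
	plan := PlanNode([]PreflightReport{
		{Package: "kernel-tuning", NeedsInterrupt: false},
		{Package: "driver", NeedsInterrupt: true},
	})
	fmt.Printf("cordon/drain: %v, because of: %v\n", plan.CordonAndDrain, plan.Interrupting)
}
```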
-**The CLI as an operations tool.** The shared library that reads packages powers client-side validation: validate a package, show "what will this do", and diff versions before anything touches a node. That validation and the operational verbs are in scope for v1: fine-grained node-state edits and targeted reset ([#257](https://github.com/NVIDIA/skyhook/issues/257)).
+**The CLI as an operations tool.** The shared library that reads packages powers client-side validation: validate a package, show "what will this do", and diff versions before anything touches a node. That validation and the operational verbs are in scope for v1: fine-grained node-state edits and targeted reset ([#257](https://github.com/NVIDIA/nodewright/issues/257)).
**Acceptance:** a package distributed as a signed OCI artifact is inspected by the operator and applied by the Go agent end-to-end, and a node is cordoned and drained only when a package's preflight reports that an interrupt is required.
### 3. Production Hardening
-**Replace unmaintained and unversioned dependencies.** Move operator metrics off kube-rbac-proxy to controller-runtime's built-in auth ([#206](https://github.com/NVIDIA/skyhook/issues/206)), which also resolves the TLS-handshake noise reports, and migrate the cleanup-job image off `bitnami/kubectl` to a maintained, versioned alternative ([#207](https://github.com/NVIDIA/skyhook/issues/207)).
+**Replace unmaintained and unversioned dependencies.** Move operator metrics off kube-rbac-proxy to controller-runtime's built-in auth ([#206](https://github.com/NVIDIA/nodewright/issues/206)), which also resolves the TLS-handshake noise reports, and migrate the cleanup-job image off `bitnami/kubectl` to a maintained, versioned alternative ([#207](https://github.com/NVIDIA/nodewright/issues/207)).
-**Correctness under partial failure.** Stop the ConfigMap volume from clobbering files baked into the package image ([#208](https://github.com/NVIDIA/skyhook/issues/208)); fix the ConfigMap desync where a stale `Status.ConfigUpdates` entry drives a spurious interrupt ([#245](https://github.com/NVIDIA/skyhook/issues/245)); fix the taint-on-reboot case when `applyOnReboot`, `runtimeRequired`, and `autoTaintNewNodes` combine ([#180](https://github.com/NVIDIA/skyhook/issues/180)); and surface ImagePullBackOff / ErrImagePull as an explicit error state instead of a silent hang.
+**Correctness under partial failure.** Stop the ConfigMap volume from clobbering files baked into the package image ([#208](https://github.com/NVIDIA/nodewright/issues/208)); fix the ConfigMap desync where a stale `Status.ConfigUpdates` entry drives a spurious interrupt ([#245](https://github.com/NVIDIA/nodewright/issues/245)); fix the taint-on-reboot case when `applyOnReboot`, `runtimeRequired`, and `autoTaintNewNodes` combine ([#180](https://github.com/NVIDIA/nodewright/issues/180)); and surface ImagePullBackOff / ErrImagePull as an explicit error state instead of a silent hang.
-**Run package execution as Jobs.** Migrate package execution from raw Pods to Kubernetes Jobs ([#223](https://github.com/NVIDIA/skyhook/issues/223)) so retry and backoff use native semantics.
+**Run package execution as Jobs.** Migrate package execution from raw Pods to Kubernetes Jobs ([#223](https://github.com/NVIDIA/nodewright/issues/223)) so retry and backoff use native semantics.
**Verifiable provenance.** Build-time provenance already ships. v1 adds the runtime half: the operator and CLI verify a package's attestation via the OCI referrers API before pull or apply. This lands together with the OCI package work rather than as a separate path.
-**Release and CI reliability.** Eliminate the false-positive e2e flakiness that forces reruns, and fix the bug where RC tags steal release notes from the stable release ([#246](https://github.com/NVIDIA/skyhook/issues/246)).
+**Release and CI reliability.** Eliminate the false-positive e2e flakiness that forces reruns, and fix the bug where RC tags steal release notes from the stable release ([#246](https://github.com/NVIDIA/nodewright/issues/246)).
**Multi-tenant node access control (design-first).** The operator is a privileged deputy: at apply time, node mutations run with the operator's credentials, so RBAC on the namespaced Skyhook CR does not constrain which cluster-scoped nodes a tenant can target. Close this with admission-time authorization keyed on the requester's identity. This needs a design before implementation, and its v1 gating depends on whether multi-tenant operation is a v1 use case.
@@ -65,7 +65,7 @@ This cutover must complete before GA. Shipping v1 on the Python agent and swappi
**CLI distribution and compatibility.** The CLI is distributed independently of the operator and must work against any supported operator version. Keep the CLI-to-operator backward-compatibility matrix current and feature-detect rather than silently no-op.
-**Project hygiene.** Add issue and PR-management automation (triage, labeler, welcome, lock-threads, inactive-PR reminders) ([#260](https://github.com/NVIDIA/skyhook/issues/260)) so the contribution pipeline scales with inbound flow.
+**Project hygiene.** Add issue and PR-management automation (triage, labeler, welcome, lock-threads, inactive-PR reminders) ([#260](https://github.com/NVIDIA/nodewright/issues/260)) so the contribution pipeline scales with inbound flow.
**Install and upgrade UX.** Review the install and upgrade flow end-to-end so adoption does not require hand-holding.
docs/kubernetes-support.md: 1 addition & 0 deletions
@@ -120,6 +120,7 @@ Waiting 4+ weeks lets the ecosystem stabilize and gives us confidence in support
### How do you test compatibility?
For each Skyhook release, we test against all supported Kubernetes versions using:
+
- GitHub Actions matrix builds with multiple K8s versions. The exact tested patch versions are owned by `CI_KIND_NODE_IMAGE_VERSIONS_JSON` in `operator/k8s-test-versions.mk`.
- Local testing with [kind](https://kind.sigs.k8s.io/)