Skip to content

Commit 507fa5e

Browse files
chore: Cherry pick again (#254)
* fix(test): uninstall downgraded package before cleanup in downgrade-after-uninstall (#241) The test left mypkg@1.2.3 freshly installed at chainsaw's automatic CLEANUP, where the new delete-time uninstall finalizer (from #200) must run uninstall pods on every node before releasing the CR. That exceeds chainsaw's default cleanup window, causing "context deadline exceeded". Add a final uninstall-v1 step that flips uninstall.apply=true on the downgraded version and waits for nodeState to empty, so the CR has no installed packages when cleanup deletes it. Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com> * fix: deadlock in webhook controller when upgrading from old versions (#243) * fix: close StageInterrupt trap + harden core e2e pool (#242) * fix(test): harden core e2e pool against finalizer-cleanup races and improve diagnosability Two changes across the six core-pool chainsaw tests: 1. Add `timeouts.cleanup: 120s` to every core-pool test. Chainsaw's default cleanup window (~30s) is too tight for the project's CR finalizer, which must uncordon nodes, remove labels/annotations, and GC package pods before releasing. Same failure mode that hit downgrade-after-uninstall (PR #241). 2. Add `catch:` blocks to `depends-on` and `simple-skyhook`, mirroring the pattern used by the other four core tests. When these tests flake, CI now captures the Node, Skyhook CR, and package Pods for diagnosis instead of failing without artifacts. No assert logic changes — this is a foundation pass to remove a known flake class and make future flakes diagnosable. Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com> * fix(operator): close StageInterrupt trap when configUpdates signal decays A package whose only interrupt is a `configInterrupts` entry (no top-level `interrupt` block) could get stuck at `StageInterrupt/StateSkipped` permanently when `Status.ConfigUpdates` cleared or never persisted (e.g. due to a 409 on the spec patch). The state machine queried the dynamic `HasInterrupt(config)` signal at four points, and once that signal decayed to false, the package was untouchable: `ProgressSkipped` wouldn't promote it, `NextStage` wouldn't advance it past Interrupt, and `GetComplete`/`IsPackageComplete` wouldn't count it complete. Decouple progression-past-Interrupt from the dynamic signal: - `NextStage`: add `StageInterrupt → StagePostInterrupt` to the no-interrupt default map. The with-interrupt full-replacement map is preserved as-is — PR #200 deliberately omits `StageUninstall → StageApply` from it so with-interrupt uninstalls route via `StageUninstallInterrupt`; collapsing the maps would silently re-enable that transition. - `ProgressSkipped`: drop the `HasInterrupt` gate. The only writer of `StateSkipped` at `StageInterrupt` is `ProcessInterrupt`'s budget-contention branch, which already decided the package needed an interrupt to schedule. Stage alone is sufficient. - `GetComplete`, `IsPackageComplete`: treat `StagePostInterrupt` as unconditionally terminal. The only way to reach it is via the interrupt cycle; gating the terminal check on a signal that can decay is redundant with the entry gate and only becomes load-bearing when the trap fires. Reproducer in `skyhook_types_test.go` exercises all three sites. Surfaced by the `chainsaw/config-skyhook` "update while running" step, which patches the spec mid-flight and concurrently triggers the `HandleMigrations` and `processSkyhooksPerNode` 409 conflict paths — those stretch convergence enough that ConfigUpdates can be lost between the package reaching StageInterrupt and ProgressSkipped firing. Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com> * refactor(operator): GetComplete delegates to IsPackageComplete GetComplete and IsPackageComplete encoded the same terminal-state logic twice. Have GetComplete iterate the spec packages and call IsPackageComplete for each, and move the explanatory comment to IsPackageComplete where the predicate lives. Behavior preserved: the prior implementation iterated node state and filtered to spec packages by name+version match; the new one iterates spec packages and looks them up in node state by unique name (name|version). Same set of packages either way. Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com> --------- Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com> --------- Signed-off-by: Alex Yuskauskas <ayuskauskas@nvidia.com> Co-authored-by: ayuskauskas <ayuskauskas@nvidia.com>
1 parent 6bfd4cc commit 507fa5e

185 files changed

Lines changed: 18198 additions & 71375 deletions

File tree

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

docs/README.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -30,6 +30,8 @@ This directory contains user and operator documentation for Skyhook. Here you'll
3030

3131
- [Operator Status Definitions](operator-status-definitions.md): Definitions of Status, State, Stage, and Condition concepts used throughout the Skyhook operator.
3232

33+
- [Webhook Bootstrap Lease](designs/webhook-bootstrap-lease.md): Why the admission-webhook cert bootstrap runs on a dedicated leader-election lease, and the runbook for upgrading from v0.7.x (where the new lease does not yet exist).
34+
3335
- **Process**
3436
- [Releases](releases.md):
3537
Release notes and upgrade information for Skyhook.
Lines changed: 139 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,139 @@
1+
# Webhook bootstrap lease
2+
3+
This document describes why the operator's webhook cert bootstrap runs under a
4+
**dedicated** leader-election lease, separate from the main reconcile lease, and
5+
what guarantees that design does (and does not) provide.
6+
7+
## Background: the v0.7.x → v0.15.x deadlock
8+
9+
Starting in v0.8.0 the operator self-bootstraps its admission-webhook TLS:
10+
on startup the leader generates a CA + serving cert, writes `Secret/webhook-cert`
11+
in the operator namespace, and patches the `caBundle` of the
12+
`MutatingWebhookConfiguration` and `ValidatingWebhookConfiguration`. Earlier
13+
versions (≤ v0.7.x) relied on a different bootstrap model and have no knowledge
14+
of `webhook-cert`.
15+
16+
On an in-place rollout from v0.7.x to a self-bootstrapping version, the cluster
17+
can enter the following dead state:
18+
19+
1. The new pod (v0.15.x) starts as a **follower**: the leader-election lease
20+
is still held by an old v0.7.x pod.
21+
2. The new pod's webhook controller is gated on leader election, so it never
22+
creates `webhook-cert`. Its readiness probe (which waits for the secret)
23+
never goes green.
24+
3. The old leader has no concept of `webhook-cert` and never creates it. The
25+
old leader also has nothing that would cause it to release the lease.
26+
4. Meanwhile, Argo/Helm applying the new manifests has reset the
27+
`MutatingWebhookConfiguration` `caBundle` to empty, so the apiserver no
28+
longer trusts the v0.7.x pods either. With `failurePolicy: Fail`, all
29+
CREATE/UPDATE on Skyhook CRs fail with
30+
`x509: certificate signed by unknown authority`.
31+
32+
Net result: the new pod waits forever for a secret that requires leadership to
33+
create, the old pod holds leadership forever because the new pod never goes
34+
Ready, and Argo cannot reconcile any further Skyhook config changes.
35+
36+
The only field operations that broke the cycle were manual:
37+
38+
- Delete the old-version pods to force a lease handover, *or*
39+
- Delete the lease object directly (`Lease/3c22c1ae.nvidia.com`).
40+
41+
## Design: a second lease, owned by a second manager
42+
43+
The webhook bootstrap is now driven by a **dedicated `ctrl.Manager`** with its
44+
own leader-election lease name (`nodewright-webhook-bootstrap.nvidia.com`), separate from the
45+
main reconcile manager's lease (`3c22c1ae.nvidia.com`).
46+
47+
```
48+
┌────────────────────────────────────────────────────────────────────────┐
49+
│ operator process (one per pod) │
50+
│ │
51+
│ ┌────────────────────────────────────┐ │
52+
│ │ reconcile manager │ lease: 3c22c1ae.nvidia.com │
53+
│ │ - SkyhookReconciler (leader-gated)│ │
54+
│ │ - SecretCertWatcher (every pod) │ │
55+
│ │ - webhook server (every pod) │ │
56+
│ │ - readyz / healthz │ │
57+
│ └────────────────────────────────────┘ │
58+
│ │
59+
│ ┌────────────────────────────────────┐ │
60+
│ │ webhook bootstrap manager │ lease: nodewright-webhook-bootstrap.nvidia.com │
61+
│ │ - WebhookController (leader-gated)│ │
62+
│ │ creates webhook-cert Secret │ │
63+
│ │ patches caBundle on │ │
64+
│ │ {Validating,Mutating}WebhookConfiguration │ │
65+
│ └────────────────────────────────────┘ │
66+
└────────────────────────────────────────────────────────────────────────┘
67+
```
68+
69+
`SecretCertWatcher` (`NeedLeaderElection() == false`) continues to run on the
70+
reconcile manager on every pod. It watches the namespace's Secrets via the
71+
reconcile manager's cache and syncs `webhook-cert` to the local disk so this
72+
pod's webhook server can serve TLS. That part is unchanged.
73+
74+
Both managers share the same `rest.Config` and `scheme` and start under the
75+
same signal handler via `errgroup.WithContext`. If either manager exits with
76+
an error the process exits non-zero.
77+
78+
### Invariants this preserves
79+
80+
- **Single writer to `webhook-cert` and to the webhook configuration caBundle.**
81+
Exactly one pod holds `nodewright-webhook-bootstrap.nvidia.com` at a time. There is no
82+
split-brain risk between operator versions that both understand this lease.
83+
- **Webhook server runs on every pod.** The webhook server, its readiness
84+
probe, and `SecretCertWatcher` are all on the reconcile manager, not gated on
85+
the bootstrap lease — so all pods can serve admission traffic as soon as
86+
`webhook-cert` exists locally.
87+
- **`failurePolicy: Fail` on the admission webhooks remains in effect.** We
88+
do not need to relax admission to make this work.
89+
90+
### What this fixes
91+
92+
Any *future* upgrade discontinuity between two versions that both understand
93+
the dedicated bootstrap lease: a new pod can win
94+
`nodewright-webhook-bootstrap.nvidia.com` and complete the cert bootstrap independently of
95+
who holds the main reconcile lease. Readiness flips green, Service endpoints
96+
rotate, the rollout completes.
97+
98+
### What this does NOT fix
99+
100+
This design **cannot** retroactively fix the v0.7.x → v0.15.x upgrade
101+
specifically, because v0.7.x has no knowledge of the new lease. Operators
102+
on v0.7.x still need the manual workaround:
103+
104+
```bash
105+
# In the operator's namespace:
106+
kubectl -n skyhook get pods # identify old-version pods
107+
kubectl -n skyhook delete pod <old-pod-1> <old-pod-2> # free the lease
108+
# Or, equivalently:
109+
kubectl -n skyhook delete lease 3c22c1ae.nvidia.com
110+
111+
# Verify recovery:
112+
kubectl -n skyhook get secret webhook-cert -w
113+
kubectl get mutatingwebhookconfiguration skyhook-operator-mutating-webhook \
114+
-o jsonpath='{.webhooks[0].clientConfig.caBundle}' | wc -c # must be > 0
115+
kubectl -n skyhook rollout status deploy/skyhook-operator-controller-manager
116+
```
117+
118+
The runbook above should be added to release notes for any future major
119+
operator version that changes the webhook bootstrap model.
120+
121+
## Implementation notes
122+
123+
- Lease names live as constants in `operator/cmd/manager/main.go`:
124+
`reconcileLeaseID`, `webhookBootstrapLeaseID`. **Do not rename them** without
125+
a coordinated upgrade plan — renaming the bootstrap lease re-introduces
126+
exactly the deadlock this design avoids.
127+
- The bootstrap manager is only constructed when `ENABLE_WEBHOOKS=true`.
128+
- The bootstrap manager disables its own metrics endpoint
129+
(`metricsserver.Options{BindAddress: "0"}`) and does not bind a health-probe
130+
port; the reconcile manager already owns those.
131+
- The bootstrap manager runs its own cache, which adds informers for
132+
`Secret/webhook-cert` and the two webhook configurations only. The reconcile
133+
manager's cache also watches Secrets via `SecretCertWatcher`. Two
134+
Secret informers in the operator namespace is the cost of the split and is
135+
considered acceptable.
136+
- `WebhookSecretReadyzCheck` reads through the bootstrap manager's client and
137+
is registered on the reconcile manager's health-probe server. Cache-sync
138+
races during startup naturally result in NotReady, which is the desired
139+
behavior.

k8s-tests/chainsaw/skyhook/config-skyhook/chainsaw-test.yaml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -24,6 +24,7 @@ metadata:
2424
spec:
2525
timeouts:
2626
assert: 240s ## 5 steps with full package lifecycles including mid-flight updates
27+
cleanup: 120s ## finalizer needs time to uncordon, remove labels/annotations, and GC pods
2728
catch: ## if errors, print the most important info
2829
- get:
2930
apiVersion: v1

k8s-tests/chainsaw/skyhook/depends-on/chainsaw-test.yaml

Lines changed: 18 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -24,6 +24,24 @@ metadata:
2424
spec:
2525
timeouts:
2626
assert: 90s
27+
cleanup: 120s ## finalizer needs time to uncordon, remove labels/annotations, and GC pods
28+
catch: ## if errors, print the most important info
29+
- get:
30+
apiVersion: v1
31+
kind: Node
32+
selector: skyhook.nvidia.com/test-node=skyhooke2e
33+
format: yaml
34+
- get:
35+
apiVersion: skyhook.nvidia.com/v1alpha1
36+
kind: Skyhook
37+
name: depends-on
38+
format: yaml
39+
- get:
40+
apiVersion: v1
41+
kind: Pod
42+
namespace: skyhook
43+
selector: skyhook.nvidia.com/name=depends-on
44+
format: yaml
2745
steps:
2846
- name: setup
2947
description: Reset node state, create skyhook with dependencies, and assert it completes

k8s-tests/chainsaw/skyhook/downgrade-after-uninstall/README.md

Lines changed: 7 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -20,6 +20,10 @@ the complementary case where no uninstall is required.
2020
3. Update the spec to v1.2.3. The webhook accepts the downgrade because the
2121
package is absent from all tracked nodes. The new version installs and
2222
reaches `stage=config, state=complete`.
23+
4. Flip `uninstall.apply: true` on v1.2.3 and wait for the package to be
24+
absent from node state again. This is purely operational — it leaves the
25+
CR with no installed packages so chainsaw's automatic CLEANUP can delete
26+
it without waiting on the delete-time uninstall finalizer.
2327

2428
## Key Features Tested
2529

@@ -31,7 +35,8 @@ the complementary case where no uninstall is required.
3135

3236
## Files
3337

34-
- `chainsaw-test.yaml` — Main test: install v2 → uninstall v2 → downgrade to v1
38+
- `chainsaw-test.yaml` — Main test: install v2 → uninstall v2 → downgrade to v1 → uninstall v1
3539
- `skyhook.yaml` — Initial v2.1.4 Skyhook, `uninstall.enabled: true`
36-
- `update-trigger-uninstall.yaml` — Patch setting `uninstall.apply: true`
40+
- `update-trigger-uninstall.yaml` — Patch setting `uninstall.apply: true` on v2.1.4
3741
- `update-downgrade.yaml` — Patch changing version to v1.2.3
42+
- `update-trigger-uninstall-v1.yaml` — Patch setting `uninstall.apply: true` on v1.2.3 to clear node state before cleanup

k8s-tests/chainsaw/skyhook/downgrade-after-uninstall/chainsaw-test.yaml

Lines changed: 18 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -99,3 +99,21 @@ spec:
9999
name: downgrade-after-uninstall
100100
status:
101101
status: complete
102+
103+
- name: uninstall-v1
104+
description: >
105+
Uninstall the downgraded v1.2.3 package so the CR has no installed
106+
packages at chainsaw cleanup time — keeps the finalizer's delete-time
107+
uninstall flow off the cleanup critical path.
108+
try:
109+
- update:
110+
file: update-trigger-uninstall-v1.yaml
111+
- assert:
112+
resource:
113+
apiVersion: v1
114+
kind: Node
115+
metadata:
116+
labels:
117+
skyhook.nvidia.com/test-node: skyhooke2e
118+
annotations:
119+
skyhook.nvidia.com/nodeState_downgrade-after-uninstall: '{}'
Lines changed: 33 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,33 @@
1+
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
2+
# SPDX-License-Identifier: Apache-2.0
3+
#
4+
#
5+
# Licensed under the Apache License, Version 2.0 (the "License");
6+
# you may not use this file except in compliance with the License.
7+
# You may obtain a copy of the License at
8+
#
9+
# http://www.apache.org/licenses/LICENSE-2.0
10+
#
11+
# Unless required by applicable law or agreed to in writing, software
12+
# distributed under the License is distributed on an "AS IS" BASIS,
13+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14+
# See the License for the specific language governing permissions and
15+
# limitations under the License.
16+
17+
apiVersion: skyhook.nvidia.com/v1alpha1
18+
kind: Skyhook
19+
metadata:
20+
name: downgrade-after-uninstall
21+
spec:
22+
nodeSelectors:
23+
matchLabels:
24+
skyhook.nvidia.com/test-node: skyhooke2e
25+
interruptionBudget:
26+
count: 1
27+
packages:
28+
mypkg:
29+
version: "1.2.3"
30+
image: ghcr.io/nvidia/skyhook/agentless
31+
uninstall:
32+
enabled: true
33+
apply: true

k8s-tests/chainsaw/skyhook/simple-skyhook/chainsaw-test.yaml

Lines changed: 18 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -25,6 +25,24 @@ spec:
2525
timeouts:
2626
assert: 240s
2727
exec: 90s
28+
cleanup: 120s ## finalizer needs time to uncordon, remove labels/annotations, and GC pods
29+
catch: ## if errors, print the most important info
30+
- get:
31+
apiVersion: v1
32+
kind: Node
33+
selector: skyhook.nvidia.com/test-node=skyhooke2e
34+
format: yaml
35+
- get:
36+
apiVersion: skyhook.nvidia.com/v1alpha1
37+
kind: Skyhook
38+
name: simple-skyhook
39+
format: yaml
40+
- get:
41+
apiVersion: v1
42+
kind: Pod
43+
namespace: skyhook
44+
selector: skyhook.nvidia.com/name=simple-skyhook
45+
format: yaml
2846
steps:
2947
- name: deploy
3048
description: Reset state, apply a skyhook with three packages, and verify all complete with correct metrics

k8s-tests/chainsaw/skyhook/simple-update-skyhook/chainsaw-test.yaml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -24,6 +24,7 @@ metadata:
2424
spec:
2525
timeouts:
2626
assert: 240s
27+
cleanup: 120s ## finalizer needs time to uncordon, remove labels/annotations, and GC pods
2728
catch: ## if errors, print the most important info
2829
- get:
2930
apiVersion: v1

k8s-tests/chainsaw/skyhook/strict-order/chainsaw-test.yaml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -25,6 +25,7 @@ spec:
2525
timeouts:
2626
assert: 90s
2727
exec: 90s
28+
cleanup: 120s ## finalizer needs time to uncordon, remove labels/annotations, and GC pods
2829
catch: ## if errors, print the most important info
2930
- get:
3031
apiVersion: v1

0 commit comments

Comments
 (0)