fix: close override snapshot transition race on update#691
fix: close override snapshot transition race on update#691Yetkin Timocin (ytimocin) wants to merge 7 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR fixes a race in the override controllers’ snapshot “latest” transition by switching from demote-old-then-create-new to create-new-first-then-demote-old, preventing a crash window where no snapshot is labeled is-latest-snapshot=true and rollout would not be enqueued until a long resync. To make the brief “double latest=true” window safe, it adds read-time deduplication in the overrider hot path and introduces a sibling-audit helper to clean up stale latest=true labels.
Changes:
- Reordered CRO/RO snapshot updates to create-first, demote-after, including an
AlreadyExistsrecovery path that verifies snapshot hash via an uncached reader and emits a Warning event on mismatch. - Added
cleanupStaleLatestSiblingsto flip stalelatest=truelabels on older snapshots, and added read-time dedup inFetchAllMatchingOverridesForResourceSnapshot. - Added/updated unit + integration tests for dedup logic and sibling cleanup behaviors; updated controller wiring to pass an uncached reader.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/utils/overrider/overrider.go | Deduplicates “latest” override snapshots at read time to tolerate transient duplicates. |
| pkg/utils/overrider/overrider_test.go | Adds unit tests for generic dedup + a fetch test covering duplicate latest snapshots. |
| pkg/controllers/overrider/common.go | Adds UncachedReader, event recorder field, snapshot pruning slice trimming, and sibling cleanup helper. |
| pkg/controllers/overrider/common_test.go | Adds integration tests for sibling cleanup and AlreadyExists recovery branches. |
| pkg/controllers/overrider/clusterresource_controller.go | Implements create-first/demote-after ordering + AlreadyExists hash verification + event emission. |
| pkg/controllers/overrider/resource_controller.go | Same ordering + recovery logic for namespaced overrides; sets up recorder. |
| pkg/controllers/overrider/suite_test.go | Wires UncachedReader into envtest manager setup. |
| cmd/hubagent/workload/setup.go | Wires UncachedReader into hub-agent override controller setup. |
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
E2E validation on Kind clustersRan four targeted edge-case tests against this branch on a fresh Setup
Edge cases
Sample resultsTest 1 — after force-killing all hub-agent pods ( Exactly one Test 2 — 10 patches submitted in parallel via Test 3 — manually re-flipped snapshot 0 to The new audit helper fired exactly once on the injected duplicate. Test 4 — two Hub-agent log scan across all four runs
Side observationThe override controller uses Baseline
|
| } | ||
| // AlreadyExists here normally means a prior reconcile succeeded the Create but failed to | ||
| // flip the old snapshot. Verify the existing object's hash before proceeding so we don't | ||
| // silently promote stale content (etcd restore from backup, manual edit, future hash bug). |
There was a problem hiding this comment.
Have a question: why would "etcd restore from backup" result in AlreadyExists path?
What does "future hash bug" mean? Does it mean hash function change?
There was a problem hiding this comment.
For the 1st question, I was trying to be extra defensive by thinking about a partial restore. And, for the 2nd question, yes, I was thinking about a hash function change in between versions.
| klog.ErrorS(getErr, "Failed to get existing overrideSnapshot for hash verification", "resourceOverride", roKObj, "newOverrideSnapshot", klog.KObj(newSnapshot)) | ||
| return controller.NewAPIServerError(false, getErr) | ||
| } | ||
| if string(existing.Spec.OverrideHash) != overrideSpecHash { |
There was a problem hiding this comment.
Maybe we can recompute overrideHash of existing to avoid situations where someone modifies the overrideHash?
There was a problem hiding this comment.
API Server applies defaults on Create, and, that is why, the hashes would be different between Create and re-hash of a read-back object.
The defaults come from kubebuilder:default markers on the API types — that's what makes the original and read-back objects differ.
| // The content has not changed; no new snapshot is needed. Ensure the highest-index | ||
| // snapshot is marked latest, then audit siblings to clean up any duplicate latest=true | ||
| // left by a prior partial run. | ||
| if err := r.ensureSnapshotLatest(ctx, latestSnapshot); err != nil { |
There was a problem hiding this comment.
When would we need to mark the snapshot as latest?
If we always create a new snapshot with is-latest = true, the latest snapshot already has is-latest label set to true even if a crash happened right after a new snapshot was created?
Want to understand the scenario that you have in mind so that we can think about how to handle.
There was a problem hiding this comment.
ensureSnapshotLatest was already here before this PR — I just kept calling it from the hash-match path as defensive insurance against out-of-band label edits. It's a no-op in steady state. Want me to drop it as part of this PR, or leave it alone since it's pre-existing?
a01c978 to
203995e
Compare
michaelawyu (michaelawyu)
left a comment
There was a problem hiding this comment.
Hi Yetkin! I added some comments; would be really happy to discuss about this issue further.
| if len(snapshotList.Items) != 0 { | ||
| // convert the last unstructured snapshot to the typed object | ||
| latestSnapshot := &placementv1beta1.ClusterResourceOverrideSnapshot{} | ||
| latestSnapshot = &placementv1beta1.ClusterResourceOverrideSnapshot{} |
There was a problem hiding this comment.
Hi Yetkin! Thanks so much for this. It is a very interesting PR and I know that such fixes take a long while and a lot of efforts to complete. It is just that, I couldn't help feeling that given the nature of the bug, i.e., rollout controller failing to catch the state change where a resource snapshot is marked as latest again, it can just be easily fixed by tweaking the handler funcs in the rollout controller a bit:
There was a problem hiding this comment.
The system just need an UpdateFunc that enqueues placement if it sees that for a snapshot update event, (e.ObjectOld.GetLabels()[isLatestLabelName] != e.ObjectNew.GetLabels()[isLatestLabelName]) && e.ObjectNew.GetLabels()[isLatestLabelName] == "true" (very roughly written).
There was a problem hiding this comment.
(There is actually a bit more to this bug -> please see some of the other comments, happy to discuss about this a bit more and make sure that I understand it correctly)
| klog.ErrorS(err, "Failed to set the isLatestSnapshot label to false", "clusterResourceOverride", croKObj, "overrideSnapshot", klog.KObj(latestSnapshot)) | ||
| return controller.NewUpdateIgnoreConflictError(err) | ||
| // Hash matches: re-mark this snapshot latest (idempotent) and audit siblings. | ||
| if err := r.ensureSnapshotLatest(ctx, latestSnapshot); err != nil { |
There was a problem hiding this comment.
This part is a bit interesting I think. If we do switch to the create-then-demote model, this part is a bit redundant (it is either always a no-op or always fail due to conflicts). And it cannot be reliably used to correct unexpected cases either (e.g., a latest label is set to false unexpectedly), as it would also trigger the issue we want to fix, i.e., the rollout controller missing the update.
| "existing snapshot %s has different content than the current spec; deleting it for the next reconcile to recreate. If this persists, investigate the snapshot for manual edits or a hash-function change between controller versions.", newSnapshot.Name) | ||
| // Drive observed → desired: delete the mismatched snapshot so the next reconcile | ||
| // recreates it with the correct content. | ||
| if delErr := r.Client.Delete(ctx, existing); delErr != nil && !errors.IsNotFound(delErr) { |
There was a problem hiding this comment.
This deletion part is a bit problematic I fear.
Aside from the semantic concerns (the snapshots should have monotonically, continuous indexes), the snapshot we would like to delete might have already been referenced in some bindings.
There was a problem hiding this comment.
This might lead to the work generator getting stuck in a loop of errors when processing some bindings as it can no longer find a referenced snapshot. As our rollout (mostly rolling update) checks for Applied/Availability info, it might also lead a rollout to be stuck.
| // cycle until the snapshot's labels are repaired. | ||
| // | ||
| // PT is *T constrained to client.Object so we can extract metadata and labels without copying. | ||
| func dedupLatestSnapshots[T any, PT interface { |
There was a problem hiding this comment.
Hi Yetkin! This part is particularly interesting I think.
One of the major benefits that the demote then create pattern yields is that, it guarantees that at any given point of watches (with the cache lagging or not), the consumer (whoever needs to read/get notified on the appearance of a latest snapshot) will either see:
a) there is no latest snapshot -> which means the system observes an intermediate state and can simply wait for the next event; or
b) there is exactly one latest snapshot.
So the consumer can just do a list op with a label selector, and decide what to do based on the number of items returned. No additional sorting/ranking is needed.
There was a problem hiding this comment.
e.g,
0 (latest) - | -> 0 (demoted) - | -> 1 (latest) - | -> 1 (demoted) - | -> 2 (latest) - | -> ...
| = time point where the consumer reads data
There was a problem hiding this comment.
But with the create then demote pattern,
e.g,
0 (latest) - | -> 1 (latest) - | -> 0 (demoted) - | -> 2 (latest) - | -> 1 (demoted) - | -> ...
Roughly (very roughly) 50% of the time, the consumer will see that there are multiple (2) latest snapshots. And it must do the sorting/ranking to determine the actual latest snapshot (or it can just back off until the situation is resolved by the snapshotting controller). The correctness part, as Yetkin you have acutely catches in this part of the code, can still be guaranteed, but it kinds of defeats the purpose of having the latest label. As the source of truth always lies in the monotonically increasing indexes, the consumer can simply just list every snapshot relevant and do the sorting/ranking every time to find the latest snapshot -> it's more costly, but more straightforward.
| // Compare stored OverrideHash, not a re-hash of existing.Spec.OverrideSpec: API-server | ||
| // defaulting on Create (selectionScope, scope, overrideType) makes the read-back spec | ||
| // differ from the input, so a recompute would never match. | ||
| if string(existing.Spec.OverrideHash) != overrideSpecHash { |
There was a problem hiding this comment.
This part is particularly tricky I assume (the flow itself, not specific to this PR)...
Generally speaking, if the user makes different edits each time (A -> B -> C -> ...) and we do not consider hash collision an issue, then this branch will never run and the issue of a past snapshot being marked as latest again will never occur.
The problem will only happen when the user does an A -> B -> A fashion of edits:
There was a problem hiding this comment.
In the case of a demote then create flow, there are two possibilities:
case 1 =>
there is a snapshot 1 with hash A -> the user edits overrides that has hash B -> the controller is triggered -> the controller has snapshot 1 demoted, but failed to create snapshot 2 with hash B -> the user edits overrides back to the state of hash A -> the controller is triggered again and marked snapshot 1 as latest
case 1 is harmless, as it still aligns with our eventually consistent model; the system is just behaving as if the edit of hash B never happened. There is no rollout needed either.
case 2 =>
there is a snapshot 1 with hash A -> the user edits overrides that has hash B -> the controller is triggered -> the controller has snapshot 1 demoted, and created snapshot 2 with hash B -> the user edits overrides back to the state of hash A -> the controller is triggered again, but due to cache lagging, it just sees snapshot 1 in its demoted state, snapshot 2 hasn't appeared yet -> the controller marks snapshot 1 as latest again -> the system is trapped in an erred state with two latest snapshots
case 2 is harmful; the consumer side needs to handle this by doing its own sorting/ranking, and the producer side needs to fix the situation in later reconciliations.
There was a problem hiding this comment.
And this is not a demote then create flow specific problem -> create then demote can have this as well. And this PR has fixed this by adding additional logic to clean up the history and setting the consumer to de-dup as needed.
But these occurrences are rare in nature -> if we bring object generation into the hash calculation the chances would get even more slim (to the brim of being impossible). Instead of having the cleanup logic and de-duppiing run often, I wonder if it would actually be easier to just:
a) on the consumer side, list with latest label selector, and back off if there are multiple latest snapshots -> wait for the producer side to fix things
b) on the producer side, hash with object generation embedded, and in very rare cases where multiple latest snapshots do appear, do the cleanup.
There was a problem hiding this comment.
Important to note that any change we make must guarantee that the new flow won't trigger rollouts on existing placements, so the hashing part might not be easily editable.
c452a28 to
0ff472d
Compare
The override controllers previously flipped is-latest-snapshot=false on the old snapshot before creating the new one. Because the rollout watcher only handles Create/Generic events (no UpdateFunc), a crash between flip and create left rollout un-enqueued until the next resync (6h hub-agent default) or parent-CR mutation, silently dropping overrides on member clusters. Reorder to create-first + flip-after, with a sibling audit and read-time dedup so transient duplicate latest=true is benign and self-healing: - Create new snapshot (latest=true) before flipping the old one to false. - On AlreadyExists, Get via UncachedReader and verify OverrideHash; on mismatch emit a Warning event on the parent CRO/RO and return ExpectedBehaviorError so retries don't carry stack traces. - Tolerate IsNotFound on the demote Update (concurrent prune / parent deletion) so the sibling audit still runs. - New cleanupStaleLatestSiblings helper flips any older sibling that still carries latest=true; runs on both the hash-match and new-snapshot paths. - Read-time dedup in FetchAllMatchingOverridesForResourceSnapshot picks the highest OverrideIndexLabel per parent (CRO key: tracking label; RO key: namespace + tracking label). Malformed labels log loudly and skip rather than blocking the rollout hot path. - removeExtraSnapshot now trims its in-memory list in place so the same list is safe to reuse for the audit step. Boy Scout: typo and misleading-comment cleanups in the touched test files. Signed-off-by: Yetkin Timocin <ytimocin@microsoft.com>
…ists tests - Wire a record.NewFakeRecorder into the CRO mismatched-hash test and assert the OverrideSnapshotHashMismatch Warning event is actually emitted (Copilot review kubefleet-dev#1). - Mirror both AlreadyExists branches (matching + mismatched hash) for the ResourceReconciler so the namespaced controller has parity coverage. The RO mismatched-hash test also asserts the Warning event on the parent RO. - Reword a stale comment in overrider_test.go that said the dedup path "errors" on malformed labels — it actually logs and skips (Copilot review kubefleet-dev#2). Signed-off-by: Yetkin Timocin <ytimocin@microsoft.com>
Add CRO and RO integration tests that exercise the hash-match short-circuit in ensureClusterResourceOverrideSnapshot / ensureResourceOverrideSnapshot. Pre-create a snapshot whose OverrideHash matches the parent's current spec hash; the controller must take the hash-match branch (no new snapshot, ensureSnapshotLatest re-flips, cleanupStaleLatestSiblings runs) instead of creating an extra snapshot. Lifts overrider package coverage from ~71% to ~73% by hitting the hash-match branch that the existing AlreadyExists tests bypassed. Signed-off-by: Yetkin Timocin <ytimocin@microsoft.com>
- Reword the AlreadyExists comment in both controllers to drop the loose "etcd restore from backup" example (a full etcd restore brings the CRO and snapshots back together consistently, so it's not the realistic trigger). Name the actual triggers explicitly: a manual edit of the snapshot or a hash-function change between controller versions. - On hash mismatch, delete the existing snapshot before returning the error. The override snapshot is owned by the CRO/RO and the reconciler's job is to drive observed → desired state; if the existing snapshot's content doesn't match the spec, deleting it so the next reconcile recreates it cleanly is more correct than letting stale content persist. The Warning event message is updated to reflect this recovery. - Document why we trust the stored OverrideHash field rather than recomputing from existing.Spec.OverrideSpec. The API server applies CRD defaulting on Create (selectionScope, scope, overrideType), so the read-back spec differs from the writer's input and a recompute would never match the original-intent hash. - Reword the dedupLatestSnapshots sort comment. Within a single call all items share scope (one keyOf, one type), so the namespace tier is just a defensive tie-breaker — not a mixed-scope handler as the previous wording implied. - Tighten the rest of the new comments throughout the touched files. Tests: the CRO + RO mismatched-hash integration tests now construct a genuinely divergent spec and assert the bad snapshot is deleted via Eventually(IsNotFound). Signed-off-by: Yetkin Timocin <ytimocin@microsoft.com>
Add focused unit tests using controller-runtime fake client + Delete interceptors to exercise the three Delete branches in the AlreadyExists recovery path that envtest can't reach reliably: - Delete succeeds → ExpectedBehaviorError (recovery proceeds) - Delete returns IsNotFound → swallowed, ExpectedBehaviorError - Delete returns non-NotFound error → APIServerError Same coverage for both CRO and RO controllers. Lifts overrider controller package coverage from ~73.6% to ~74.3%. Signed-off-by: Yetkin Timocin <ytimocin@microsoft.com>
Three small fixes to the unit tests added in the previous commit: - Bump copyright year on the new files from 2025 to 2026. - Mirror the CRO test's deleteCalls counter and assertion in the RO test so all three branches verify the controller actually attempted the Delete, not just the returned error sentinel. - Replace mutation of the global k8s.io/client-go/kubernetes/scheme with a fresh runtime.NewScheme() per test for safe test isolation (matches the pattern in pkg/controllers/workgenerator/override_test.go). Signed-off-by: Yetkin Timocin <ytimocin@microsoft.com>
Four small fixes following Go-hygiene review: - Fix CRO test asymmetry: the "non-NotFound delete error" case was missing wantSnapDelete: true, so the controller's Delete attempt on the error branch wasn't asserted. Now matches the RO mirror. - Add t.Parallel() to both Test* functions and their table-driven subtests; fresh fakeClient per case makes parallelism safe and satisfies the tparallel linter. - Rename ae* test variables (aeReconciler, aeCRO, aeCROName, aeRO, aeROName) to recovery* in common_test.go for clarity. The "ae" abbreviation was opaque outside the immediate Context block. - Drop the r.recorder != nil guard in both controllers' hash-mismatch branch. The recorder is wired in SetupWithManager and tests that reach this path always set it; the guard was inconsistent with the rest of the package (workgenerator, workapplier all call r.recorder.Eventf directly). Signed-off-by: Yetkin Timocin <ytimocin@microsoft.com>
0ff472d to
de99cf5
Compare
Description of your changes
When a
ClusterResourceOverrideorResourceOverrideis updated, the controller previously flippedis-latest-snapshot=falseon the old snapshot before creating the new one. The rollout snapshot watcher only handlesCreate/Genericevents (noUpdateFunc), so a crash between flip and create left rollout un-enqueued until the next resync (6 h hub-agent default) or parent-CR mutation, andFetchAllMatchingOverridesForResourceSnapshotwould silently drop the affected override.This PR reorders the controller to create-first, demote-after, with two additional safety steps so transient duplicate
latest=trueis benign:Created withlatest=truefirst; the previous snapshot is flipped tofalseafterwards.AlreadyExists, the existing snapshot is fetched viamgr.GetAPIReader()(uncached) and itsOverrideHashverified before proceeding. Mismatches surface as aWarningevent on the parent CRO/RO and anExpectedBehaviorError(no stack trace, requeues for transient causes like etcd restore).cleanupStaleLatestSiblingshelper flips any older sibling that still carrieslatest=true. Runs on both the hash-match and new-snapshot paths.FetchAllMatchingOverridesForResourceSnapshotdeduplicates at read time, picking the highestOverrideIndexLabelper parent (CRO key: tracking label; RO key: namespace + tracking label). Malformed labels log loudly and skip rather than blocking the entire rollout hot path.IsNotFoundon the demoteUpdateis tolerated so the sibling audit still runs after a concurrent prune or parent deletion.Fixes #
I have:
make reviewableto ensure this PR is ready for review.How has this code been tested
pkg/utils/overrider/overrider_test.gocover the generic dedup helper (CRO and RO, including the namespace-collision and malformed-label cases) and aFetchAllMatchingOverridesForResourceSnapshotintegration case with twolatest=truesnapshots for the same parent.pkg/controllers/overrider/common_test.gocover the sibling-audit helper (steady-state no-op, post-crash recovery) and theAlreadyExistsrecovery branch (matching hash → demote proceeds; mismatched hash →ExpectedBehaviorError+ Warning event, no demote).pkg/utils/overriderunit tests pass.Special notes for your reviewer
Placement == nilrollout watcher skip (pkg/controllers/rollout/controller.go:849, :882) and the same flip-then-create antipattern inpkg/utils/controller/resource_snapshot_resolver.goare left out of scope; both deserve their own follow-up issues.status.latestSnapshotNamepointer on the parent CRO/RO so labels stop being load-bearing for "current" semantics; that would be a separate, larger change.