Skip to content

fix: close override snapshot transition race on update#691

Open
Yetkin Timocin (ytimocin) wants to merge 7 commits into
kubefleet-dev:mainfrom
ytimocin:fix/override-snapshot-transition-race
Open

fix: close override snapshot transition race on update#691
Yetkin Timocin (ytimocin) wants to merge 7 commits into
kubefleet-dev:mainfrom
ytimocin:fix/override-snapshot-transition-race

Conversation

@ytimocin

Copy link
Copy Markdown
Collaborator

Description of your changes

When a ClusterResourceOverride or ResourceOverride is updated, the controller previously flipped is-latest-snapshot=false on the old snapshot before creating the new one. The rollout snapshot watcher only handles Create/Generic events (no UpdateFunc), so a crash between flip and create left rollout un-enqueued until the next resync (6 h hub-agent default) or parent-CR mutation, and FetchAllMatchingOverridesForResourceSnapshot would silently drop the affected override.

This PR reorders the controller to create-first, demote-after, with two additional safety steps so transient duplicate latest=true is benign:

  • New snapshot is Created with latest=true first; the previous snapshot is flipped to false afterwards.
  • On AlreadyExists, the existing snapshot is fetched via mgr.GetAPIReader() (uncached) and its OverrideHash verified before proceeding. Mismatches surface as a Warning event on the parent CRO/RO and an ExpectedBehaviorError (no stack trace, requeues for transient causes like etcd restore).
  • A new cleanupStaleLatestSiblings helper flips any older sibling that still carries latest=true. Runs on both the hash-match and new-snapshot paths.
  • FetchAllMatchingOverridesForResourceSnapshot deduplicates at read time, picking the highest OverrideIndexLabel per parent (CRO key: tracking label; RO key: namespace + tracking label). Malformed labels log loudly and skip rather than blocking the entire rollout hot path.
  • IsNotFound on the demote Update is tolerated so the sibling audit still runs after a concurrent prune or parent deletion.

Fixes #

I have:

  • Associated this change with a known KubeFleet Issue (Bug, Feature, etc).
  • Run make reviewable to ensure this PR is ready for review.

How has this code been tested

  • New unit tests in pkg/utils/overrider/overrider_test.go cover the generic dedup helper (CRO and RO, including the namespace-collision and malformed-label cases) and a FetchAllMatchingOverridesForResourceSnapshot integration case with two latest=true snapshots for the same parent.
  • New integration tests in pkg/controllers/overrider/common_test.go cover the sibling-audit helper (steady-state no-op, post-crash recovery) and the AlreadyExists recovery branch (matching hash → demote proceeds; mismatched hash → ExpectedBehaviorError + Warning event, no demote).
  • All existing override controller integration tests (envtest) and pkg/utils/overrider unit tests pass.

Special notes for your reviewer

  • Placement == nil rollout watcher skip (pkg/controllers/rollout/controller.go:849, :882) and the same flip-then-create antipattern in pkg/utils/controller/resource_snapshot_resolver.go are left out of scope; both deserve their own follow-up issues.
  • Long-term we should consider migrating to a status.latestSnapshotName pointer on the parent CRO/RO so labels stop being load-bearing for "current" semantics; that would be a separate, larger change.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a race in the override controllers’ snapshot “latest” transition by switching from demote-old-then-create-new to create-new-first-then-demote-old, preventing a crash window where no snapshot is labeled is-latest-snapshot=true and rollout would not be enqueued until a long resync. To make the brief “double latest=true” window safe, it adds read-time deduplication in the overrider hot path and introduces a sibling-audit helper to clean up stale latest=true labels.

Changes:

  • Reordered CRO/RO snapshot updates to create-first, demote-after, including an AlreadyExists recovery path that verifies snapshot hash via an uncached reader and emits a Warning event on mismatch.
  • Added cleanupStaleLatestSiblings to flip stale latest=true labels on older snapshots, and added read-time dedup in FetchAllMatchingOverridesForResourceSnapshot.
  • Added/updated unit + integration tests for dedup logic and sibling cleanup behaviors; updated controller wiring to pass an uncached reader.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
pkg/utils/overrider/overrider.go Deduplicates “latest” override snapshots at read time to tolerate transient duplicates.
pkg/utils/overrider/overrider_test.go Adds unit tests for generic dedup + a fetch test covering duplicate latest snapshots.
pkg/controllers/overrider/common.go Adds UncachedReader, event recorder field, snapshot pruning slice trimming, and sibling cleanup helper.
pkg/controllers/overrider/common_test.go Adds integration tests for sibling cleanup and AlreadyExists recovery branches.
pkg/controllers/overrider/clusterresource_controller.go Implements create-first/demote-after ordering + AlreadyExists hash verification + event emission.
pkg/controllers/overrider/resource_controller.go Same ordering + recovery logic for namespaced overrides; sets up recorder.
pkg/controllers/overrider/suite_test.go Wires UncachedReader into envtest manager setup.
cmd/hubagent/workload/setup.go Wires UncachedReader into hub-agent override controller setup.

Comment thread pkg/controllers/overrider/common_test.go Outdated
Comment thread pkg/utils/overrider/overrider_test.go Outdated
@codecov

codecov Bot commented May 5, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 73.10345% with 39 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
pkg/controllers/overrider/resource_controller.go 57.50% 13 Missing and 4 partials ⚠️
...ontrollers/overrider/clusterresource_controller.go 62.16% 10 Missing and 4 partials ⚠️
pkg/controllers/overrider/common.go 71.42% 7 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

@ytimocin

Copy link
Copy Markdown
Collaborator Author

E2E validation on Kind clusters

Ran four targeted edge-case tests against this branch on a fresh make setup-clusters (hub + 3 members).

Setup

  • make setup-clusters MEMBER_CLUSTER_COUNT=3 (fresh build, hub-agent rev cc4657694)
  • make join-members
  • 4 Kind clusters on Apple Silicon (Mac), envtest already covers the in-process unit/integration paths

Edge cases

# Test Verifies Result
1 Hub-agent restart mid-reconcile Recovery from a crash during snapshot churn (the original bug scenario) ✅ PASS
2 10 concurrent CRO updates Workqueue contention doesn't break invariants ✅ PASS
3 Inject duplicate latest=true sibling cleanupStaleLatestSiblings audit fires on next reconcile ✅ PASS
4 RO same name in two namespaces (namespace, OverrideTrackingLabel) dedup key isolates parents ✅ PASS

Sample results

Test 1 — after force-killing all hub-agent pods (--grace-period=0 --force) during 3 rapid CRO updates:

NAME                       LATEST   INDEX
edge-restart-cro-53101-0   false    0
edge-restart-cro-53101-1   false    1
edge-restart-cro-53101-2   true     2

Exactly one latest=true at the highest index — the create-first ordering recovered cleanly.

Test 2 — 10 patches submitted in parallel via &:

edge-concurrent-cro-53313-1  false  1
edge-concurrent-cro-53313-10 true   10   ← single latest
edge-concurrent-cro-53313-2  false  2
…
edge-concurrent-cro-53313-9  false  9

Test 3 — manually re-flipped snapshot 0 to latest=true to simulate a post-crash duplicate, then patched the CRO. Hub-agent log:

overrider/clusterresource_controller.go:187  "Created new overrideSnapshot"  newOverrideSnapshot="edge-dup-cro-53784-2"
overrider/clusterresource_controller.go:203  "Marked previous overrideSnapshot as inactive"  overrideSnapshot="edge-dup-cro-53784-1"
overrider/common.go:198                       "Flipped stale latest sibling to false"        overrideSnapshot="edge-dup-cro-53784-0"

The new audit helper fired exactly once on the injected duplicate.

Test 4 — two ResourceOverrides named edge-ro-shared-… in two different namespaces. Updating one produced index-1 in that namespace only; the other namespace's RO snapshot stayed at index 0.

Hub-agent log scan across all four runs

  • Created new overrideSnapshot: 19 occurrences (create-first ordering live)
  • Marked previous overrideSnapshot as inactive: 14 occurrences (demote-after live)
  • Flipped stale latest sibling to false: 1 occurrence (audit helper from test 3)
  • OverrideSnapshotHashMismatch events: 0
  • recovering from prior partial reconcile: 0
  • Failed to create new overrideSnapshot: 0
  • Failed to set the isLatestSnapshot label to false: 0
  • panics: 0

Side observation

The override controller uses predicate.GenerationChangedPredicate{}, so reconcile only fires on spec changes — not annotation/label edits. This means a duplicate latest=true left behind by a crash will persist until either (a) the next CRO/RO spec change, or (b) the informer's full-list resync (6 h hub-agent default) emits a synthetic Create event. Read-time dedup in FetchAllMatchingOverridesForResourceSnapshot masks this in the meantime, so callers always see the correct authoritative snapshot — just worth noting that the on-disk cleanup is gated on the next reconcile, not wall-clock immediate.

Baseline make e2e-tests caveat

The full suite under -p (9 parallel workers vs 4 Kind clusters on a single Mac) produced ~99 failures distributed uniformly across all placement test types. Failures were 10-second Eventually() timeouts — environmental, not regressions. No override-controller errors in hub-agent logs during the run. Should not affect CI/dedicated runners with proper resource isolation.

}
// AlreadyExists here normally means a prior reconcile succeeded the Create but failed to
// flip the old snapshot. Verify the existing object's hash before proceeding so we don't
// silently promote stale content (etcd restore from backup, manual edit, future hash bug).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have a question: why would "etcd restore from backup" result in AlreadyExists path?

What does "future hash bug" mean? Does it mean hash function change?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the 1st question, I was trying to be extra defensive by thinking about a partial restore. And, for the 2nd question, yes, I was thinking about a hash function change in between versions.

klog.ErrorS(getErr, "Failed to get existing overrideSnapshot for hash verification", "resourceOverride", roKObj, "newOverrideSnapshot", klog.KObj(newSnapshot))
return controller.NewAPIServerError(false, getErr)
}
if string(existing.Spec.OverrideHash) != overrideSpecHash {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can recompute overrideHash of existing to avoid situations where someone modifies the overrideHash?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

API Server applies defaults on Create, and, that is why, the hashes would be different between Create and re-hash of a read-back object.

The defaults come from kubebuilder:default markers on the API types — that's what makes the original and read-back objects differ.

Comment thread pkg/controllers/overrider/resource_controller.go Outdated
// The content has not changed; no new snapshot is needed. Ensure the highest-index
// snapshot is marked latest, then audit siblings to clean up any duplicate latest=true
// left by a prior partial run.
if err := r.ensureSnapshotLatest(ctx, latestSnapshot); err != nil {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When would we need to mark the snapshot as latest?

If we always create a new snapshot with is-latest = true, the latest snapshot already has is-latest label set to true even if a crash happened right after a new snapshot was created?

Want to understand the scenario that you have in mind so that we can think about how to handle.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ensureSnapshotLatest was already here before this PR — I just kept calling it from the hash-match path as defensive insurance against out-of-band label edits. It's a no-op in steady state. Want me to drop it as part of this PR, or leave it alone since it's pre-existing?

Comment thread pkg/utils/overrider/overrider.go Outdated
@ytimocin Yetkin Timocin (ytimocin) force-pushed the fix/override-snapshot-transition-race branch 2 times, most recently from a01c978 to 203995e Compare May 6, 2026 16:47

@michaelawyu michaelawyu (michaelawyu) left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Yetkin! I added some comments; would be really happy to discuss about this issue further.

if len(snapshotList.Items) != 0 {
// convert the last unstructured snapshot to the typed object
latestSnapshot := &placementv1beta1.ClusterResourceOverrideSnapshot{}
latestSnapshot = &placementv1beta1.ClusterResourceOverrideSnapshot{}

@michaelawyu michaelawyu (michaelawyu) May 8, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Yetkin! Thanks so much for this. It is a very interesting PR and I know that such fixes take a long while and a lot of efforts to complete. It is just that, I couldn't help feeling that given the nature of the bug, i.e., rollout controller failing to catch the state change where a resource snapshot is marked as latest again, it can just be easily fixed by tweaking the handler funcs in the rollout controller a bit:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The system just need an UpdateFunc that enqueues placement if it sees that for a snapshot update event, (e.ObjectOld.GetLabels()[isLatestLabelName] != e.ObjectNew.GetLabels()[isLatestLabelName]) && e.ObjectNew.GetLabels()[isLatestLabelName] == "true" (very roughly written).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(There is actually a bit more to this bug -> please see some of the other comments, happy to discuss about this a bit more and make sure that I understand it correctly)

klog.ErrorS(err, "Failed to set the isLatestSnapshot label to false", "clusterResourceOverride", croKObj, "overrideSnapshot", klog.KObj(latestSnapshot))
return controller.NewUpdateIgnoreConflictError(err)
// Hash matches: re-mark this snapshot latest (idempotent) and audit siblings.
if err := r.ensureSnapshotLatest(ctx, latestSnapshot); err != nil {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part is a bit interesting I think. If we do switch to the create-then-demote model, this part is a bit redundant (it is either always a no-op or always fail due to conflicts). And it cannot be reliably used to correct unexpected cases either (e.g., a latest label is set to false unexpectedly), as it would also trigger the issue we want to fix, i.e., the rollout controller missing the update.

"existing snapshot %s has different content than the current spec; deleting it for the next reconcile to recreate. If this persists, investigate the snapshot for manual edits or a hash-function change between controller versions.", newSnapshot.Name)
// Drive observed → desired: delete the mismatched snapshot so the next reconcile
// recreates it with the correct content.
if delErr := r.Client.Delete(ctx, existing); delErr != nil && !errors.IsNotFound(delErr) {

@michaelawyu michaelawyu (michaelawyu) May 8, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This deletion part is a bit problematic I fear.

Aside from the semantic concerns (the snapshots should have monotonically, continuous indexes), the snapshot we would like to delete might have already been referenced in some bindings.

@michaelawyu michaelawyu (michaelawyu) May 8, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might lead to the work generator getting stuck in a loop of errors when processing some bindings as it can no longer find a referenced snapshot. As our rollout (mostly rolling update) checks for Applied/Availability info, it might also lead a rollout to be stuck.

// cycle until the snapshot's labels are repaired.
//
// PT is *T constrained to client.Object so we can extract metadata and labels without copying.
func dedupLatestSnapshots[T any, PT interface {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Yetkin! This part is particularly interesting I think.

One of the major benefits that the demote then create pattern yields is that, it guarantees that at any given point of watches (with the cache lagging or not), the consumer (whoever needs to read/get notified on the appearance of a latest snapshot) will either see:

a) there is no latest snapshot -> which means the system observes an intermediate state and can simply wait for the next event; or
b) there is exactly one latest snapshot.

So the consumer can just do a list op with a label selector, and decide what to do based on the number of items returned. No additional sorting/ranking is needed.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e.g,

0 (latest) - | -> 0 (demoted) - | -> 1 (latest) - | -> 1 (demoted) - | -> 2 (latest) - | -> ...

| = time point where the consumer reads data

@michaelawyu michaelawyu (michaelawyu) May 8, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But with the create then demote pattern,

e.g,

0 (latest) - | -> 1 (latest) - | -> 0 (demoted) - | -> 2 (latest) - | -> 1 (demoted) - | -> ...

Roughly (very roughly) 50% of the time, the consumer will see that there are multiple (2) latest snapshots. And it must do the sorting/ranking to determine the actual latest snapshot (or it can just back off until the situation is resolved by the snapshotting controller). The correctness part, as Yetkin you have acutely catches in this part of the code, can still be guaranteed, but it kinds of defeats the purpose of having the latest label. As the source of truth always lies in the monotonically increasing indexes, the consumer can simply just list every snapshot relevant and do the sorting/ranking every time to find the latest snapshot -> it's more costly, but more straightforward.

// Compare stored OverrideHash, not a re-hash of existing.Spec.OverrideSpec: API-server
// defaulting on Create (selectionScope, scope, overrideType) makes the read-back spec
// differ from the input, so a recompute would never match.
if string(existing.Spec.OverrideHash) != overrideSpecHash {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part is particularly tricky I assume (the flow itself, not specific to this PR)...

Generally speaking, if the user makes different edits each time (A -> B -> C -> ...) and we do not consider hash collision an issue, then this branch will never run and the issue of a past snapshot being marked as latest again will never occur.

The problem will only happen when the user does an A -> B -> A fashion of edits:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case of a demote then create flow, there are two possibilities:

case 1 =>

there is a snapshot 1 with hash A -> the user edits overrides that has hash B -> the controller is triggered -> the controller has snapshot 1 demoted, but failed to create snapshot 2 with hash B -> the user edits overrides back to the state of hash A -> the controller is triggered again and marked snapshot 1 as latest

case 1 is harmless, as it still aligns with our eventually consistent model; the system is just behaving as if the edit of hash B never happened. There is no rollout needed either.

case 2 =>

there is a snapshot 1 with hash A -> the user edits overrides that has hash B -> the controller is triggered -> the controller has snapshot 1 demoted, and created snapshot 2 with hash B -> the user edits overrides back to the state of hash A -> the controller is triggered again, but due to cache lagging, it just sees snapshot 1 in its demoted state, snapshot 2 hasn't appeared yet -> the controller marks snapshot 1 as latest again -> the system is trapped in an erred state with two latest snapshots

case 2 is harmful; the consumer side needs to handle this by doing its own sorting/ranking, and the producer side needs to fix the situation in later reconciliations.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this is not a demote then create flow specific problem -> create then demote can have this as well. And this PR has fixed this by adding additional logic to clean up the history and setting the consumer to de-dup as needed.

But these occurrences are rare in nature -> if we bring object generation into the hash calculation the chances would get even more slim (to the brim of being impossible). Instead of having the cleanup logic and de-duppiing run often, I wonder if it would actually be easier to just:

a) on the consumer side, list with latest label selector, and back off if there are multiple latest snapshots -> wait for the producer side to fix things
b) on the producer side, hash with object generation embedded, and in very rare cases where multiple latest snapshots do appear, do the cleanup.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important to note that any change we make must guarantee that the new flow won't trigger rollouts on existing placements, so the hashing part might not be easily editable.

@ytimocin Yetkin Timocin (ytimocin) force-pushed the fix/override-snapshot-transition-race branch from c452a28 to 0ff472d Compare May 9, 2026 02:35
The override controllers previously flipped is-latest-snapshot=false on the
old snapshot before creating the new one. Because the rollout watcher only
handles Create/Generic events (no UpdateFunc), a crash between flip and
create left rollout un-enqueued until the next resync (6h hub-agent default)
or parent-CR mutation, silently dropping overrides on member clusters.

Reorder to create-first + flip-after, with a sibling audit and read-time
dedup so transient duplicate latest=true is benign and self-healing:

- Create new snapshot (latest=true) before flipping the old one to false.
- On AlreadyExists, Get via UncachedReader and verify OverrideHash; on
  mismatch emit a Warning event on the parent CRO/RO and return
  ExpectedBehaviorError so retries don't carry stack traces.
- Tolerate IsNotFound on the demote Update (concurrent prune / parent
  deletion) so the sibling audit still runs.
- New cleanupStaleLatestSiblings helper flips any older sibling that still
  carries latest=true; runs on both the hash-match and new-snapshot paths.
- Read-time dedup in FetchAllMatchingOverridesForResourceSnapshot picks the
  highest OverrideIndexLabel per parent (CRO key: tracking label;
  RO key: namespace + tracking label). Malformed labels log loudly and skip
  rather than blocking the rollout hot path.
- removeExtraSnapshot now trims its in-memory list in place so the same
  list is safe to reuse for the audit step.

Boy Scout: typo and misleading-comment cleanups in the touched test files.

Signed-off-by: Yetkin Timocin <ytimocin@microsoft.com>
…ists tests

- Wire a record.NewFakeRecorder into the CRO mismatched-hash test and assert
  the OverrideSnapshotHashMismatch Warning event is actually emitted
  (Copilot review kubefleet-dev#1).
- Mirror both AlreadyExists branches (matching + mismatched hash) for the
  ResourceReconciler so the namespaced controller has parity coverage. The
  RO mismatched-hash test also asserts the Warning event on the parent RO.
- Reword a stale comment in overrider_test.go that said the dedup path
  "errors" on malformed labels — it actually logs and skips (Copilot
  review kubefleet-dev#2).

Signed-off-by: Yetkin Timocin <ytimocin@microsoft.com>
Add CRO and RO integration tests that exercise the hash-match short-circuit
in ensureClusterResourceOverrideSnapshot / ensureResourceOverrideSnapshot.
Pre-create a snapshot whose OverrideHash matches the parent's current spec
hash; the controller must take the hash-match branch (no new snapshot,
ensureSnapshotLatest re-flips, cleanupStaleLatestSiblings runs) instead of
creating an extra snapshot.

Lifts overrider package coverage from ~71% to ~73% by hitting the hash-match
branch that the existing AlreadyExists tests bypassed.

Signed-off-by: Yetkin Timocin <ytimocin@microsoft.com>
- Reword the AlreadyExists comment in both controllers to drop the
  loose "etcd restore from backup" example (a full etcd restore brings
  the CRO and snapshots back together consistently, so it's not the
  realistic trigger). Name the actual triggers explicitly: a manual
  edit of the snapshot or a hash-function change between controller
  versions.

- On hash mismatch, delete the existing snapshot before returning the
  error. The override snapshot is owned by the CRO/RO and the
  reconciler's job is to drive observed → desired state; if the
  existing snapshot's content doesn't match the spec, deleting it so
  the next reconcile recreates it cleanly is more correct than letting
  stale content persist. The Warning event message is updated to
  reflect this recovery.

- Document why we trust the stored OverrideHash field rather than
  recomputing from existing.Spec.OverrideSpec. The API server applies
  CRD defaulting on Create (selectionScope, scope, overrideType), so
  the read-back spec differs from the writer's input and a recompute
  would never match the original-intent hash.

- Reword the dedupLatestSnapshots sort comment. Within a single call
  all items share scope (one keyOf, one type), so the namespace tier
  is just a defensive tie-breaker — not a mixed-scope handler as the
  previous wording implied.

- Tighten the rest of the new comments throughout the touched files.

Tests: the CRO + RO mismatched-hash integration tests now construct a
genuinely divergent spec and assert the bad snapshot is deleted via
Eventually(IsNotFound).

Signed-off-by: Yetkin Timocin <ytimocin@microsoft.com>
Add focused unit tests using controller-runtime fake client + Delete
interceptors to exercise the three Delete branches in the AlreadyExists
recovery path that envtest can't reach reliably:

- Delete succeeds → ExpectedBehaviorError (recovery proceeds)
- Delete returns IsNotFound → swallowed, ExpectedBehaviorError
- Delete returns non-NotFound error → APIServerError

Same coverage for both CRO and RO controllers. Lifts overrider
controller package coverage from ~73.6% to ~74.3%.

Signed-off-by: Yetkin Timocin <ytimocin@microsoft.com>
Three small fixes to the unit tests added in the previous commit:

- Bump copyright year on the new files from 2025 to 2026.
- Mirror the CRO test's deleteCalls counter and assertion in the RO
  test so all three branches verify the controller actually attempted
  the Delete, not just the returned error sentinel.
- Replace mutation of the global k8s.io/client-go/kubernetes/scheme
  with a fresh runtime.NewScheme() per test for safe test isolation
  (matches the pattern in pkg/controllers/workgenerator/override_test.go).

Signed-off-by: Yetkin Timocin <ytimocin@microsoft.com>
Four small fixes following Go-hygiene review:

- Fix CRO test asymmetry: the "non-NotFound delete error" case was
  missing wantSnapDelete: true, so the controller's Delete attempt on
  the error branch wasn't asserted. Now matches the RO mirror.
- Add t.Parallel() to both Test* functions and their table-driven
  subtests; fresh fakeClient per case makes parallelism safe and
  satisfies the tparallel linter.
- Rename ae* test variables (aeReconciler, aeCRO, aeCROName, aeRO,
  aeROName) to recovery* in common_test.go for clarity. The "ae"
  abbreviation was opaque outside the immediate Context block.
- Drop the r.recorder != nil guard in both controllers' hash-mismatch
  branch. The recorder is wired in SetupWithManager and tests that
  reach this path always set it; the guard was inconsistent with
  the rest of the package (workgenerator, workapplier all call
  r.recorder.Eventf directly).

Signed-off-by: Yetkin Timocin <ytimocin@microsoft.com>
@ytimocin Yetkin Timocin (ytimocin) force-pushed the fix/override-snapshot-transition-race branch from 0ff472d to de99cf5 Compare June 2, 2026 17:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants