
fix(experiment): propagate fresh ResourceVersion after spec rollback#2930

Open
tbavelier wants to merge 6 commits into main from tbavelier/fix-experiment-test-flakes

Conversation

Member

@tbavelier tbavelier commented Apr 24, 2026

What does this PR do?

Persists the terminal experiment phase (timeout/rollback) in the same reconcile as the spec rollback, via a targeted Status().Patch() guarded by a JSON patch test op on status.experiment.phase. This eliminates a cross-reconcile race where the terminal phase could be silently dropped, without compromising concurrent-write safety or crash-recovery of the rollback annotation.

Motivation

Test_Experiment_StateTransitions subtests (rollback_then_timeout, promoted_then_timeout, with/without stale_old_revision) were flaking on CI — failing with expected=timeout, actual=running at controller_experiment_integration_test.go:947. Reproduced locally (5+ failures in 20 runs).

Root cause: reconcile 1 of the timeout path updated the spec (bumping its ResourceVersion), but the full-status write used the stale RV → 409 → phase=running persisted. On reconcile 2, the refetched spec already matched a freshly-timestamped, rollback-annotated revision, so findMostRecentMatchingRevision returned it, elapsed < timeout held, and the terminal phase was never written.
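The race above can be modeled in a few lines. This is an illustrative stdlib-only sketch (the server, updateSpec, and updateStatus names are hypothetical stand-ins, not the operator's real types): a spec Update bumps the server-side ResourceVersion, so a status write still carrying the pre-Update RV conflicts and the terminal phase is dropped.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical in-memory model of the race: the server bumps ResourceVersion
// on every write and rejects writes carrying a stale RV with a 409-style
// conflict, mirroring Kubernetes optimistic concurrency.
type server struct {
	rv    int
	phase string
}

var errConflict = errors.New("409: resourceVersion conflict")

func (s *server) updateSpec(clientRV int) (newRV int, err error) {
	if clientRV != s.rv {
		return 0, errConflict
	}
	s.rv++ // spec Update bumps the RV
	return s.rv, nil
}

func (s *server) updateStatus(clientRV int, phase string) error {
	if clientRV != s.rv {
		return errConflict // stale RV: the terminal phase write is dropped
	}
	s.rv++
	s.phase = phase
	return nil
}

func main() {
	srv := &server{rv: 1, phase: "running"}
	cachedRV := srv.rv // reconcile reads the object once

	// Reconcile 1: the spec rollback succeeds and bumps the RV...
	if _, err := srv.updateSpec(cachedRV); err != nil {
		panic(err)
	}
	// ...but the in-memory object still carries the pre-Update RV, so the
	// full-status write conflicts and phase=timeout is lost.
	err := srv.updateStatus(cachedRV, "timeout")
	fmt.Println(err, srv.phase)
}
```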

Fix

restorePreviousSpec now writes the terminal phase via a targeted status subresource JSON patch (patchExperimentPhase) after the spec rollback. The patch is narrow (only status.experiment.phase) and carries a test op that asserts the server's phase still matches the phase the rollback decision was based on. Four safety properties are preserved:

  1. Lands in the same reconcile: the targeted patch bypasses the RV bump from the spec Update, so phase=timeout/phase=rollback persists without waiting for a follow-up reconcile.
  2. Concurrent phase writes survive: if the RC daemon (e.g. pkg/fleet/daemon.go:354 writing phase=promoted) moves the phase between our reconcile fetch and our patch, the test op fails and the patch is rejected — our stale decision does not clobber their accepted state. Mirrors the 409-and-retry semantics of a full-status Update.
  3. Concurrent writes to other status fields survive: the patch is applied to an instance.DeepCopy() scratch so the client's response decoding does not advance the caller's instance.ResourceVersion. The downstream full-status write in updateStatusIfNeededV2 therefore retains its own 409 protection for every other status field.
  4. Crash-safe annotation ordering: annotateRevision runs BEFORE patchExperimentPhase. If the operator exits between the two writes, the phase is still non-terminal on restart, handleRollback re-fires, annotateRevision is idempotent, and both writes eventually succeed. Reversing this order would leave a terminal phase without the rollback annotation — and since handleRollback ignores terminal phases, the annotation would never be applied, causing a later re-apply of the same spec to fire an immediate false timeout on the stale CreationTimestamp.

The patch is best-effort — any error (test op failure or transient) is logged at V(1) and the next reconcile re-computes and retries.
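For illustration, a minimal sketch of the guarded patch construction described above (buildPhasePatch is a hypothetical helper; the PR's actual method is patchExperimentPhase, and sending the body via Status().Patch with a JSON-patch type is omitted here):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildPhasePatch assembles the two-op JSON patch: a `test` op asserting the
// server's current phase still matches the phase the rollback decision was
// based on, then a `replace` writing the terminal phase. If a concurrent
// writer moved the phase, the test op fails and the whole patch is rejected.
func buildPhasePatch(expectedCurrent, newPhase string) []byte {
	return fmt.Appendf(nil,
		`[{"op":"test","path":"/status/experiment/phase","value":%q},`+
			`{"op":"replace","path":"/status/experiment/phase","value":%q}]`,
		expectedCurrent, newPhase)
}

func main() {
	patch := buildPhasePatch("running", "timeout")

	// The body must be valid JSON: two ops against the same path.
	var ops []map[string]string
	if err := json.Unmarshal(patch, &ops); err != nil {
		panic(err)
	}
	fmt.Println(ops[0]["op"], ops[1]["op"])
}
```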

Additional Notes

  • restorePreviousSpec signature change: now takes *v2alpha1.DatadogAgent instead of metav1.ObjectMeta (needed for Status().Patch()).
  • rollback signature unchanged (metav1.ObjectMeta).
  • TestRestorePreviousSpec_PhaseSetOnlyOnSuccess was updated to populate instance.Status.Experiment (the new observed-phase capture in restorePreviousSpec dereferences it — safe in production because handleRollback guards against nil before delegating, but needed for this direct unit-test call).
  • Regression guards added:
    • Test_Experiment_RollbackPersistsInSingleReconcile — asserts spec rollback and phase=timeout both persist in a single reconcile.
    • Test_Experiment_RollbackPreservesConcurrentStatusWrite — asserts restorePreviousSpec does not mutate the caller's ResourceVersion (bisected: reverting the scratch-copy fails this test).
    • Test_Experiment_RollbackDoesNotClobberConcurrentPhaseWrite — asserts a concurrent phase=promoted write survives a stale phase=rollback decision, AND that the rollback annotation still lands even when the phase patch is rejected (bisected: reverting to an unconditional merge patch fails the first assertion).

Minimum Agent Versions

Are there minimum versions of the Datadog Agent and/or Cluster Agent required?

  • Agent: n/a
  • Cluster Agent: n/a

Describe your test plan

  • Reproduced the flake: go test ./internal/controller/datadogagent/ -run Test_Experiment_StateTransitions -count=20 — multiple subtest failures before fix.
  • Validated after fix: Test_Experiment_StateTransitions + Test_Experiment_RollbackPersistsInSingleReconcile + Test_Experiment_RollbackPreservesConcurrentStatusWrite + Test_Experiment_RollbackDoesNotClobberConcurrentPhaseWrite -count=20 — all green.
  • Full experiment suite (Test_Experiment_|TestManageExperiment|TestHandleRollback|TestRollback|TestRestorePreviousSpec|TestReapplySameSpecAfterRollback) -count=10 — clean.
  • Full internal/controller/datadogagent package — clean.

Checklist

  • PR has at least one valid label: bug
  • PR has a milestone: v1.26.0 (also qa/skip-qa — test-infrastructure stability, no user-facing change)
  • All commits are signed

🤖 Generated with Claude Code

After a timeout/stopped rollback, the spec Update bumped the DDA
ResourceVersion but the caller's in-memory instance kept the stale RV,
causing the status write to 409 and defer the terminal phase to the
next reconcile. On that next reconcile the spec already matched a
freshly-timestamped, rollback-annotated revision, so findMostRecent
returned it, elapsed < timeout, and the terminal phase was never
persisted — a race that made integration tests flaky on CI.

restorePreviousSpec/rollback now take *DatadogAgent and update
instance.ResourceVersion after the client.Update (or match), so the
status write in the same reconcile succeeds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@tbavelier tbavelier added this to the v1.26.0 milestone Apr 24, 2026
@tbavelier tbavelier added the bug (Something isn't working) and qa/skip-qa labels Apr 24, 2026
@tbavelier tbavelier requested a review from a team April 24, 2026 08:18

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: acd2abafde

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

// Propagate the post-Update ResourceVersion so the downstream status write
// in the same reconcile uses the fresh RV and the terminal phase is
// persisted without requiring a follow-up reconcile.
instance.ResourceVersion = toUpdate.ResourceVersion

P2: Preserve status concurrency when refreshing the RV

When another status writer updates this DatadogAgent after the reconcile read but before this rollback update (for example RC changing .status.experiment), copying the post-update ResourceVersion onto the stale in-memory object makes updateStatusIfNeededV2 later deep-copy and Status().Update the old status without getting the conflict that would normally protect that concurrent change. Please merge the terminal experiment phase into the freshly fetched status or use a targeted status patch instead of only transplanting the RV.



@codecov-commenter

codecov-commenter commented Apr 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 40.71%. Comparing base (f11b43c) to head (0dadba1).

Additional details and impacted files


@@            Coverage Diff             @@
##             main    #2930      +/-   ##
==========================================
+ Coverage   40.69%   40.71%   +0.01%     
==========================================
  Files         321      321              
  Lines       28413    28420       +7     
==========================================
+ Hits        11563    11570       +7     
  Misses      16015    16015              
  Partials      835      835              
Flag Coverage Δ
unittests 40.71% <100.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
internal/controller/datadogagent/experiment.go 85.92% <100.00%> (+0.76%) ⬆️

Continue to review full report in Codecov by Sentry.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f11b43c...0dadba1. Read the comment docs.


tbavelier and others added 2 commits April 24, 2026 10:32
Replace the ResourceVersion transplant with a targeted status subresource
merge patch that only writes the terminal experiment phase. This
addresses the codex review finding: propagating the post-Update
ResourceVersion onto the caller's stale in-memory instance caused
updateStatusIfNeededV2 to silently overwrite any concurrent status
writes made between our initial fetch and our spec Update (e.g. RC
daemon updating status.experiment).

The targeted patch lands the terminal phase in the same reconcile as
the rollback (fixing the original flake) while leaving the full-status
write's 409 protection intact for every other status field. The patch
is best-effort — a transient failure just defers the phase write to
the next reconcile, same as before.

Adds Test_Experiment_RollbackPersistsInSingleReconcile regression guard
asserting both the spec rollback and phase=timeout persist in a single
reconcile.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rollback() no longer mutates ResourceVersion, so it doesn't need the
typed pointer — revert to the original metav1.ObjectMeta signature for
a more minimal diff. restorePreviousSpec keeps the *DatadogAgent
signature since it needs the typed object for Status().Patch().

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@tbavelier
Member Author

@codex review


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 851f223cbe


// reconcile will re-compute and retry.
func (r *Reconciler) patchExperimentPhase(ctx context.Context, instance *v2alpha1.DatadogAgent, phase v2alpha1.ExperimentPhase) {
patch := fmt.Appendf(nil, `{"status":{"experiment":{"phase":%q}}}`, phase)
if err := r.client.Status().Patch(ctx, instance, client.RawPatch(types.MergePatchType, patch)); err != nil {

P2: Avoid refreshing the status RV before the full status update

When this raw status patch succeeds it decodes the server response back into instance, advancing instance.ResourceVersion even though newStatus was computed from the object fetched at the start of reconcile. If Fleet Automation/RC updates any status field (for example a new experiment ID or phase) between that initial fetch and this patch, the later updateStatusIfNeededV2 full Status().Update will use the patched RV and replace status with stale newStatus instead of 409'ing as before, clobbering the concurrent status write. Patch into a scratch object or refetch/merge before the full status update so other status fields keep their conflict protection.


controller-runtime's Status().Patch decodes the server response back
into the passed object, advancing ResourceVersion. Calling it on the
caller's *DatadogAgent instance would let the subsequent full-status
write in updateStatusIfNeededV2 use the patched RV and silently
overwrite any status field that a concurrent writer (e.g. the RC
daemon) updated between our initial reconcile fetch and this patch.

Patch into a DeepCopy scratch so instance.ResourceVersion stays at the
value observed at reconcile start. The full-status write's 409
protection is retained, and the terminal experiment phase still lands
in the same reconcile via the targeted patch.

Adds Test_Experiment_RollbackPreservesConcurrentStatusWrite regression
guard asserting instance.ResourceVersion is not mutated by
restorePreviousSpec.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
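The scratch-copy trick from this commit can be sketched with toy types (obj and statusPatch are illustrative stand-ins, not controller-runtime types): the client decodes the server response back into whatever object it is handed, so patching a DeepCopy keeps the caller's ResourceVersion at its reconcile-start value.

```go
package main

import "fmt"

// Illustrative model of why the patch targets a DeepCopy.
type obj struct {
	ResourceVersion int
	Phase           string
}

func (o *obj) DeepCopy() *obj { c := *o; return &c }

// statusPatch stands in for Status().Patch: it applies the write server-side
// and decodes the post-patch state (including the fresh RV) back into the
// object passed to it.
func statusPatch(srv, into *obj, phase string) {
	srv.ResourceVersion++
	srv.Phase = phase
	*into = *srv
}

func main() {
	srv := &obj{ResourceVersion: 7, Phase: "running"}
	instance := srv.DeepCopy() // object fetched at reconcile start

	// Patch into a scratch copy so the caller's RV is not advanced; the
	// later full-status write keeps its 409 protection for other fields.
	scratch := instance.DeepCopy()
	statusPatch(srv, scratch, "timeout")

	fmt.Println(instance.ResourceVersion, srv.ResourceVersion)
}
```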
@tbavelier
Member Author

@codex review


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7b2aa3b359


func (r *Reconciler) patchExperimentPhase(ctx context.Context, instance *v2alpha1.DatadogAgent, phase v2alpha1.ExperimentPhase) {
patch := fmt.Appendf(nil, `{"status":{"experiment":{"phase":%q}}}`, phase)
scratch := instance.DeepCopy()
if err := r.client.Status().Patch(ctx, scratch, client.RawPatch(types.MergePatchType, patch)); err != nil {

P2: Do not overwrite concurrent experiment phase updates

When the RC daemon accepts a promote/stop after this reconcile fetched instance but before this patch runs, this raw merge patch still writes status.experiment.phase from the stale decision without any resourceVersion/phase precondition. For example, pkg/fleet/experiment.go allows promote from running and pkg/fleet/daemon.go persists phase=promoted; this patch can then change that accepted promotion back to timeout/rollback, whereas the previous full status update would have conflicted and retried. That can leave Fleet believing the experiment was promoted while the DDA status/spec reflect a rollback.


The raw merge patch wrote status.experiment.phase without any
precondition. If a concurrent writer (e.g. the RC daemon accepting a
promotion at pkg/fleet/daemon.go:354) moved the phase between the
reconcile's initial fetch and the restorePreviousSpec patch, the merge
patch would silently overwrite their accepted state with our stale
terminal-phase decision — potentially flipping an accepted promotion
back to timeout/rollback and leaving Fleet's view inconsistent with
the DDA.

Switch to a JSON patch with a `test` op that asserts the server's
current phase still matches the phase the rollback decision was based
on. If it doesn't, the test fails, the patch is rejected, and the
concurrent writer's state survives — mirroring the 409-and-retry
semantics of a full-status Update.

Adds Test_Experiment_RollbackDoesNotClobberConcurrentPhaseWrite
regression guard; fixes TestRestorePreviousSpec_PhaseSetOnlyOnSuccess
to populate instance.Status.Experiment (required by the new observed-
phase capture in restorePreviousSpec).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
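The test-op precondition this commit introduces can be modeled in miniature (applyGuardedPhasePatch is a hypothetical evaluator; a real API server evaluates the JSON patch atomically server-side): a stale terminal-phase decision is rejected when a concurrent writer has already moved the phase, while a matching precondition lets the replace land.

```go
package main

import (
	"errors"
	"fmt"
)

var errTestFailed = errors.New("json patch test op failed")

// applyGuardedPhasePatch models the two-op patch: the test op checks the
// server's current phase against the phase the rollback decision observed;
// only if it still matches does the replace op write the new phase.
func applyGuardedPhasePatch(serverPhase *string, expectedCurrent, newPhase string) error {
	if *serverPhase != expectedCurrent {
		return errTestFailed // a concurrent writer moved the phase; reject
	}
	*serverPhase = newPhase
	return nil
}

func main() {
	// The reconcile observed phase=running and decided on timeout, but the
	// RC daemon accepted a promotion in the meantime.
	phase := "promoted"
	err := applyGuardedPhasePatch(&phase, "running", "timeout")
	fmt.Println(err, phase) // the accepted promotion survives
}
```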
@tbavelier
Member Author

@codex review


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0deaf5e48d


return err
}
newStatus.Experiment.Phase = terminalPhase
r.patchExperimentPhase(ctx, instance, observedPhase, terminalPhase)

P2: Annotate rollback revision before persisting terminal phase

When the operator process exits after this patch succeeds but before the annotateRevision call below runs, the terminal phase has already been persisted. On restart, handleRollback ignores timeout/rollback, so the highest experiment revision never gets annotationExperimentRollback; re-applying the same spec later can hit the stale CreationTimestamp path and immediately time out. Persist the phase only after the annotation attempt, or make terminal-phase reconciles repair a missing rollback annotation.


The previous order was: patch terminal phase → annotate experiment
revision. If the operator crashed between these two writes, the
server held a terminal phase but the revision was unannotated. On
restart, handleRollback ignores terminal phases (the switch only
handles Stopped/Running), so the annotation would never be applied —
a later re-apply of the same spec would then hit the stale
CreationTimestamp path in findMostRecentMatchingRevision and fire an
immediate false timeout.

Swap to annotate-then-patch. If a crash happens between them now,
the phase is still non-terminal on restart, handleRollback re-fires,
annotateRevision is idempotent (early-returns when already annotated),
and patchExperimentPhase runs again. Eventually both writes succeed.

Also extends Test_Experiment_RollbackDoesNotClobberConcurrentPhaseWrite
to assert the annotation lands regardless of whether the phase patch
was rejected by the test op — the annotation is protection for future
re-applies and must not be gated on phase-write success.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
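The annotate-then-patch ordering argument can be sketched as a toy crash/restart model (state and rollbackStep are illustrative names): a crash between the two writes leaves the phase non-terminal, so the rollback path re-fires on restart, the idempotent annotate is a no-op the second time, and both writes eventually land.

```go
package main

import "fmt"

// Toy model of the crash-safe ordering.
type state struct {
	annotated bool
	phase     string
}

// rollbackStep runs one reconcile attempt; crashAfterAnnotate simulates the
// operator exiting between the annotation write and the phase patch.
func rollbackStep(s *state, crashAfterAnnotate bool) {
	if s.phase == "timeout" {
		return // terminal phase: rollback handling is skipped
	}
	if !s.annotated {
		s.annotated = true // annotateRevision (idempotent)
	}
	if crashAfterAnnotate {
		return
	}
	s.phase = "timeout" // patchExperimentPhase
}

func main() {
	s := &state{phase: "running"}
	rollbackStep(s, true)  // crash between the two writes
	rollbackStep(s, false) // restart: phase still non-terminal, re-fires
	fmt.Println(s.annotated, s.phase)
}
```

Note the reversed order would get stuck: a terminal phase without the annotation makes the first guard return early forever, which is exactly the failure mode the commit describes.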
@tbavelier
Member Author

@codex review


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0dadba1e3c


// Best-effort: any error here (test failure or transient) is logged at V(1)
// and the next reconcile will re-compute and retry.
func (r *Reconciler) patchExperimentPhase(ctx context.Context, instance *v2alpha1.DatadogAgent, expectedCurrent, newPhase v2alpha1.ExperimentPhase) {
patch := fmt.Appendf(nil, `[{"op":"test","path":"/status/experiment/phase","value":%q},{"op":"replace","path":"/status/experiment/phase","value":%q}]`, expectedCurrent, newPhase)

P2: Guard phase patch with experiment identity

When Fleet moves the same DDA through another transition that returns to the same phase before this patch runs—for example promoting the timed-out experiment and then starting a new one, which records a different status.experiment.id with phase=running—this test still passes because it only checks phase. The raw patch then marks the newer experiment timeout/rollback, whereas the previous RV-protected status update would have conflicted instead of overwriting a newer status; include the observed experiment ID (or the whole experiment status) in the patch precondition.

