
fix(experiment): propagate fresh ResourceVersion after spec rollback#2930

Open
tbavelier wants to merge 6 commits into main from tbavelier/fix-experiment-test-flakes

Conversation

Member

@tbavelier tbavelier commented Apr 24, 2026

What does this PR do?

Persists the terminal experiment phase (timeout/rollback) in the same reconcile as the spec rollback, via a targeted Status().Patch() guarded by a JSON patch test op on status.experiment.phase. This eliminates a cross-reconcile race where the terminal phase could be silently dropped, without compromising concurrent-write safety or crash-recovery of the rollback annotation.

Motivation

Test_Experiment_StateTransitions subtests (rollback_then_timeout, promoted_then_timeout, with/without stale_old_revision) were flaking on CI — failing with expected=timeout, actual=running at controller_experiment_integration_test.go:947. Reproduced locally (5+ failures in 20 runs).

Root cause: reconcile 1 of the timeout path updated the spec (bumping its ResourceVersion), but the full-status write used the stale RV → 409 → phase=running persisted. On reconcile 2, the refetched spec already matched a freshly-timestamped, rollback-annotated revision, so findMostRecentMatchingRevision returned it, elapsed < timeout held, and the terminal phase was never written.
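The race above can be modeled in a few lines. This is an illustrative stdlib-only sketch (the server, updateSpec, and updateStatus names are hypothetical stand-ins, not the operator's real types): a spec Update bumps the server-side ResourceVersion, so a status write still carrying the pre-Update RV conflicts and the terminal phase is dropped.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical in-memory model of the race: the server bumps ResourceVersion
// on every write and rejects writes carrying a stale RV with a 409-style
// conflict, mirroring Kubernetes optimistic concurrency.
type server struct {
	rv    int
	phase string
}

var errConflict = errors.New("409: resourceVersion conflict")

func (s *server) updateSpec(clientRV int) (newRV int, err error) {
	if clientRV != s.rv {
		return 0, errConflict
	}
	s.rv++ // spec Update bumps the RV
	return s.rv, nil
}

func (s *server) updateStatus(clientRV int, phase string) error {
	if clientRV != s.rv {
		return errConflict // stale RV: the terminal phase write is dropped
	}
	s.rv++
	s.phase = phase
	return nil
}

func main() {
	srv := &server{rv: 1, phase: "running"}
	cachedRV := srv.rv // reconcile reads the object once

	// Reconcile 1: the spec rollback succeeds and bumps the RV...
	if _, err := srv.updateSpec(cachedRV); err != nil {
		panic(err)
	}
	// ...but the in-memory object still carries the pre-Update RV, so the
	// full-status write conflicts and phase=timeout is lost.
	err := srv.updateStatus(cachedRV, "timeout")
	fmt.Println(err, srv.phase)
}
```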

Fix

restorePreviousSpec now writes the terminal phase via a targeted status subresource JSON patch (patchExperimentPhase) after the spec rollback. The patch is narrow (only status.experiment.phase) and carries a test op that asserts the server's phase still matches the phase the rollback decision was based on. Four safety properties are preserved:

  1. Lands in the same reconcile: the targeted patch bypasses the RV bump from the spec Update, so phase=timeout/phase=rollback persists without waiting for a follow-up reconcile.
  2. Concurrent phase writes survive: if the RC daemon (e.g. pkg/fleet/daemon.go:354 writing phase=promoted) moves the phase between our reconcile fetch and our patch, the test op fails and the patch is rejected — our stale decision does not clobber their accepted state. Mirrors the 409-and-retry semantics of a full-status Update.
  3. Concurrent writes to other status fields survive: the patch is applied to an instance.DeepCopy() scratch so the client's response decoding does not advance the caller's instance.ResourceVersion. The downstream full-status write in updateStatusIfNeededV2 therefore retains its own 409 protection for every other status field.
  4. Crash-safe annotation ordering: annotateRevision runs BEFORE patchExperimentPhase. If the operator exits between the two writes, the phase is still non-terminal on restart, handleRollback re-fires, annotateRevision is idempotent, and both writes eventually succeed. Reversing this order would leave a terminal phase without the rollback annotation — and since handleRollback ignores terminal phases, the annotation would never be applied, causing a later re-apply of the same spec to fire an immediate false timeout on the stale CreationTimestamp.

The patch is best-effort — any error (test op failure or transient) is logged at V(1) and the next reconcile re-computes and retries.
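For illustration, a minimal sketch of the guarded patch construction described above (buildPhasePatch is a hypothetical helper; the PR's actual method is patchExperimentPhase, and sending the body via Status().Patch with a JSON-patch type is omitted here):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildPhasePatch assembles the two-op JSON patch: a `test` op asserting the
// server's current phase still matches the phase the rollback decision was
// based on, then a `replace` writing the terminal phase. If a concurrent
// writer moved the phase, the test op fails and the whole patch is rejected.
func buildPhasePatch(expectedCurrent, newPhase string) []byte {
	return fmt.Appendf(nil,
		`[{"op":"test","path":"/status/experiment/phase","value":%q},`+
			`{"op":"replace","path":"/status/experiment/phase","value":%q}]`,
		expectedCurrent, newPhase)
}

func main() {
	patch := buildPhasePatch("running", "timeout")

	// The body must be valid JSON: two ops against the same path.
	var ops []map[string]string
	if err := json.Unmarshal(patch, &ops); err != nil {
		panic(err)
	}
	fmt.Println(ops[0]["op"], ops[1]["op"])
}
```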

Additional Notes

  • restorePreviousSpec signature change: now takes *v2alpha1.DatadogAgent instead of metav1.ObjectMeta (needed for Status().Patch()).
  • rollback signature unchanged (metav1.ObjectMeta).
  • TestRestorePreviousSpec_PhaseSetOnlyOnSuccess was updated to populate instance.Status.Experiment (the new observed-phase capture in restorePreviousSpec dereferences it — safe in production because handleRollback guards against nil before delegating, but needed for this direct unit-test call).
  • Regression guards added:
    • Test_Experiment_RollbackPersistsInSingleReconcile — asserts spec rollback and phase=timeout both persist in a single reconcile.
    • Test_Experiment_RollbackPreservesConcurrentStatusWrite — asserts restorePreviousSpec does not mutate the caller's ResourceVersion (bisected: reverting the scratch-copy fails this test).
    • Test_Experiment_RollbackDoesNotClobberConcurrentPhaseWrite — asserts a concurrent phase=promoted write survives a stale phase=rollback decision, AND that the rollback annotation still lands even when the phase patch is rejected (bisected: reverting to an unconditional merge patch fails the first assertion).

Minimum Agent Versions

Are there minimum versions of the Datadog Agent and/or Cluster Agent required?

  • Agent: n/a
  • Cluster Agent: n/a

Describe your test plan

  • Reproduced the flake: go test ./internal/controller/datadogagent/ -run Test_Experiment_StateTransitions -count=20 — multiple subtest failures before fix.
  • Validated after fix: Test_Experiment_StateTransitions + Test_Experiment_RollbackPersistsInSingleReconcile + Test_Experiment_RollbackPreservesConcurrentStatusWrite + Test_Experiment_RollbackDoesNotClobberConcurrentPhaseWrite -count=20 — all green.
  • Full experiment suite (Test_Experiment_|TestManageExperiment|TestHandleRollback|TestRollback|TestRestorePreviousSpec|TestReapplySameSpecAfterRollback) -count=10 — clean.
  • Full internal/controller/datadogagent package — clean.

Checklist

  • PR has at least one valid label: bug
  • PR has a milestone: v1.26.0 (also qa/skip-qa — test-infrastructure stability, no user-facing change)
  • All commits are signed

🤖 Generated with Claude Code

After a timeout/stopped rollback, the spec Update bumped the DDA
ResourceVersion but the caller's in-memory instance kept the stale RV,
causing the status write to 409 and defer the terminal phase to the
next reconcile. On that next reconcile the spec already matched a
freshly-timestamped, rollback-annotated revision, so findMostRecent
returned it, elapsed < timeout, and the terminal phase was never
persisted — a race that made integration tests flaky on CI.

restorePreviousSpec/rollback now take *DatadogAgent and update
instance.ResourceVersion after the client.Update (or match), so the
status write in the same reconcile succeeds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@tbavelier tbavelier added this to the v1.26.0 milestone Apr 24, 2026
@tbavelier tbavelier added the bug (Something isn't working) and qa/skip-qa labels Apr 24, 2026
@tbavelier tbavelier requested a review from a team April 24, 2026 08:18

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: acd2abafde

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

// Propagate the post-Update ResourceVersion so the downstream status write
// in the same reconcile uses the fresh RV and the terminal phase is
// persisted without requiring a follow-up reconcile.
instance.ResourceVersion = toUpdate.ResourceVersion

P2: Preserve status concurrency when refreshing the RV

When another status writer updates this DatadogAgent after the reconcile read but before this rollback update (for example RC changing .status.experiment), copying the post-update ResourceVersion onto the stale in-memory object makes updateStatusIfNeededV2 later deep-copy and Status().Update the old status without getting the conflict that would normally protect that concurrent change. Please merge the terminal experiment phase into the freshly fetched status or use a targeted status patch instead of only transplanting the RV.



@codecov-commenter

codecov-commenter commented Apr 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 40.71%. Comparing base (f11b43c) to head (0dadba1).

Additional details and impacted files


@@            Coverage Diff             @@
##             main    #2930      +/-   ##
==========================================
+ Coverage   40.69%   40.71%   +0.01%     
==========================================
  Files         321      321              
  Lines       28413    28420       +7     
==========================================
+ Hits        11563    11570       +7     
  Misses      16015    16015              
  Partials      835      835              
Flag Coverage Δ
unittests 40.71% <100.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
internal/controller/datadogagent/experiment.go 85.92% <100.00%> (+0.76%) ⬆️

Continue to review full report in Codecov by Sentry.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f11b43c...0dadba1. Read the comment docs.


tbavelier and others added 2 commits April 24, 2026 10:32
Replace the ResourceVersion transplant with a targeted status subresource
merge patch that only writes the terminal experiment phase. This
addresses the codex review finding: propagating the post-Update
ResourceVersion onto the caller's stale in-memory instance caused
updateStatusIfNeededV2 to silently overwrite any concurrent status
writes made between our initial fetch and our spec Update (e.g. RC
daemon updating status.experiment).

The targeted patch lands the terminal phase in the same reconcile as
the rollback (fixing the original flake) while leaving the full-status
write's 409 protection intact for every other status field. The patch
is best-effort — a transient failure just defers the phase write to
the next reconcile, same as before.

Adds Test_Experiment_RollbackPersistsInSingleReconcile regression guard
asserting both the spec rollback and phase=timeout persist in a single
reconcile.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rollback() no longer mutates ResourceVersion, so it doesn't need the
typed pointer — revert to the original metav1.ObjectMeta signature for
a more minimal diff. restorePreviousSpec keeps the *DatadogAgent
signature since it needs the typed object for Status().Patch().

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@tbavelier
Member Author

@codex review


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 851f223cbe


// reconcile will re-compute and retry.
func (r *Reconciler) patchExperimentPhase(ctx context.Context, instance *v2alpha1.DatadogAgent, phase v2alpha1.ExperimentPhase) {
patch := fmt.Appendf(nil, `{"status":{"experiment":{"phase":%q}}}`, phase)
if err := r.client.Status().Patch(ctx, instance, client.RawPatch(types.MergePatchType, patch)); err != nil {

P2: Avoid refreshing the status RV before the full status update

When this raw status patch succeeds it decodes the server response back into instance, advancing instance.ResourceVersion even though newStatus was computed from the object fetched at the start of reconcile. If Fleet Automation/RC updates any status field (for example a new experiment ID or phase) between that initial fetch and this patch, the later updateStatusIfNeededV2 full Status().Update will use the patched RV and replace status with stale newStatus instead of 409'ing as before, clobbering the concurrent status write. Patch into a scratch object or refetch/merge before the full status update so other status fields keep their conflict protection.


controller-runtime's Status().Patch decodes the server response back
into the passed object, advancing ResourceVersion. Calling it on the
caller's *DatadogAgent instance would let the subsequent full-status
write in updateStatusIfNeededV2 use the patched RV and silently
overwrite any status field that a concurrent writer (e.g. the RC
daemon) updated between our initial reconcile fetch and this patch.

Patch into a DeepCopy scratch so instance.ResourceVersion stays at the
value observed at reconcile start. The full-status write's 409
protection is retained, and the terminal experiment phase still lands
in the same reconcile via the targeted patch.

Adds Test_Experiment_RollbackPreservesConcurrentStatusWrite regression
guard asserting instance.ResourceVersion is not mutated by
restorePreviousSpec.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
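The scratch-copy trick from this commit can be sketched with toy types (obj and statusPatch are illustrative stand-ins, not controller-runtime types): the client decodes the server response back into whatever object it is handed, so patching a DeepCopy keeps the caller's ResourceVersion at its reconcile-start value.

```go
package main

import "fmt"

// Illustrative model of why the patch targets a DeepCopy.
type obj struct {
	ResourceVersion int
	Phase           string
}

func (o *obj) DeepCopy() *obj { c := *o; return &c }

// statusPatch stands in for Status().Patch: it applies the write server-side
// and decodes the post-patch state (including the fresh RV) back into the
// object passed to it.
func statusPatch(srv, into *obj, phase string) {
	srv.ResourceVersion++
	srv.Phase = phase
	*into = *srv
}

func main() {
	srv := &obj{ResourceVersion: 7, Phase: "running"}
	instance := srv.DeepCopy() // object fetched at reconcile start

	// Patch into a scratch copy so the caller's RV is not advanced; the
	// later full-status write keeps its 409 protection for other fields.
	scratch := instance.DeepCopy()
	statusPatch(srv, scratch, "timeout")

	fmt.Println(instance.ResourceVersion, srv.ResourceVersion)
}
```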
@tbavelier
Member Author

@codex review


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7b2aa3b359


func (r *Reconciler) patchExperimentPhase(ctx context.Context, instance *v2alpha1.DatadogAgent, phase v2alpha1.ExperimentPhase) {
patch := fmt.Appendf(nil, `{"status":{"experiment":{"phase":%q}}}`, phase)
scratch := instance.DeepCopy()
if err := r.client.Status().Patch(ctx, scratch, client.RawPatch(types.MergePatchType, patch)); err != nil {

P2: Do not overwrite concurrent experiment phase updates

When the RC daemon accepts a promote/stop after this reconcile fetched instance but before this patch runs, this raw merge patch still writes status.experiment.phase from the stale decision without any resourceVersion/phase precondition. For example, pkg/fleet/experiment.go allows promote from running and pkg/fleet/daemon.go persists phase=promoted; this patch can then change that accepted promotion back to timeout/rollback, whereas the previous full status update would have conflicted and retried. That can leave Fleet believing the experiment was promoted while the DDA status/spec reflect a rollback.


The raw merge patch wrote status.experiment.phase without any
precondition. If a concurrent writer (e.g. the RC daemon accepting a
promotion at pkg/fleet/daemon.go:354) moved the phase between the
reconcile's initial fetch and the restorePreviousSpec patch, the merge
patch would silently overwrite their accepted state with our stale
terminal-phase decision — potentially flipping an accepted promotion
back to timeout/rollback and leaving Fleet's view inconsistent with
the DDA.

Switch to a JSON patch with a `test` op that asserts the server's
current phase still matches the phase the rollback decision was based
on. If it doesn't, the test fails, the patch is rejected, and the
concurrent writer's state survives — mirroring the 409-and-retry
semantics of a full-status Update.

Adds Test_Experiment_RollbackDoesNotClobberConcurrentPhaseWrite
regression guard; fixes TestRestorePreviousSpec_PhaseSetOnlyOnSuccess
to populate instance.Status.Experiment (required by the new observed-
phase capture in restorePreviousSpec).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
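The test-op precondition this commit introduces can be modeled in miniature (applyGuardedPhasePatch is a hypothetical evaluator; a real API server evaluates the JSON patch atomically server-side): a stale terminal-phase decision is rejected when a concurrent writer has already moved the phase, while a matching precondition lets the replace land.

```go
package main

import (
	"errors"
	"fmt"
)

var errTestFailed = errors.New("json patch test op failed")

// applyGuardedPhasePatch models the two-op patch: the test op checks the
// server's current phase against the phase the rollback decision observed;
// only if it still matches does the replace op write the new phase.
func applyGuardedPhasePatch(serverPhase *string, expectedCurrent, newPhase string) error {
	if *serverPhase != expectedCurrent {
		return errTestFailed // a concurrent writer moved the phase; reject
	}
	*serverPhase = newPhase
	return nil
}

func main() {
	// The reconcile observed phase=running and decided on timeout, but the
	// RC daemon accepted a promotion in the meantime.
	phase := "promoted"
	err := applyGuardedPhasePatch(&phase, "running", "timeout")
	fmt.Println(err, phase) // the accepted promotion survives
}
```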
@tbavelier
Member Author

@codex review


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0deaf5e48d


return err
}
newStatus.Experiment.Phase = terminalPhase
r.patchExperimentPhase(ctx, instance, observedPhase, terminalPhase)

P2: Annotate rollback revision before persisting terminal phase

When the operator process exits after this patch succeeds but before the annotateRevision call below runs, the terminal phase has already been persisted. On restart, handleRollback ignores timeout/rollback, so the highest experiment revision never gets annotationExperimentRollback; re-applying the same spec later can hit the stale CreationTimestamp path and immediately time out. Persist the phase only after the annotation attempt, or make terminal-phase reconciles repair a missing rollback annotation.


The previous order was: patch terminal phase → annotate experiment
revision. If the operator crashed between these two writes, the
server held a terminal phase but the revision was unannotated. On
restart, handleRollback ignores terminal phases (the switch only
handles Stopped/Running), so the annotation would never be applied —
a later re-apply of the same spec would then hit the stale
CreationTimestamp path in findMostRecentMatchingRevision and fire an
immediate false timeout.

Swap to annotate-then-patch. If a crash happens between them now,
the phase is still non-terminal on restart, handleRollback re-fires,
annotateRevision is idempotent (early-returns when already annotated),
and patchExperimentPhase runs again. Eventually both writes succeed.

Also extends Test_Experiment_RollbackDoesNotClobberConcurrentPhaseWrite
to assert the annotation lands regardless of whether the phase patch
was rejected by the test op — the annotation is protection for future
re-applies and must not be gated on phase-write success.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
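The annotate-then-patch ordering argument can be sketched as a toy crash/restart model (state and rollbackStep are illustrative names): a crash between the two writes leaves the phase non-terminal, so the rollback path re-fires on restart, the idempotent annotate is a no-op the second time, and both writes eventually land.

```go
package main

import "fmt"

// Toy model of the crash-safe ordering.
type state struct {
	annotated bool
	phase     string
}

// rollbackStep runs one reconcile attempt; crashAfterAnnotate simulates the
// operator exiting between the annotation write and the phase patch.
func rollbackStep(s *state, crashAfterAnnotate bool) {
	if s.phase == "timeout" {
		return // terminal phase: rollback handling is skipped
	}
	if !s.annotated {
		s.annotated = true // annotateRevision (idempotent)
	}
	if crashAfterAnnotate {
		return
	}
	s.phase = "timeout" // patchExperimentPhase
}

func main() {
	s := &state{phase: "running"}
	rollbackStep(s, true)  // crash between the two writes
	rollbackStep(s, false) // restart: phase still non-terminal, re-fires
	fmt.Println(s.annotated, s.phase)
}
```

Note the reversed order would get stuck: a terminal phase without the annotation makes the first guard return early forever, which is exactly the failure mode the commit describes.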
@tbavelier
Member Author

@codex review


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0dadba1e3c


// Best-effort: any error here (test failure or transient) is logged at V(1)
// and the next reconcile will re-compute and retry.
func (r *Reconciler) patchExperimentPhase(ctx context.Context, instance *v2alpha1.DatadogAgent, expectedCurrent, newPhase v2alpha1.ExperimentPhase) {
patch := fmt.Appendf(nil, `[{"op":"test","path":"/status/experiment/phase","value":%q},{"op":"replace","path":"/status/experiment/phase","value":%q}]`, expectedCurrent, newPhase)

P2: Guard phase patch with experiment identity

When Fleet moves the same DDA through another transition that returns to the same phase before this patch runs—for example promoting the timed-out experiment and then starting a new one, which records a different status.experiment.id with phase=running—this test still passes because it only checks phase. The raw patch then marks the newer experiment timeout/rollback, whereas the previous RV-protected status update would have conflicted instead of overwriting a newer status; include the observed experiment ID (or the whole experiment status) in the patch precondition.

