Is your feature request related to a problem? Please describe.
The Persistence / persists messages system test is known to be flaky (discussed in #2151). Two distinct failure modes have been identified:
SpecTimeout race on slow clusters (e.g. min k8s v1.29.12): The spec has a 3-minute SpecTimeout that covers the full lifecycle — BeforeEach (cluster creation + waitForRabbitmqRunning) and the It body (waitForRabbitmqUpdate after pod deletion). On slower environments, or when the mutating webhook adds a round-trip to every r.Update() call during reconcile, the pod-restart cycle inside waitForRabbitmqUpdate regularly exceeds what is left of the budget after cluster creation, causing a timeout.
409 AlreadyExists on FlakeAttempts retry: When FlakeAttempts(3) retries the spec, the AfterEach deletion of the previous cluster may not have completed by the time BeforeEach runs again. The createRabbitmqCluster call fails immediately with 409 AlreadyExists, so the retry itself fails before the test body even runs — making FlakeAttempts ineffective.
Describe the solution you'd like
Two targeted changes to test/system/system_test.go:
Wrap cluster creation in BeforeEach with Eventually, so that on a FlakeAttempts retry the setup polls until the previous cluster is fully gone before proceeding.
Increase SpecTimeout from 3 minutes to 5 minutes on the persists messages It node, consistent with other slower specs in the same file (e.g. the TLS spec uses SpecTimeout(time.Minute*5)), to give the reconcile loop sufficient headroom when webhook round-trips are in the path.
Describe alternatives you've considered
Set failurePolicy: Ignore on the mutating webhook: reduces webhook-induced latency during reconcile, but changes production behaviour and is a broader change than necessary.
Increase FlakeAttempts: does not address either root cause; the 409 AlreadyExists still makes retries fail immediately.
Additional context
Discussed in #2151 (comment by @Zerpet): #2151 (comment)
The failure is consistently reproducible on the "Local system tests (min k8s)" CI job (k8s v1.29.12) and was not observed on stable k8s prior to the mutating webhook being introduced in that PR.