Skip to content

test: improve reliability of flaky "persists messages" system test #2152

@amaanx86

Description

@amaanx86

Is your feature request related to a problem? Please describe.

The Persistence / persists messages system test is known to be flaky (discussed in #2151). Two distinct failure modes have been identified:

  1. SpecTimeout race on slow clusters (e.g. min k8s v1.29.12): The spec has a 3-minute SpecTimeout that covers the full lifecycle — BeforeEach (cluster creation + waitForRabbitmqRunning) and the It body (waitForRabbitmqUpdate after pod deletion). On slower environments, or when the mutating webhook adds a round-trip to every r.Update() call during reconcile, the pod-restart cycle inside waitForRabbitmqUpdate regularly exceeds what is left of the budget after cluster creation, causing a timeout.

  2. 409 AlreadyExists on FlakeAttempts retry: When FlakeAttempts(3) retries the spec, the AfterEach deletion of the previous cluster may not have completed by the time BeforeEach runs again. The createRabbitmqCluster call fails immediately with 409 AlreadyExists, so the retry itself fails before the test body even runs — making FlakeAttempts ineffective.

Describe the solution you'd like

Two targeted changes to test/system/system_test.go:

  1. Wrap cluster creation in BeforeEach with Eventually so that on a FlakeAttempts retry the setup polls until the previous cluster is fully gone before proceeding:
Eventually(func(g Gomega) {
    cluster = newRabbitmqCluster(namespace, "persistence-rabbit")
    g.Expect(createRabbitmqCluster(ctx, rmqClusterClient, cluster)).To(Succeed())
}, podCreationTimeout, 5*time.Second).Should(Succeed())
  1. Increase SpecTimeout from 3 minutes to 5 minutes on the persists messages It node, consistent with other slower specs in the same file (e.g. the TLS spec uses SpecTimeout(time.Minute*5)), to give the reconcile loop sufficient headroom when webhook round-trips are in the path.

Describe alternatives you've considered

  • Delete the test entirely — mentioned in feat: move defaulting logic to mutating webhook #2151 as an option; avoids the flake but loses coverage of the persistence guarantee.
  • Set failurePolicy: Ignore on the mutating webhook — reduces webhook-induced latency during reconcile but changes production behaviour and is a broader change than necessary.
  • Increase FlakeAttempts — does not address either root cause; the 409 AlreadyExists still makes retries fail immediately.

Additional context

Discussed in #2151 (comment by @Zerpet): #2151 (comment)

The failure is consistently reproducible on the "Local system tests (min k8s)" CI job (k8s v1.29.12) and was not observed on stable k8s prior to the mutating webhook being introduced in that PR.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions