test: improve reliability of flaky "persists messages" system test

**Is your feature request related to a problem? Please describe.**

The `Persistence / persists messages` system test is known to be flaky (discussed in #2151). Two distinct failure modes have been identified:

1. **`SpecTimeout` race on slow clusters (e.g. min k8s v1.29.12):** The spec has a 3-minute `SpecTimeout` that covers the full lifecycle — `BeforeEach` (cluster creation + `waitForRabbitmqRunning`) and the `It` body (`waitForRabbitmqUpdate` after pod deletion). On slower environments, or when the mutating webhook adds a round-trip to every `r.Update()` call during reconcile, the pod-restart cycle inside `waitForRabbitmqUpdate` regularly exceeds what is left of the budget after cluster creation, causing a timeout.

2. **409 AlreadyExists on `FlakeAttempts` retry:** When `FlakeAttempts(3)` retries the spec, the `AfterEach` deletion of the previous cluster may not have completed by the time `BeforeEach` runs again. The `createRabbitmqCluster` call fails immediately with `409 AlreadyExists`, so the retry itself fails before the test body even runs — making `FlakeAttempts` ineffective.

**Describe the solution you'd like**

Two targeted changes to `test/system/system_test.go`:

1. **Wrap cluster creation in `BeforeEach` with `Eventually`** so that on a `FlakeAttempts` retry the setup polls until the previous cluster is fully gone before proceeding:

```go
Eventually(func(g Gomega) {
    cluster = newRabbitmqCluster(namespace, "persistence-rabbit")
    g.Expect(createRabbitmqCluster(ctx, rmqClusterClient, cluster)).To(Succeed())
}, podCreationTimeout, 5*time.Second).Should(Succeed())
```

2. **Increase `SpecTimeout` from 3 minutes to 5 minutes** on the `persists messages` It node, consistent with other slower specs in the same file (e.g. the TLS spec uses `SpecTimeout(time.Minute*5)`), to give the reconcile loop sufficient headroom when webhook round-trips are in the path.

**Describe alternatives you've considered**

- **Delete the test entirely** — mentioned in #2151 as an option; avoids the flake but loses coverage of the persistence guarantee.
- **Set `failurePolicy: Ignore` on the mutating webhook** — reduces webhook-induced latency during reconcile but changes production behaviour and is a broader change than necessary.
- **Increase `FlakeAttempts`** — does not address either root cause; the 409 AlreadyExists still makes retries fail immediately.

**Additional context**

Discussed in #2151 (comment by @Zerpet): https://github.com/rabbitmq/cluster-operator/pull/2151#issuecomment-4574039429

The failure is consistently reproducible on the "Local system tests (min k8s)" CI job (k8s v1.29.12) and was not observed on stable k8s prior to the mutating webhook being introduced in that PR.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test: improve reliability of flaky "persists messages" system test #2152

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

test: improve reliability of flaky "persists messages" system test #2152

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions