test(e2e): decouple ScyllaDB Manager task property update verification from backup, restore, and repair tests #3470
|
/auto-cc |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: czeslavo, rzetelskik |
|
/test images |
Cluster provisioning failed. |
|
/retest |
Description of your changes:
The e2e-gke-parallel check in #3461 failed on a Manager-related test. Ultimately, the failure is caused by scylladb/scylla-manager#4564: if a task fails and is scheduled for retry, a subsequent `PutTask` call - even one that doesn't change the schedule - cancels the pending retry. The task never reruns and the test times out waiting for completion. I couldn't identify the root cause of the initial task failure in a reasonable timeframe, but the investigation was sufficient to say it wasn't a regression from the integration perspective; the upstream bug is tracked in CLOUD-2276.
From an integration testing perspective, verifying that an update mid-run triggers a retry brings no value - it exercises a scylla-manager scheduler edge case rather than operator behaviour. I changed the workflow to: create task -> wait for completion -> update task -> wait for completion. That sequence sidesteps the race entirely.
As part of the same change I decoupled task update verification from task deletion. The tests for "delete repair task" and "disable manager integration" now live in a separate `DescribeTable`, independent of the update test. The same restructuring is applied to the `ScyllaDBManagerTask` single-DC and multi-DC suites, and the backup task update verification is removed from the object storage suite (the update path is covered by the repair task suite and the `ScyllaDBManagerTask` suites). The shared cluster setup (create, rollout, CQL data insertion, manager registration) is moved into `JustBeforeEach` so all sub-tests reuse it without duplication.
These changes should make future flakes easier to isolate.
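For reviewers who want to see why the reordered workflow avoids the race, here is a minimal toy model in plain Go. It is only a sketch: `fakeScheduler`, `taskState`, `tick`, `putTask`, and `waitDone` are hypothetical names made up for illustration, not scylla-manager or operator code. It also assumes that an update to an already completed task simply queues another run. The only behaviour taken from the description above is that an update arriving while a retry is pending cancels that retry.

```go
package main

import "fmt"

// taskState is a toy status for a single task (hypothetical, not the real model).
type taskState int

const (
	stateScheduled    taskState = iota // waiting for its next run
	stateRetryPending                  // last run failed, a retry is queued
	stateDone                          // last run succeeded
	stateStuck                         // retry was cancelled, nothing is queued
)

// fakeScheduler imitates only the behaviour described in scylladb/scylla-manager#4564.
type fakeScheduler struct {
	state       taskState
	failNextRun bool // when true, the next scheduled run fails once
}

// tick lets the scheduler execute whatever is currently queued for the task.
func (s *fakeScheduler) tick() {
	switch s.state {
	case stateScheduled:
		if s.failNextRun {
			s.failNextRun = false
			s.state = stateRetryPending
			return
		}
		s.state = stateDone
	case stateRetryPending:
		s.state = stateDone // the retry succeeds in this model
	}
}

// putTask models a task update. Assumption: it cancels a pending retry and
// queues a fresh run for a task that has already completed.
func (s *fakeScheduler) putTask() {
	switch s.state {
	case stateRetryPending:
		s.state = stateStuck
	case stateDone:
		s.state = stateScheduled
	}
}

// waitDone ticks up to maxTicks times and reports whether the task completed.
func (s *fakeScheduler) waitDone(maxTicks int) bool {
	for i := 0; i < maxTicks && s.state != stateDone; i++ {
		s.tick()
	}
	return s.state == stateDone
}

// updateMidRun follows the old ordering: create -> update -> wait.
func updateMidRun(firstRunFails bool) bool {
	s := &fakeScheduler{failNextRun: firstRunFails}
	s.tick()    // first run, which may fail and queue a retry
	s.putTask() // update lands before the task has completed
	return s.waitDone(5)
}

// updateAfterCompletion follows the new ordering: create -> wait -> update -> wait.
func updateAfterCompletion(firstRunFails bool) bool {
	s := &fakeScheduler{failNextRun: firstRunFails}
	if !s.waitDone(5) {
		return false
	}
	s.putTask()
	return s.waitDone(5)
}

func main() {
	fmt.Println("update mid-run completes:", updateMidRun(true))
	fmt.Println("update after completion completes:", updateAfterCompletion(true))
}
```

In this model the old ordering only hangs when the first run happens to fail, which matches the flaky nature of the failure. The new ordering completes in both cases because the update is never issued while a retry is pending.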
Which issue is resolved by this Pull Request:
Resolves https://scylladb.atlassian.net/browse/OPERATOR-140
/kind flake
/priority important-soon
/cc