
fix(disrupt_terminate_and_replace_node): raise critical event on failure #10403

Draft: wants to merge 1 commit into base: master

cezarmoise
Contributor

If the nemesis cannot leave the cluster in the topological state it was in before, it should raise a critical error so the test can be stopped.

Add a new event for topology failures, `TopologyFailureEvent`.

refs: #9918
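
The behavior described above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the code from this PR: only the name `TopologyFailureEvent` comes from the description, while `Severity`, `CriticalTopologyFailure`, the function signature, and the list-based cluster model are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Severity(Enum):
    # Hypothetical severity levels, defined here only to keep the sketch standalone.
    ERROR = "ERROR"
    CRITICAL = "CRITICAL"


@dataclass
class TopologyFailureEvent:
    """Hypothetical shape of the event; the real class in the PR may differ."""
    node: str
    message: str
    severity: Severity = Severity.CRITICAL


class CriticalTopologyFailure(Exception):
    """Raised so the run stops instead of continuing on a smaller cluster."""


def terminate_and_replace_node(
    nodes: List[str],
    target: str,
    add_node: Callable[[], str],
    events: List[TopologyFailureEvent],
) -> None:
    nodes.remove(target)  # terminate the target node
    try:
        nodes.append(add_node())  # replace it with a new node
    except Exception as exc:
        # The cluster is now one node short: record a critical event and stop.
        event = TopologyFailureEvent(node=target, message=f"replacement failed: {exc}")
        events.append(event)
        raise CriticalTopologyFailure(event.message) from exc
```

With this shape, a failed replacement leaves one critical event in `events` and propagates an exception that a test runner could treat as fatal.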

Testing

  • [ ]

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

Reminders

  • Add new configuration options and document them (in sdcm/sct_config.py)
  • Add unit tests to cover my changes (under unit-test/ folder)
  • Update the Readme/doc folder relevant to this change (if needed)

@cezarmoise cezarmoise added the backport/2024.2 (Need backport to 2024.2) and backport/2025.1 labels Mar 13, 2025
@cezarmoise cezarmoise requested a review from a team March 13, 2025 14:32
@cezarmoise cezarmoise self-assigned this Mar 13, 2025
@roydahan
Contributor

Why is it needed?

@cezarmoise
Contributor Author

cezarmoise commented Mar 13, 2025

> Why is it needed?

If the test continues, the broken topology causes issues for other nemeses that affect topology, such as GrowShrinkCluster.

@fruch
Contributor

fruch commented Mar 13, 2025

Another option is to refactor this nemesis so that we always proceed to the part that adds a node, regardless of what failed.

Either way, we can't accept a nemesis that removes a node without adding a new one.
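
One way to read the refactoring suggestion is a `try`/`finally` around the steps between removal and re-addition. The sketch below only illustrates that idea; the function name, its parameters, and the list-based cluster model are hypothetical and are not taken from the SCT code base.

```python
from typing import Callable, Iterable, List


def terminate_and_replace_node(
    nodes: List[str],
    target: str,
    intermediate_steps: Iterable[Callable[[], None]],
    add_node: Callable[[], str],
) -> None:
    expected = len(nodes)
    nodes.remove(target)  # the node is gone from this point on
    try:
        for step in intermediate_steps:
            step()  # any of these may raise
    finally:
        # Runs whether or not a step above raised, so the cluster
        # is brought back to its original size before the error propagates.
        if len(nodes) < expected:
            nodes.append(add_node())
```

The original exception still propagates after the `finally` block, so the failure is reported while the node count is restored.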

Contributor

@soyacz soyacz left a comment

@cezarmoise you are looking in the wrong place and trying to fix the wrong issue.
In your test: https://argus.scylladb.com/tests/scylla-cluster-tests/0b0e042d-60a7-4dad-832d-4e38f2e5a5e9
you can see that longevity-parallel-topology-schema--db-node-0b0e042d-6 had already been decommissioned during the disrupt_decommission_streaming_err nemesis, and the cluster never returned to its initial node count. Somehow SCT didn't detect the missing node. Please investigate this issue and propose a fix for it.
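
The kind of detection asked for here could look roughly like a node-count check that runs after every nemesis. This is a hedged sketch under simple assumptions: `NodeCountMismatch`, `check_node_count`, and `run_nemesis_sequence` are hypothetical names, and the cluster is modeled as a plain list rather than through any SCT API.

```python
from typing import Callable, List


class NodeCountMismatch(Exception):
    """Hypothetical error for a cluster that did not return to its initial size."""


def check_node_count(expected: int, nodes: List[str]) -> None:
    if len(nodes) != expected:
        raise NodeCountMismatch(f"expected {expected} nodes, found {len(nodes)}")


def run_nemesis_sequence(
    nodes: List[str],
    nemeses: List[Callable[[List[str]], None]],
) -> List[str]:
    expected = len(nodes)
    failures: List[str] = []
    for nemesis in nemeses:
        try:
            nemesis(nodes)
        except Exception as exc:
            # In this sketch a nemesis failure alone is recorded, not fatal.
            failures.append(f"{nemesis.__name__}: {exc}")
        # The topology check runs after every nemesis, failed or not,
        # so a missing node is caught before the next nemesis starts.
        check_node_count(expected, nodes)
    return failures
```

In this model a nemesis that decommissions a node and then fails would stop the sequence immediately, instead of letting a later nemesis run on a smaller cluster.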

@pehala pehala marked this pull request as draft April 3, 2025 08:50
Labels: backport/2024.2 (Need backport to 2024.2), backport/2025.1
4 participants