fix(disrupt_terminate_and_replace_node): raise critical event on failure #10403

cezarmoise · 2025-03-13T14:32:33Z

If the nemesis cannot restore the cluster to the topological state it was in beforehand, it should raise a critical error so the test can be stopped.

Add a new event for topology failures, `TopologyFailureEvent`.

refs: #9918
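
A minimal sketch of the intended behaviour, not the actual SCT code: the `Severity` enum, the shape of `TopologyFailureEvent`, the `published` list, and the `terminate`/`replace` callables are hypothetical stand-ins assumed only for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Severity(Enum):
    # Hypothetical severity levels; the real event framework may differ.
    ERROR = "ERROR"
    CRITICAL = "CRITICAL"


@dataclass
class TopologyFailureEvent:
    # Hypothetical event shape: which nemesis failed and why.
    nemesis: str
    message: str
    severity: Severity = Severity.CRITICAL


def terminate_and_replace(
    terminate: Callable[[], None],
    replace: Callable[[], None],
    published: List[TopologyFailureEvent],
) -> None:
    """Terminate a node, then replace it; record a critical event if the replace fails."""
    terminate()
    try:
        replace()
    except Exception as exc:
        # The cluster is now one node short, so flag this as critical and re-raise.
        published.append(
            TopologyFailureEvent(
                nemesis="disrupt_terminate_and_replace_node",
                message=f"replacement node was not added: {exc}",
            )
        )
        raise
```

Here a plain list stands in for whatever mechanism publishes events; a test runner that stops on critical severity would see the event and abort.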

Testing

[ ]

PR pre-checks (self review)

I added the relevant backport labels
I didn't leave commented-out/debugging code

Reminders

Add new configuration options and document them (in sdcm/sct_config.py)
Add unit tests to cover my changes (under unit-test/ folder)
Update the Readme/doc folder relevant to this change (if needed)


roydahan · 2025-03-13T15:18:18Z

Why is it needed?

cezarmoise · 2025-03-13T19:20:20Z

> Why is it needed?

If the test continues after such a failure, it causes problems for other nemeses that affect topology, such as GrowShrinkCluster.

fruch · 2025-03-13T21:09:35Z

Another option is to refactor this nemesis so that it always proceeds to the step that adds a node, regardless of what failed.

Either way, we can't accept a nemesis that removes a node without adding a new one.
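
For illustration, the refactor suggested here could look roughly like the sketch below. It is only an assumption about the structure: `terminate`, `intermediate_steps`, and `add_node` are hypothetical callables, not functions from the nemesis code.

```python
from typing import Callable, Iterable


def terminate_then_always_add(
    terminate: Callable[[], None],
    intermediate_steps: Iterable[Callable[[], None]],
    add_node: Callable[[], None],
) -> None:
    """Once a node has been removed, always reach the step that adds one back."""
    # If terminate() itself raises, no node was removed, so nothing is added.
    terminate()
    try:
        for step in intermediate_steps:
            step()
    finally:
        # Runs whether or not an intermediate step raised; the original
        # exception (if any) still propagates after the node is added.
        add_node()
```

The `finally` clause keeps the failure visible to the caller while still restoring the node count.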

soyacz

@cezarmoise you are looking in the wrong place and trying to fix the wrong issue.
In your test: https://argus.scylladb.com/tests/scylla-cluster-tests/0b0e042d-60a7-4dad-832d-4e38f2e5a5e9
note that longevity-parallel-topology-schema--db-node-0b0e042d-6 had already been decommissioned during the disrupt_decommission_streaming_err nemesis, and the cluster never returned to its initial node count. Somehow SCT did not detect the missing node. Please investigate this issue and propose a fix for it.
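
One way such a check could be sketched, purely as an assumption about what detection might look like: `run_with_node_count_check`, `NodeCountMismatch`, and the `count_nodes` callable are hypothetical names, not existing SCT APIs.

```python
from typing import Callable


class NodeCountMismatch(Exception):
    """Hypothetical error: the cluster size changed across a nemesis run."""


def run_with_node_count_check(
    nemesis: Callable[[], None],
    count_nodes: Callable[[], int],
) -> None:
    """Run a nemesis and verify the cluster has the same node count afterwards."""
    expected = count_nodes()
    try:
        nemesis()
    finally:
        actual = count_nodes()
        if actual != expected:
            # Raised even if the nemesis itself failed; Python chains the
            # nemesis exception as context, so it is not lost.
            raise NodeCountMismatch(
                f"expected {expected} nodes after nemesis, found {actual}"
            )
```

With a check like this around every topology-changing nemesis, a node lost in an earlier disruption would be reported at that point rather than surfacing in a later one.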

pehala · 2025-05-09T08:13:15Z

@cezarmoise what is the future of this PR? Do you plan to continue with it, or can it be closed?

cezarmoise added the backport/2024.2 and backport/2025.1 labels Mar 13, 2025

cezarmoise requested a review from a team March 13, 2025 14:32

cezarmoise self-assigned this Mar 13, 2025

soyacz requested changes Mar 14, 2025

pehala marked this pull request as draft April 3, 2025 08:50

scylladbbot added the backport/2025.2 label May 7, 2025

cezarmoise closed this May 9, 2025

cezarmoise deleted the fix-terminate-and-replace branch June 12, 2025 09:16
