Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(nemesis): make nemesis more safely by adding repair on all nodes #10496

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

timtimb0t
Copy link
Contributor

@timtimb0t timtimb0t commented Mar 25, 2025

this change add repair at the beggining of disrupt_repair_streaming_err to make this nemesis more safe and avoid c-s data validation errors ref: scylladb/scylladb#21428

Testing

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

Reminders

  • Add New configuration option and document them (in sdcm/sct_config.py)
  • Add unit tests to cover my changes (under unit-test/ folder)
  • Update the Readme/doc folder relevant to this change (if needed)

@timtimb0t timtimb0t added the backport/none Backport is not required label Mar 25, 2025
@timtimb0t timtimb0t force-pushed the validation_after_destroying branch from f28bb16 to ada17ec Compare March 25, 2025 11:02
Copy link
Contributor

@aleksbykov aleksbykov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need make the repair method a bit safely

sdcm/nemesis.py Outdated
node.run_nodetool(sub_cmd="repair", publish_event=publish_event)
def repair_nodetool_repair(self, node=None, publish_event=True, all_nodes=False):
if all_nodes:
for node in self.cluster.nodes:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can use only data nodes self.clsuter.data_nodes. Repair operation could be skipped on zero-token nodes

node = node if node else self.target_node
with adaptive_timeout(Operations.REPAIR, node, timeout=HOUR_IN_SEC * 48):
node.run_nodetool(sub_cmd="repair", publish_event=publish_event)
def repair_nodetool_repair(self, node=None, publish_event=True, all_nodes=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think, you need go over the method usage, and find out whether this method used in Non-disruptive nemesis(methods), so operation is not failed if node is under Disruptive nemesis. Otherwise, you need to make this changes a bit safely by adding ignore status of nodetool command or add try except

@@ -4273,6 +4278,7 @@ def disrupt_repair_streaming_err(self):
"""
Stop repair in middle to trigger some streaming fails, then rebuild the data on the node.
"""
self.repair_nodetool_repair(all_nodes=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a short comment, why we need it here

this change add repair at the beggining of disrupt_repair_streaming_err
to make this nemesis more safe and avoid c-s data validation errors
ref: scylladb/scylladb#21428
@timtimb0t timtimb0t force-pushed the validation_after_destroying branch from ada17ec to 728a8b1 Compare March 26, 2025 11:47
@timtimb0t timtimb0t marked this pull request as ready for review March 26, 2025 11:48
@timtimb0t timtimb0t requested a review from aleksbykov March 26, 2025 11:50
@scylladbbot
Copy link

@timtimb0t new branch manager-3.5 was added, please add backport label if needed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/none Backport is not required
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants