fix(nemesis): add support ipv6 for refuse connection for banned node #10594

Open
wants to merge 1 commit into base: master

Conversation

aleksbykov
Contributor

@aleksbykov aleksbykov commented Apr 6, 2025

The disrupt_refuse_connection_with_* nemeses do not support IPv6.

  • Added a command for blocking ports on the IPv6 stack.

When a node is banned but still alive, c-s/s-b can connect to it
and fail with a critical error, because the banned node reports
that the other nodes in the cluster are down.

  • Added a new node_operation: block_loader_workload_for_scyllanode.

This blocks connections from the loaders to the Scylla node, so that
c-s/s-b do not hit a critical error by connecting to the
banned node and failing to run.

Fixes: #10434
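
For readers following the description above, here is a minimal sketch of how a port-blocking command could be chosen per address family. It is not the code from this PR: the helper names `firewall_binary` and `build_drop_rules`, the use of the INPUT chain, and port 9042 in the usage note are assumptions made for illustration. The only point it shows is that IPv4 rules go through `iptables` while IPv6 rules need `ip6tables`.

```python
import ipaddress


def firewall_binary(address: str) -> str:
    """Return the firewall CLI matching the address family (hypothetical helper)."""
    return "ip6tables" if ipaddress.ip_address(address).version == 6 else "iptables"


def build_drop_rules(source_ip: str, ports: list[int], action: str = "-A") -> list[str]:
    """Build commands that drop TCP traffic from source_ip to the given ports.

    action is "-A" to append a rule or "-D" to delete the same rule later.
    The INPUT chain and the DROP target are assumptions of this sketch.
    """
    binary = firewall_binary(source_ip)
    return [
        f"{binary} {action} INPUT -s {source_ip} -p tcp --dport {port} -j DROP"
        for port in ports
    ]
```

Calling `build_drop_rules("fd00::5", [9042])` yields an `ip6tables` command, while the same call with an IPv4 source yields an `iptables` command, so a cluster running on an IPv6 stack gets rules that actually match its traffic.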

Testing

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

Reminders

  • Add new configuration options and document them (in sdcm/sct_config.py)
  • Add unit tests to cover my changes (under unit-test/ folder)
  • Update the Readme/doc folder relevant to this change (if needed)

@soyacz
Contributor

soyacz commented Apr 7, 2025

Isn't it just a Scylla issue? Even under high load, shouldn't other nodes know that one is down within 10 minutes?

@aleksbykov aleksbykov force-pushed the fix-10434-increase-timeout-to-wait-down branch 8 times, most recently from 912fb11 to a06b7e1 Compare April 11, 2025 12:24
@aleksbykov aleksbykov changed the title fix(nemesis): increase timeout waiting node down fix(nemesis): add support ipv6 for refuse connection for banned node Apr 11, 2025
@aleksbykov
Contributor Author

Isn't it just a Scylla issue? Even under high load, shouldn't other nodes know that one is down within 10 minutes?

I found the problem. It was not the timeout; it was related to IPv6.

@aleksbykov aleksbykov marked this pull request as ready for review April 11, 2025 12:32
@aleksbykov
Contributor Author

An additional staging job is running.

    target_node.log.debug("Send signal SIGSTOP to scylla process on node %s", target_node.name)
    target_node.remoter.sudo("pkill --signal SIGSTOP -e scylla", timeout=60)
    yield
    target_node.log.debug("Send signal SIGCONT to scylla process on node %s", target_node.name)
    target_node.remoter.sudo(cmd="pkill --signal SIGCONT -e scylla", timeout=60)


@contextlib.contextmanager
def block_loaders_payload_for_scylla_node(scylla_node: BaseNode, loader_nodes: list[BaseNode]):
Contributor

Please add a docstring explaining why this is needed.

Contributor Author

Added.
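
To illustrate the kind of context manager and docstring discussed in this thread, here is a minimal sketch. It is not the body of `block_loaders_payload_for_scylla_node` from the PR, which is not shown on this page. The names `block_loader_traffic` and `RecordingRemoter`, the default port 9042, and the exact firewall rule are hypothetical; the `sudo(cmd, timeout=...)` call shape follows the snippet above.

```python
import contextlib
import ipaddress


class RecordingRemoter:
    """Hypothetical stand-in for a node's remote shell; it only records commands."""

    def __init__(self):
        self.commands = []

    def sudo(self, cmd, timeout=60):
        self.commands.append(cmd)


@contextlib.contextmanager
def block_loader_traffic(remoter, loader_ips, port=9042):
    """Drop loader connections to a banned node while the block is active.

    A banned but still running node reports the rest of the cluster as down,
    so a stress tool that connects to it can fail with a critical error.
    Blocking the loaders for the duration of the nemesis avoids that.
    """
    applied = []
    try:
        for ip in loader_ips:
            # Assumption: IPv6 sources need ip6tables, IPv4 sources need iptables.
            binary = "ip6tables" if ipaddress.ip_address(ip).version == 6 else "iptables"
            rule = f"INPUT -s {ip} -p tcp --dport {port} -j DROP"
            remoter.sudo(f"{binary} -A {rule}", timeout=60)
            applied.append((binary, rule))
        yield
    finally:
        # Remove only the rules that were actually added, even after an error.
        for binary, rule in reversed(applied):
            remoter.sudo(f"{binary} -D {rule}", timeout=60)
```

Putting the cleanup in `finally` means the rules are removed even if the code inside the `with` block raises, which matters for a nemesis that is expected to leave the node reachable again afterwards.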

@@ -48,4 +76,6 @@ def is_node_removed_from_cluster(removed_node: BaseNode, verification_node: Base

 def is_node_seen_as_down(down_node: BaseNode, verification_node: BaseNode) -> bool:
     LOGGER.debug("Verification node %s", verification_node.name)
-    return down_node not in verification_node.parent_cluster.get_nodes_up_and_normal(verification_node)
+    nodes_status = verification_node.parent_cluster.get_nodetool_status(verification_node, dc_aware=False)
+    down_node_status = nodes_status.get(down_node.ip_address)
Contributor

Shouldn't this use down_node.listen_address?

Contributor Author

It will be the same as ip_address, and we use it everywhere with nodetool_status.
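
The lookup discussed in this thread can be sketched as follows. This is an illustration only: the rest of the changed function is not visible on this page, and the status layout used here (a dict keyed by IP address whose entries carry a `state` field such as `UN` or `DN`) is an assumption, as is the choice to treat a missing entry as "not confirmed down". The function name `is_seen_as_down` is hypothetical.

```python
def is_seen_as_down(nodes_status: dict, down_node_ip: str) -> bool:
    """Return True if the status map reports the node at down_node_ip as down.

    nodes_status is assumed to map IP address -> {"state": "UN" | "DN" | ...},
    where "DN" is nodetool's Down/Normal state.
    """
    entry = nodes_status.get(down_node_ip)
    if entry is None:
        # Assumption of this sketch: an unknown node is not confirmed as down.
        return False
    return entry.get("state") == "DN"
```

Because the lookup key is an address string, the check only works if the key format matches what the verification node reports, which is why the choice between `ip_address` and `listen_address` is worth confirming on IPv6 setups.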

The disrupt_refuse_connection_with_* nemeses do not support IPv6.
 - Added a command for blocking ports on the IPv6 stack.

When a node is banned but still alive, c-s/s-b can connect to it
and fail with a critical error, because the banned node reports
that the other nodes in the cluster are down.
 - Added a new node_operation: block_loader_workload_for_scyllanode.
 This blocks connections from the loaders to the Scylla node, so that
 c-s/s-b do not hit a critical error by connecting to the
 banned node and failing to run.

Fixes: scylladb#10434
@aleksbykov aleksbykov force-pushed the fix-10434-increase-timeout-to-wait-down branch from a06b7e1 to 2c4e417 Compare April 11, 2025 19:31

Successfully merging this pull request may close these issues.

Need to increase wait timeout for disrupt_refuse_connection_with_block_scylla_ports_on_banned_node nemesis
2 participants