[Backport 2025.4] fix(nemesis): skip disrupt_refuse_connection nemesis when target is alone in rack#14976

Merged: pehala merged 1 commit into scylladb:branch-2025.4 from scylladbbot:backport/13965/to-2025.4 on Jun 11, 2026

@scylladbbot

The nemeses `disrupt_refuse_connection_with_block_scylla_ports_on_banned_node` and `disrupt_refuse_connection_with_send_sigstop_signal_to_scylla_on_banned_node` need to remove the target node at some point. If that node is the only one in its rack, the removal is rejected because it would make some existing keyspace RF-rack-invalid.

Example of failure:
https://argus.scylladb.com/tests/scylla-cluster-tests/f4267f80-cf89-4611-8d74-58f0592f33f9

2026-03-11 00:14:39.759: (DisruptionEvent Severity.ERROR) period_type=end event_id=d15192fb-c14d-4a82-903e-d67140a1cfc2 duration=6m42s: nemesis_name=IsolateNodeWithIptableRuleNemesis target_node=Node elasticity-test-nemesis-master-db-node-f4267f80-3 [18.203.101.143 | 10.4.2.41] (Type: i8g.large) (rack: RACK2) errors=Node was not removed properly (Node status:{'state': 'DN', 'load': '381.20GB', 'tokens': '256', 'owns': '?', 'host_id': 'be4a0f28-65a5-47b0-a922-cf9a5df22211', 'rack': 'RACK2'})
Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis/__init__.py", line 6084, in _refuse_connection_from_banned_node
    working_node.run_nodetool(f"removenode {target_host_id}", retry=0, long_running=True)
    ~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 3100, in run_nodetool
    result = runner(cmd, timeout=timeout, ignore_status=ignore_status, verbose=verbose, retry=retry)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_long_running.py", line 70, in run_long_running_cmd
    raise UnexpectedExit(result=result)
sdcm.remote.libssh2_client.exceptions.UnexpectedExit: Encountered a bad command exit code!
Command: '/usr/bin/nodetool  removenode be4a0f28-65a5-47b0-a922-cf9a5df22211 '
Exit code: 4
Stdout:
Stderr:
error executing POST request to http://localhost:10000/storage_service/remove_node with parameters {"host_id": ["be4a0f28-65a5-47b0-a922-cf9a5df22211"]}: remote replied with status code 500 Internal Server Error:
std::runtime_error (Removenode failed: node remove rejected: Cannot remove the node because its removal would make some existing keyspace RF-rack-invalid)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis/__init__.py", line 6344, in wrapper
    result = method(*args, **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis/monkey/__init__.py", line 987, in disrupt
    self.runner.disrupt_refuse_connection_with_block_scylla_ports_on_banned_node()
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis/__init__.py", line 5982, in disrupt_refuse_connection_with_block_scylla_ports_on_banned_node
    self._refuse_connection_from_banned_node(use_iptables=True)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis/__init__.py", line 6034, in _refuse_connection_from_banned_node
    ExitStack() as stack,
    ~~~~~~~~~^^
  File "/usr/local/lib/python3.14/contextlib.py", line 619, in __exit__
    raise exc
  File "/usr/local/lib/python3.14/contextlib.py", line 604, in __exit__
    if cb(*exc_details):
       ~~^^^^^^^^^^^^^^
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis/__init__.py", line 6043, in _finalizer
    self._remove_node_add_node(
    ~~~~~~~~~~~~~~~~~~~~~~~~~~^
        verification_node=working_node,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        node_to_remove=self.target_node,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        remove_node_host_id=target_host_id,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis/__init__.py", line 4126, in _remove_node_add_node
    assert removed_node_status is None, "Node was not removed properly (Node status:{})".format(removed_node_status)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Node was not removed properly (Node status:{'state': 'DN', 'load': '381.20GB', 'tokens': '256', 'owns': '?', 'host_id': 'be4a0f28-65a5-47b0-a922-cf9a5df22211', 'rack': 'RACK2'})

Testing

  • [ ]

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

Reminders

  • Add new configuration options and document them (in sdcm/sct_config.py)

  • Add unit tests to cover my changes (under unit-test/ folder)

  • Update the Readme/doc folder relevant to this change (if needed)

  • (cherry picked from commit c0656a6)

Parent PR: #13965

…lone in rack

The nemeses `disrupt_refuse_connection_with_block_scylla_ports_on_banned_node` and
`disrupt_refuse_connection_with_send_sigstop_signal_to_scylla_on_banned_node` need to remove
the target node at some point, and if that node is the only one in the rack, that is not allowed.

Added a helper function to check if a node is alone in its rack.

(cherry picked from commit c0656a6)
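
The check described in the commit message can be sketched roughly as below. This is a minimal illustration, not the code from the commit: `Node`, `is_alone_in_rack`, `ensure_removable`, and `NemesisSkipped` are hypothetical names invented for this example, and the real helper presumably works on SCT's own node objects.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a cluster node (not the SCT node class)."""
    name: str
    datacenter: str
    rack: str


class NemesisSkipped(Exception):
    """Hypothetical exception signalling that a nemesis should be skipped."""


def is_alone_in_rack(target: Node, nodes: Iterable[Node]) -> bool:
    """Return True when no other node shares the target's datacenter and rack."""
    return not any(
        other.name != target.name
        and other.datacenter == target.datacenter
        and other.rack == target.rack
        for other in nodes
    )


def ensure_removable(target: Node, nodes: Iterable[Node]) -> None:
    """Raise NemesisSkipped before any disruption if the target is alone in its rack."""
    if is_alone_in_rack(target, nodes):
        raise NemesisSkipped(
            f"{target.name} is the only node in rack {target.rack}; "
            "removing it would be rejected"
        )


if __name__ == "__main__":
    cluster = [
        Node("db-node-1", "dc1", "RACK0"),
        Node("db-node-2", "dc1", "RACK0"),
        Node("db-node-3", "dc1", "RACK2"),
    ]
    for node in cluster:
        print(node.name, "alone in rack:", is_alone_in_rack(node, cluster))
```

Running the guard at the start of the nemesis, before any ports are blocked or signals sent, means the test is skipped cleanly instead of failing later in the cleanup step when `removenode` is rejected.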
@github-actions

github-actions Bot commented Jun 9, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

@pehala pehala merged commit 21eb65f into scylladb:branch-2025.4 Jun 11, 2026
9 checks passed