AddRemoveDC nemesis fails when trying to decommission the only node in dc #10052

Open
@jsmolar

Description

When the node is decommissioned, three keyspaces are present: 'keyspace1', 'keyspace_new_dc', and 'scylla_bench'. The decommission affects keyspace_new_dc in us-east_nemesis_dc. However, us-east_nemesis_dc contains only one node, and the keyspace has a replication factor (RF) of 1 in this DC. This makes decommissioning the node impossible: its tablets must be drained to another node in the same DC, and there are no candidate nodes to receive them.
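The condition described above can be sketched as a pre-check that runs before the decommission is attempted. This is a minimal illustration, not code from scylla-cluster-tests: `KeyspaceReplication` and `blocking_keyspaces` are hypothetical names, and the `us-east` DC with RF 3 is an assumed placeholder for the original datacenter. Only the three keyspace names, `us-east_nemesis_dc`, and the RF of 1 come from this report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyspaceReplication:
    """Per-DC replication factors of one keyspace (hypothetical model)."""
    name: str
    rf_per_dc: dict


def blocking_keyspaces(keyspaces, dc, nodes_in_dc):
    """Return the keyspaces that would block decommissioning one node in `dc`.

    After the node leaves, `nodes_in_dc - 1` nodes remain in the DC. A keyspace
    blocks the operation if its RF in `dc` exceeds that remaining node count.
    """
    remaining = nodes_in_dc - 1
    return [ks.name for ks in keyspaces if ks.rf_per_dc.get(dc, 0) > remaining]


# "us-east" and RF 3 are assumptions; the report only states RF 1 in the nemesis DC.
keyspaces = [
    KeyspaceReplication("keyspace1", {"us-east": 3}),
    KeyspaceReplication("keyspace_new_dc", {"us-east": 3, "us-east_nemesis_dc": 1}),
    KeyspaceReplication("scylla_bench", {"us-east": 3}),
]

blocked = blocking_keyspaces(keyspaces, "us-east_nemesis_dc", nodes_in_dc=1)
print(blocked)  # ['keyspace_new_dc']
```

With a single node in the DC, zero nodes remain after removal, so any keyspace with RF of at least 1 there is reported as blocking.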

logs:

2025-02-04 22:06:46.905: (DisruptionEvent Severity.ERROR) period_type=end event_id=8e34dc1b-0b98-44d9-b28e-2e4702c12970 duration=54m16s: nemesis_name=AddRemoveDc target_node=Node longevity-200gb-48h-verify-limited--db-node-775d283e-2 [98.81.100.146 | 10.12.3.79] errors=Encountered a bad command exit code!
Command: "/usr/bin/nodetool -u cassandra -pw 'cassandra'  decommission "
Exit code: 4
Stdout:
Stderr:
error executing POST request to http://localhost:10000/storage_service/decommission with parameters {}: remote replied with status code 500 Internal Server Error:
std::runtime_error (Decommission failed. See earlier errors (Rolled back: Failed to drain tablets: std::runtime_error (There are nodes with tablets to drain but no candidate nodes in DC us-east_nemesis_dc. Consider adding new nodes or reducing replication factor.)). Request ID: e966e93a-e343-11ef-09a0-b72664f26659)
Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5501, in wrapper
    result = method(*args[1:], **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 4863, in disrupt_add_remove_dc
    self.cluster.decommission(new_node)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 5078, in decommission
    node.run_nodetool("decommission", timeout=timeout, long_running=True, retry=0)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 2664, in run_nodetool
    runner(cmd, timeout=timeout, ignore_status=ignore_status, verbose=verbose, retry=retry)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_long_running.py", line 67, in run_long_running_cmd
    raise UnexpectedExit(result=result)
sdcm.remote.libssh2_client.exceptions.UnexpectedExit: Encountered a bad command exit code!
Command: "/usr/bin/nodetool -u cassandra -pw 'cassandra'  decommission "
Exit code: 4
Stdout:
Stderr:
error executing POST request to http://localhost:10000/storage_service/decommission with parameters {}: remote replied with status code 500 Internal Server Error:
std::runtime_error (Decommission failed. See earlier errors (Rolled back: Failed to drain tablets: std::runtime_error (There are nodes with tablets to drain but no candidate nodes in DC us-east_nemesis_dc. Consider adding new nodes or reducing replication factor.)). Request ID: e966e93a-e343-11ef-09a0-b72664f26659)
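The error message suggests either adding nodes or reducing the replication factor. The second option could look roughly like the sketch below, which builds an `ALTER KEYSPACE` statement that sets the RF of the DC being removed to 0 so that no tablets need to stay there. `drop_dc_replication_cql` is a hypothetical helper, the `us-east` entry is an assumed placeholder, and whether the nemesis should take this route is itself an assumption rather than something this report confirms.

```python
def drop_dc_replication_cql(keyspace, rf_per_dc, dc):
    """Build an ALTER KEYSPACE statement that sets the RF of `dc` to 0.

    `rf_per_dc` is the keyspace's current per-DC replication map. It is copied,
    so the caller's dict is not modified.
    """
    new_rf = dict(rf_per_dc)
    new_rf[dc] = 0
    options = ", ".join(f"'{name}': {rf}" for name, rf in new_rf.items())
    return (
        f"ALTER KEYSPACE {keyspace} WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', {options}}}"
    )


# "us-east": 3 is an assumption; only the nemesis DC and its RF of 1 are from the report.
stmt = drop_dc_replication_cql(
    "keyspace_new_dc",
    {"us-east": 3, "us-east_nemesis_dc": 1},
    "us-east_nemesis_dc",
)
print(stmt)
```

The statement would have to run before `nodetool decommission` for the drain to have nothing left to move out of the DC.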
