health check fails in check_group0_tokenring_consistency if one of the nodes is down  #9599

Open
@yarongilor

Description

Packages

Scylla version: 2024.3.0~dev-20241218.42cc7a4f12de with build-id 0ee8a26c08783c18bd6dead5ba27a9e622efa885

Kernel Version: 6.8.0-1021-aws

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

Describe your issue in detail and steps it took to produce it.
node-1 runs out of disk space ("No space left on device") following a core dump:

2024-12-19 20:46:23.828 <2024-12-19 20:46:23.804>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=14c23b02-53e1-4bfa-b976-a93a9abb930c: type=NO_SPACE_ERROR regex=No space left on device line_number=10491 node=elasticity-test-nemesis-master-db-node-90bfa08f-1
2024-12-19T20:46:23.804+00:00 elasticity-test-nemesis-master-db-node-90bfa08f-1      !ERR | scylla[10468]:  [shard 0:main] storage_service - Shutting down communications due to I/O errors until operator intervention: Disk error: std::system_error (error system:28, No space left on device)
0x1f26dfd

The node also fails to restart the scylla-server service:

Nemesis Information
Class: Sisyphus
Name: disrupt_switch_between_password_authenticator_and_saslauthd_authenticator_and_back
Status: Failed
Failure reason
Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5446, in wrapper
    result = method(*args[1:], **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 1032, in disrupt_switch_between_password_authenticator_and_saslauthd_authenticator_and_back
    update_authenticator(self.cluster.nodes, orig_auth)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 2432, in update_authenticator
    node.restart_scylla_server()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 2535, in restart_scylla_server
    self.restart_service(service_name='scylla-server', timeout=timeout * 2)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 2461, in restart_service
    self._service_cmd(service_name=service_name, cmd='restart', timeout=timeout, ignore_status=ignore_status)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 2445, in _service_cmd
    return self.remoter.run(cmd, timeout=timeout, ignore_status=ignore_status)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 653, in run
    result = _run()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 72, in inner
    return func(*args, **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 644, in _run
    return self._run_execute(cmd, timeout, ignore_status, verbose, new_session, watchers)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 577, in _run_execute
    result = connection.run(**command_kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 625, in run
    return self._complete_run(channel, exception, timeout_reached, timeout, result, warn, stdout, stderr)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 660, in _complete_run
    raise UnexpectedExit(result)
sdcm.remote.libssh2_client.exceptions.UnexpectedExit: Encountered a bad command exit code!

Command: 'sudo systemctl restart scylla-server.service'

Exit code: 1

Stdout:



Stderr:

Job for scylla-server.service failed because the control process exited with error code.
See "systemctl status scylla-server.service" and "journalctl -xeu scylla-server.service" for details.

The nemesis thread then cannot get the node's host ID and fails permanently (the test keeps running without any nemesis):

2024-12-19 21:06:47.838: (ThreadFailedEvent Severity.ERROR) period_type=one-time event_id=281ad081-a1a5-4cd9-ab8e-6bc2f7bc949b: message='NoneType' object has no attribute 'get'
Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/sct_events/decorators.py", line 26, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 487, in run
    self.disrupt()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 6643, in disrupt
    self.call_next_nemesis()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2054, in call_next_nemesis
    self.execute_disrupt_method(disrupt_method=next(self.disruptions_cycle))
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 1980, in execute_disrupt_method
    disrupt_method()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5504, in wrapper
    args[0].cluster.check_cluster_health()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 4388, in check_cluster_health
    node.check_node_health()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 2738, in check_node_health
    event = next(events, None)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/raft/__init__.py", line 331, in check_group0_tokenring_consistency
    self._node.name, self._node.host_id)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 483, in host_id
    return self.parent_cluster.get_nodetool_info(self, ignore_status=True, publish_event=False).get("ID")
AttributeError: 'NoneType' object has no attribute 'get'

With scylla down on this node, get_nodetool_info probably failed to run and returned None, so the chained .get("ID") call in host_id raised the AttributeError.
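
A possible direction for a fix, as a minimal standalone sketch and not the actual SCT code: treat a missing nodetool result as "host ID unavailable" and let the consistency check report that instead of raising. The names `safe_host_id` and `check_node` and the `group0_ids` set are hypothetical and only illustrate the None guard. The membership test is a simplified stand-in for whatever check_group0_tokenring_consistency really compares.

```python
from typing import Callable, Optional, Set


def safe_host_id(get_nodetool_info: Callable[[], Optional[dict]]) -> Optional[str]:
    """Return the host ID, or None when nodetool info could not be collected.

    `get_nodetool_info` is a hypothetical stand-in for the call made in
    cluster.py's host_id property; it may return None when the node is down.
    """
    info = get_nodetool_info()
    if not info:
        # Node is down or nodetool failed: no dict to call .get() on.
        return None
    return info.get("ID")


def check_node(node_name: str, host_id: Optional[str], group0_ids: Set[str]) -> Optional[str]:
    """Return a problem description, or None if the node looks consistent.

    Simplified assumption: the node is consistent when its host ID is
    among the known group0 member IDs.
    """
    if host_id is None:
        return f"{node_name}: host ID unavailable (node may be down), consistency check skipped"
    if host_id not in group0_ids:
        return f"{node_name}: host ID {host_id} not found in group0 members"
    return None


if __name__ == "__main__":
    # A down node yields a reportable message instead of an AttributeError.
    print(check_node("node-1", safe_host_id(lambda: None), {"abc"}))
    # A healthy node with a known host ID yields no problem.
    print(check_node("node-2", safe_host_id(lambda: {"ID": "abc"}), {"abc"}))
```

With a guard like this, the health check could publish an error event for the down node and the nemesis thread would keep running, rather than terminating on an unhandled exception.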

Impact

Describe the impact this issue causes to the user.

How frequently does it reproduce?

Describe the frequency with how this issue can be reproduced.

Installation details

Cluster size: 3 nodes (i4i.large)

Scylla Nodes used in this run:

  • elasticity-test-nemesis-master-db-node-90bfa08f-3 (54.217.164.63 | 10.4.14.82) (shards: 2)
  • elasticity-test-nemesis-master-db-node-90bfa08f-2 (34.246.246.186 | 10.4.14.242) (shards: 2)
  • elasticity-test-nemesis-master-db-node-90bfa08f-1 (54.228.89.37 | 10.4.14.150) (shards: 2)

OS / Image: ami-0f14cab4bda57c2b2 (aws: undefined_region)

Test: byo-longevity-test-yg2
Test id: 90bfa08f-2a3d-4ba9-b443-13ce00925638
Test name: scylla-staging/yarongilor/byo-longevity-test-yg2
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 90bfa08f-2a3d-4ba9-b443-13ce00925638
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 90bfa08f-2a3d-4ba9-b443-13ce00925638

Logs:

Jenkins job URL
Argus

Labels

area/elastic cloud (Issues related to the elastic cloud project)
