Description
Packages
Scylla version: 2024.3.0~dev-20241218.42cc7a4f12de with build-id 0ee8a26c08783c18bd6dead5ba27a9e622efa885
Kernel Version: 6.8.0-1021-aws
Issue description
- This issue is a regression.
- It is unknown if this issue is a regression.
node-1 runs out of disk space ("No space left on device") following a core dump:
2024-12-19 20:46:23.828 <2024-12-19 20:46:23.804>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=14c23b02-53e1-4bfa-b976-a93a9abb930c: type=NO_SPACE_ERROR regex=No space left on device line_number=10491 node=elasticity-test-nemesis-master-db-node-90bfa08f-1
2024-12-19T20:46:23.804+00:00 elasticity-test-nemesis-master-db-node-90bfa08f-1 !ERR | scylla[10468]: [shard 0:main] storage_service - Shutting down communications due to I/O errors until operator intervention: Disk error: std::system_error (error system:28, No space left on device)
0x1f26dfd
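The NO_SPACE_ERROR event above records the regex that matched the log line (`No space left on device`). As a minimal sketch of that matching step, assuming a plain substring-style regex search (the helper name `is_no_space_error` is hypothetical and is not part of SCT), the same check can be reproduced against a saved log:

```python
import re

# Regex taken from the event above (regex=No space left on device).
NO_SPACE_PATTERN = re.compile(r"No space left on device")


def is_no_space_error(line: str) -> bool:
    """Return True if a log line reports ENOSPC. Hypothetical helper, not SCT code."""
    return NO_SPACE_PATTERN.search(line) is not None


# Message part of the log line quoted in this report.
log_line = (
    "storage_service - Shutting down communications due to I/O errors until "
    "operator intervention: Disk error: std::system_error "
    "(error system:28, No space left on device)"
)
print(is_no_space_error(log_line))
```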
The node also fails to restart the scylla-server service:
Nemesis Information
Class: Sisyphus
Name: disrupt_switch_between_password_authenticator_and_saslauthd_authenticator_and_back
Status: Failed
Failure reason
Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5446, in wrapper
result = method(*args[1:], **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 1032, in disrupt_switch_between_password_authenticator_and_saslauthd_authenticator_and_back
update_authenticator(self.cluster.nodes, orig_auth)
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 2432, in update_authenticator
node.restart_scylla_server()
File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 2535, in restart_scylla_server
self.restart_service(service_name='scylla-server', timeout=timeout * 2)
File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 2461, in restart_service
self._service_cmd(service_name=service_name, cmd='restart', timeout=timeout, ignore_status=ignore_status)
File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 2445, in _service_cmd
return self.remoter.run(cmd, timeout=timeout, ignore_status=ignore_status)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 653, in run
result = _run()
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 72, in inner
return func(*args, **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 644, in _run
return self._run_execute(cmd, timeout, ignore_status, verbose, new_session, watchers)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 577, in _run_execute
result = connection.run(**command_kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 625, in run
return self._complete_run(channel, exception, timeout_reached, timeout, result, warn, stdout, stderr)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 660, in _complete_run
raise UnexpectedExit(result)
sdcm.remote.libssh2_client.exceptions.UnexpectedExit: Encountered a bad command exit code!
Command: 'sudo systemctl restart scylla-server.service'
Exit code: 1
Stdout:
Stderr:
Job for scylla-server.service failed because the control process exited with error code.
See "systemctl status scylla-server.service" and "journalctl -xeu scylla-server.service" for details.
The nemesis thread then cannot get the node's host ID and fails permanently (the test keeps running without a nemesis):
ThreadFailedEvent
ERROR
no nemesis
2024-12-19 21:06:47.838
one-time
2024-12-19 21:06:47.838: (ThreadFailedEvent Severity.ERROR) period_type=one-time event_id=281ad081-a1a5-4cd9-ab8e-6bc2f7bc949b: message='NoneType' object has no attribute 'get'
Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/sdcm/sct_events/decorators.py", line 26, in wrapper
return func(*args, **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 487, in run
self.disrupt()
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 6643, in disrupt
self.call_next_nemesis()
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 2054, in call_next_nemesis
self.execute_disrupt_method(disrupt_method=next(self.disruptions_cycle))
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 1980, in execute_disrupt_method
disrupt_method()
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5504, in wrapper
args[0].cluster.check_cluster_health()
File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 4388, in check_cluster_health
node.check_node_health()
File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 2738, in check_node_health
event = next(events, None)
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/raft/__init__.py", line 331, in check_group0_tokenring_consistency
self._node.name, self._node.host_id)
File "/home/ubuntu/scylla-cluster-tests/sdcm/cluster.py", line 483, in host_id
return self.parent_cluster.get_nodetool_info(self, ignore_status=True, publish_event=False).get("ID")
AttributeError: 'NoneType' object has no attribute 'get'
get_nodetool_info() probably failed to run while the node was in this state and returned None, so the chained .get("ID") call in host_id raised the AttributeError.
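One possible way to keep the health check from killing the nemesis thread is to guard against the None result before calling .get(). The following is only a sketch under that assumption: `safe_host_id` and the `get_nodetool_info` callable are hypothetical stand-ins for illustration, not the actual sdcm code, and the real fix may belong elsewhere (for example in how check_node_health treats an unreachable node).

```python
from typing import Callable, Optional


def safe_host_id(get_nodetool_info: Callable[[], Optional[dict]]) -> Optional[str]:
    """Return the node's host ID, or None when nodetool info is unavailable.

    `get_nodetool_info` is a hypothetical stand-in for the call made in
    host_id; it is assumed to return a dict on success and None on failure.
    """
    info = get_nodetool_info()
    if info is None:
        # Scylla is down (e.g. out of disk space), so there is no ID to report.
        return None
    return info.get("ID")


# A failed nodetool call no longer raises AttributeError.
print(safe_host_id(lambda: None))
# A successful call still yields the ID field.
print(safe_host_id(lambda: {"ID": "example-host-id"}))
```

With a guard like this the caller still has to decide how to handle a missing host ID, for instance by reporting the node as unhealthy instead of failing the whole nemesis thread.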
Impact
Describe the impact this issue causes to the user.
How frequently does it reproduce?
Describe the frequency with how this issue can be reproduced.
Installation details
Cluster size: 3 nodes (i4i.large)
Scylla Nodes used in this run:
- elasticity-test-nemesis-master-db-node-90bfa08f-3 (54.217.164.63 | 10.4.14.82) (shards: 2)
- elasticity-test-nemesis-master-db-node-90bfa08f-2 (34.246.246.186 | 10.4.14.242) (shards: 2)
- elasticity-test-nemesis-master-db-node-90bfa08f-1 (54.228.89.37 | 10.4.14.150) (shards: 2)
OS / Image: ami-0f14cab4bda57c2b2 (aws: undefined_region)
Test: byo-longevity-test-yg2
Test id: 90bfa08f-2a3d-4ba9-b443-13ce00925638
Test name: scylla-staging/yarongilor/byo-longevity-test-yg2
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):
Logs and commands
- Restore Monitor Stack command:
$ hydra investigate show-monitor 90bfa08f-2a3d-4ba9-b443-13ce00925638
- Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs 90bfa08f-2a3d-4ba9-b443-13ce00925638
Logs:
- core.scylla-elasticity-test-nemesis-master-db-node-90bfa08f-1-2024-12-19_20-03-42.gz - https://storage.cloud.google.com/upload.scylladb.com/core.systemd-network.998.93c36ffde86d40fcae8d05712785715b.489.1734638559000000./core.systemd-network.998.93c36ffde86d40fcae8d05712785715b.489.1734638559000000.zst