Description
Issue description
- [ ] This issue is a regression.
- [ ] It is unknown if this issue is a regression.
I am not 100% certain this problem is caused by the manager, but in all of the occurrences it happened while one of the manager's tasks was running (maybe that is just a coincidence, though).
If you think it has nothing to do with the manager, please move this issue to a different repo (I guess to scylla-enterprise).
In the problematic longevity build we created a new Cloud cluster in Staging and triggered the following nemeses on it:
| Start | End | Nemesis |
| --- | --- | --- |
| 2023-08-04 08:17:20 | 2023-08-04 08:52:20 | disrupt_rolling_restart_cluster |
| 2023-08-04 09:23:01 | 2023-08-04 23:34:46 | disrupt_mgmt_backup |
In parallel, the load with the following config was running:
```yaml
prepare_write_cmd:
- "cassandra-stress write cl=QUORUM n=10485760 -schema 'replication(factor=3) compaction(strategy=SizeTieredCompactionStrategy)' -port jmx=6868 -mode cql3 native -rate threads=30 -pop seq=1..10485760 -col 'n=FIXED(10) size=FIXED(512)' -log interval=5"
stress_cmd:
- "cassandra-stress write cl=QUORUM duration=1400m -schema 'replication(factor=3) compaction(strategy=SizeTieredCompactionStrategy)' -port jmx=6868 -mode cql3 native -rate threads=10 -pop seq=1..10485760 -col 'n=FIXED(10) size=FIXED(512)' -log interval=5"
- "cassandra-stress read cl=QUORUM duration=1400m -schema 'replication(factor=3) compaction(strategy=SizeTieredCompactionStrategy)' -port jmx=6868 -mode cql3 native -rate threads=10 -pop seq=1..10485760 -col 'n=FIXED(10) size=FIXED(512)' -log interval=5"
```
During the disrupt_mgmt_backup nemesis, the backup task was started via sctool:
```
< t:2023-08-04 09:23:37,080 f:remote_base.py l:520 c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "/opt/scylla/scylla-cloud -c /opt/scylla/scylla-dbaas.yml cluster manager sctool --cluster-id 11232 -- backup --location AWS_EU_WEST_1:s3:scylla-cloud-backup-11232-11016-v80c7h"...
```
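While investigating, the backup task status can be followed with sctool. Below is a minimal sketch, assuming the scylla-cloud wrapper forwards arbitrary sctool arguments after `--` the same way as the backup command above, and using a placeholder `<task-uuid>`:

```bash
# List manager tasks for the cluster (cluster-id taken from the command above)
/opt/scylla/scylla-cloud -c /opt/scylla/scylla-dbaas.yml cluster manager sctool --cluster-id 11232 -- tasks

# Follow the backup task's progress; replace <task-uuid> with the ID reported by the previous command
/opt/scylla/scylla-cloud -c /opt/scylla/scylla-dbaas.yml cluster manager sctool --cluster-id 11232 -- progress backup/<task-uuid>
```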
While the task was running, the storage utilization on the node scylla-cloud-operations-24h-devel-db-node-f04b5124-1 (172.18.122.20) spiked and eventually hit 100%. The node threw a `No space left on device` error:
```
2023-08-04 17:44:31.526 <2023-08-04 17:44:31.162>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=c1b5b322-9df8-41e9-ab26-fbb18cd9d7f1: type=NO_SPACE_ERROR regex=No space left on device line_number=7273 node=scylla-cloud-operations-24h-devel-db-node-f04b5124-1
2023-08-04T17:44:31.162 ip-172-18-122-20 !ERROR | scylla[24388]: [shard 0] storage_service - Shutting down communications due to I/O errors until operator intervention: Disk error: std::system_error (error system:28, No space left on device)
```
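One way a manager backup can add disk pressure (not confirmed as the root cause here) is through the snapshot it takes before uploading: the snapshot hard-links keep old SSTables on disk until the snapshot is cleaned up. A minimal diagnostic sketch for the affected node, assuming the default `/var/lib/scylla` data directory layout:

```bash
# Overall usage of the data mount (default Scylla data directory assumed)
df -h /var/lib/scylla

# Space held by snapshots vs. live SSTables, as reported by the node itself
nodetool listsnapshots

# Per-table snapshot directories, largest first (default data layout; adjust if data_file_directories differs)
sudo du -sh /var/lib/scylla/data/*/*/snapshots 2>/dev/null | sort -rh | head
```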
Here are the storage utilization graphs from the monitor during the whole build (the red line is the problematic node):
Some other graphs from the monitor
Impact
One of the cluster nodes runs out of storage space and stops responding.
How frequently does it reproduce?
It reproduced a few times in a row; the other reproducers will be added in the following comments.
Installation details
Kernel Version: 5.15.0-1039-aws
Scylla version (or git commit hash): `2022.2.11-20230705.27d29485de90` with build-id `f467a0ad8869d61384d8bbc8f20e4fb8fd281f4b`
Cloud manager version: 3.1.0
Cluster size: 6 nodes (i4i.large)
Scylla Nodes used in this run:
- scylla-cloud-operations-24h-devel-db-node-f04b5124-6 (null | 172.18.120.40) (shards: 2)
- scylla-cloud-operations-24h-devel-db-node-f04b5124-5 (null | 172.18.120.12) (shards: 2)
- scylla-cloud-operations-24h-devel-db-node-f04b5124-4 (null | 172.18.121.241) (shards: 2)
- scylla-cloud-operations-24h-devel-db-node-f04b5124-3 (null | 172.18.122.54) (shards: 2)
- scylla-cloud-operations-24h-devel-db-node-f04b5124-2 (null | 172.18.121.23) (shards: 2)
- scylla-cloud-operations-24h-devel-db-node-f04b5124-1 (null | 172.18.122.20) (shards: 2)
OS / Image: `` (aws: undefined_region)
Test: scylla-cloud-longevity-terraform-operations-24h-aws
Test id: f04b5124-5beb-4019-b3c1-3559c4726f7d
Test name: siren-tests/longevity-tests/staging/scylla-cloud-longevity-terraform-operations-24h-aws
Test config file(s):
Logs and commands
- Restore Monitor Stack command: `$ hydra investigate show-monitor f04b5124-5beb-4019-b3c1-3559c4726f7d`
- Restore monitor on AWS instance using Jenkins job
- Show all stored logs command: `$ hydra investigate show-logs f04b5124-5beb-4019-b3c1-3559c4726f7d`
Logs:
- db-cluster-f04b5124.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/f04b5124-5beb-4019-b3c1-3559c4726f7d/20230805_001510/db-cluster-f04b5124.tar.gz
- sct-runner-events-f04b5124.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/f04b5124-5beb-4019-b3c1-3559c4726f7d/20230805_001510/sct-runner-events-f04b5124.tar.gz
- sct-f04b5124.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/f04b5124-5beb-4019-b3c1-3559c4726f7d/20230805_001510/sct-f04b5124.log.tar.gz
- loader-set-f04b5124.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/f04b5124-5beb-4019-b3c1-3559c4726f7d/20230805_001510/loader-set-f04b5124.tar.gz
- monitor-set-f04b5124.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/f04b5124-5beb-4019-b3c1-3559c4726f7d/20230805_001510/monitor-set-f04b5124.tar.gz
- siren-manager-set-f04b5124.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/f04b5124-5beb-4019-b3c1-3559c4726f7d/20230805_001510/siren-manager-set-f04b5124.tar.gz