Description
During the disrupt_rolling_config_change_internode_compression nemesis, all nodes were restarted. When the target node 10.12.2.4 was restarted, scylla-bench got a connection refused error and did not continue: it produced no further output until SCT timed the command out.
2025-03-24 15:42:51.456: (ScyllaBenchEvent Severity.CRITICAL) period_type=end event_id=c638422a-fc92-4f10-9956-b99ff7686265 duration=3h5m0s: node=Node longevity-twcs-3h-2025-1-loader-node-c853738d-1 [3.236.202.116 | 10.12.2.120]
stress_cmd=scylla-bench -workload=timeseries -mode=write -replication-factor=3 -partition-count=4000 -clustering-row-count=10000 -clustering-row-size=200 -concurrency=100 -rows-per-request=1 -start-timestamp=1742819871003026357 -connection-count 100 -max-rate 30000 --timeout 120s -retry-number=30 -retry-interval=80ms,1s -duration=170m -nodes 10.12.0.6,10.12.2.4,10.12.1.251,10.12.3.2,10.12.3.81
errors:
Stress command execution failed with: Command did not complete within 11100 seconds!
Command: "sudo docker exec 01af5379cc1a9972304859727be047b8862f15198d5b1333a485236169c2db75 /bin/sh -c 'scylla-bench -workload=timeseries -mode=write -replication-factor=3 -partition-count=4000 -clustering-row-count=10000 -clustering-row-size=200 -concurrency=100 -rows-per-request=1 -start-timestamp=1742819871003026357 -connection-count 100 -max-rate 30000 --timeout 120s -retry-number=30 -retry-interval=80ms,1s -duration=170m -nodes 10.12.0.6,10.12.2.4,10.12.1.251,10.12.3.2,10.12.3.81'"
Stdout:
1h21m39s 30001 30001 0 8.6ms 3.6ms 2.6ms 1.3ms 1.1ms 786µs 829µs
1h21m40s 30000 30000 0 13ms 8.6ms 2.4ms 1.2ms 1ms 754µs 827µs
1h21m41s 30000 30000 0 11ms 4.5ms 2.7ms 2ms 1.1ms 786µs 871µs
1h21m42s 30000 30000 0 6.4ms 4.1ms 2.8ms 1.2ms 1ms 754µs 832µs
1h21m43s 30000 30000 0 6.3ms 2.3ms 1.3ms 1ms 983µs 754µs 761µs
1h21m44s 30000 30000 0 12ms 3.9ms 1.4ms 1.1ms 983µs 754µs 768µs
1h21m45s 30000 30000 0 9.2ms 3.8ms 2.4ms 1.4ms 1.1ms 754µs 834µs
1h21m46s 30000 30000 0 4.7ms 3.5ms 2.7ms 2.1ms 1.1ms 754µs 872µs
1h21m47s 30000 30000 0 8.3ms 4.6ms 2.8ms 1.1ms 1ms 754µs 810µs
1h21m48s 30000 30000 0 9.2ms 4.2ms 1.4ms 1.1ms 983µs 754µs 773µs
Stderr:
2025/03/24 13:59:40 gocql: unable to dial control conn 10.12.2.4:9042: dial tcp 10.12.2.4:9042: connect: connection refused
11100 seconds is the 170-minute stress duration plus the 15-minute extra timeout added by SCT.
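As a quick sanity check of that arithmetic, here is a minimal Python sketch. The 170 minutes come from `-duration=170m` in the stress command and the 15 minutes are the SCT margin as stated in this report; the variable names are illustrative only.

```python
from datetime import timedelta

# -duration=170m from the stress command.
stress_duration = timedelta(minutes=170)
# Extra margin SCT adds on top, as stated in this report.
sct_extra_timeout = timedelta(minutes=15)

total_timeout = stress_duration + sct_extra_timeout
timeout_seconds = int(total_timeout.total_seconds())

print(timeout_seconds)  # 11100
print(total_timeout)    # 3:05:00
```

The result agrees with both the "11100 seconds" in the error message and the `duration=3h5m0s` field of the ScyllaBenchEvent.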
The scylla-bench log ends with:
1h21m48s 30000 30000 0 9.2ms 4.2ms 1.4ms 1.1ms 983µs 754µs 773µs
2025/03/24 13:59:40 gocql: unable to dial control conn 10.12.2.4:9042: dial tcp 10.12.2.4:9042: connect: connection refused
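The timing of these two lines can be cross-checked against the stress command. The sketch below rests on two assumptions that the report does not state: that `-start-timestamp` (nanoseconds since the epoch) is close to the moment scylla-bench was launched, and that all log timestamps are in UTC. The helper `parse_elapsed` is hypothetical and only handles the `XhYmZs` form seen in the stats column.

```python
import re
from datetime import datetime, timedelta, timezone

START_TIMESTAMP_NS = 1742819871003026357  # -start-timestamp from the stress command
LAST_STATS_ELAPSED = "1h21m48s"           # first column of the last stats line
SCT_TIMEOUT_SECONDS = 11100               # from the error message


def parse_elapsed(text: str) -> timedelta:
    """Parse an elapsed time such as '1h21m48s' (hours/minutes/seconds only)."""
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?", text)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported elapsed time: {text!r}")
    hours, minutes, seconds = (int(group or 0) for group in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


# Assumption: the start timestamp approximates the launch time (UTC).
start = datetime.fromtimestamp(START_TIMESTAMP_NS // 10**9, tz=timezone.utc)
last_stats_at = start + parse_elapsed(LAST_STATS_ELAPSED)
deadline = start + timedelta(seconds=SCT_TIMEOUT_SECONDS)

print(start.isoformat())          # 2025-03-24T12:37:51+00:00
print(last_stats_at.isoformat())  # 2025-03-24T13:59:39+00:00
print(deadline.isoformat())       # 2025-03-24T15:42:51+00:00
```

Under those assumptions, the last stats line was printed about one second before the gocql error at 13:59:40, and the computed deadline lines up with the CRITICAL event at 15:42:51. That would mean scylla-bench printed nothing for roughly 1h43m before SCT gave up on it.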
The node was being restarted at exactly that moment.
Node log:
Mar 24 13:59:40.745246 longevity-twcs-3h-2025-1-db-node-c853738d-2 systemd[1]: Stopping scylla-server.service - Scylla Server...
SCT log: multiple nodes were restarted, but only the restart of that node caused issues:
Line 188318: < t:2025-03-24 13:59:13,627 f:nemesis.py l:1056 c:sdcm.nemesis p:INFO > sdcm.nemesis.SisyphusMonkey: Restarting node Node longevity-twcs-3h-2025-1-db-node-c853738d-1 [3.92.87.153 | 10.12.0.6]
Line 193269: < t:2025-03-24 13:59:40,223 f:nemesis.py l:1056 c:sdcm.nemesis p:INFO > sdcm.nemesis.SisyphusMonkey: Restarting node Node longevity-twcs-3h-2025-1-db-node-c853738d-2 [34.200.253.220 | 10.12.2.4]
Line 197904: < t:2025-03-24 14:00:06,770 f:nemesis.py l:1056 c:sdcm.nemesis p:INFO > sdcm.nemesis.SisyphusMonkey: Restarting node Node longevity-twcs-3h-2025-1-db-node-c853738d-3 [100.27.15.197 | 10.12.1.251]
Line 202367: < t:2025-03-24 14:00:33,382 f:nemesis.py l:1056 c:sdcm.nemesis p:INFO > sdcm.nemesis.SisyphusMonkey: Restarting node Node longevity-twcs-3h-2025-1-db-node-c853738d-4 [3.238.105.201 | 10.12.3.2]
Line 205738: < t:2025-03-24 14:00:54,434 f:nemesis.py l:1056 c:sdcm.nemesis p:INFO > sdcm.nemesis.SisyphusMonkey: Restarting node Node longevity-twcs-3h-2025-1-db-node-c853738d-5 [44.223.76.189 | 10.12.3.81]
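To make the correlation explicit, the following Python sketch parses the five restart lines above and reports which restart had most recently started when the gocql error was logged. The regular expression and helper names are illustrative and are not part of SCT. Because the gocql line only has one-second resolution, restart timestamps are truncated to whole seconds before comparing.

```python
import re
from datetime import datetime

# The five "Restarting node" lines from the SCT log, without the "Line N:" prefix.
SCT_LOG_LINES = [
    "< t:2025-03-24 13:59:13,627 f:nemesis.py l:1056 c:sdcm.nemesis p:INFO > sdcm.nemesis.SisyphusMonkey: Restarting node Node longevity-twcs-3h-2025-1-db-node-c853738d-1 [3.92.87.153 | 10.12.0.6]",
    "< t:2025-03-24 13:59:40,223 f:nemesis.py l:1056 c:sdcm.nemesis p:INFO > sdcm.nemesis.SisyphusMonkey: Restarting node Node longevity-twcs-3h-2025-1-db-node-c853738d-2 [34.200.253.220 | 10.12.2.4]",
    "< t:2025-03-24 14:00:06,770 f:nemesis.py l:1056 c:sdcm.nemesis p:INFO > sdcm.nemesis.SisyphusMonkey: Restarting node Node longevity-twcs-3h-2025-1-db-node-c853738d-3 [100.27.15.197 | 10.12.1.251]",
    "< t:2025-03-24 14:00:33,382 f:nemesis.py l:1056 c:sdcm.nemesis p:INFO > sdcm.nemesis.SisyphusMonkey: Restarting node Node longevity-twcs-3h-2025-1-db-node-c853738d-4 [3.238.105.201 | 10.12.3.2]",
    "< t:2025-03-24 14:00:54,434 f:nemesis.py l:1056 c:sdcm.nemesis p:INFO > sdcm.nemesis.SisyphusMonkey: Restarting node Node longevity-twcs-3h-2025-1-db-node-c853738d-5 [44.223.76.189 | 10.12.3.81]",
]

# Timestamp of the gocql "connection refused" line (second resolution).
GOCQL_ERROR_TIME = datetime(2025, 3, 24, 13, 59, 40)

RESTART_RE = re.compile(
    r"t:(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*"
    r"Restarting node Node (?P<name>\S+) "
    r"\[(?P<public>[\d.]+) \| (?P<private>[\d.]+)\]"
)


def parse_restarts(lines):
    """Return (timestamp, node name, private IP) for each restart line."""
    restarts = []
    for line in lines:
        match = RESTART_RE.search(line)
        if match:
            ts = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S,%f")
            restarts.append((ts, match["name"], match["private"]))
    return restarts


def latest_restart_before(restarts, when):
    """Return the most recent restart that began no later than `when`.

    Restart times are truncated to whole seconds to match `when`.
    """
    started = [r for r in restarts if r[0].replace(microsecond=0) <= when]
    return max(started, key=lambda r: r[0]) if started else None


restarts = parse_restarts(SCT_LOG_LINES)
suspect = latest_restart_before(restarts, GOCQL_ERROR_TIME)
print(suspect[1], suspect[2])
# longevity-twcs-3h-2025-1-db-node-c853738d-2 10.12.2.4
```

The match is db-node-2 (10.12.2.4), the same address that appears in the gocql error. The three later restarts began after the error was logged, and scylla-bench printed no further output for them.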
Packages
Scylla version: 2025.1.0~rc4-20250323.bc983017832c
with build-id 088ceb686f4b2d57120be368a4b86d6ac0a04cd5
Kernel Version: 6.8.0-1024-aws
Installation details
Cluster size: 5 nodes (i4i.2xlarge)
Scylla Nodes used in this run:
- longevity-twcs-3h-2025-1-db-node-c853738d-6 (3.218.143.226 | 10.12.1.229) (shards: -1)
- longevity-twcs-3h-2025-1-db-node-c853738d-5 (44.223.76.189 | 10.12.3.81) (shards: 7)
- longevity-twcs-3h-2025-1-db-node-c853738d-4 (3.238.105.201 | 10.12.3.2) (shards: 7)
- longevity-twcs-3h-2025-1-db-node-c853738d-3 (100.27.15.197 | 10.12.1.251) (shards: 7)
- longevity-twcs-3h-2025-1-db-node-c853738d-2 (34.200.253.220 | 10.12.2.4) (shards: 7)
- longevity-twcs-3h-2025-1-db-node-c853738d-1 (3.92.87.153 | 10.12.0.6) (shards: 7)
OS / Image: ami-0cb84a63946021a33 (aws: undefined_region)
Test: longevity-twcs-3h-test
Test id: c853738d-b379-456d-b6c3-3257ecd27d86
Test name: scylla-2025.1/longevity/longevity-twcs-3h-test
Test method: longevity_twcs_test.TWCSLongevityTest.test_twcs_longevity
Test config file(s):
Logs and commands
- Restore Monitor Stack command:
$ hydra investigate show-monitor c853738d-b379-456d-b6c3-3257ecd27d86
- Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs c853738d-b379-456d-b6c3-3257ecd27d86
Logs:
- longevity-twcs-3h-2025-1-db-node-c853738d-6 - https://cloudius-jenkins-test.s3.amazonaws.com/c853738d-b379-456d-b6c3-3257ecd27d86/20250324_123101/longevity-twcs-3h-2025-1-db-node-c853738d-6-c853738d.tar.zst
- db-cluster-c853738d.tar.zst - https://cloudius-jenkins-test.s3.amazonaws.com/c853738d-b379-456d-b6c3-3257ecd27d86/20250324_154704/db-cluster-c853738d.tar.zst
- sct-runner-events-c853738d.tar.zst - https://cloudius-jenkins-test.s3.amazonaws.com/c853738d-b379-456d-b6c3-3257ecd27d86/20250324_154704/sct-runner-events-c853738d.tar.zst
- sct-c853738d.log.tar.zst - https://cloudius-jenkins-test.s3.amazonaws.com/c853738d-b379-456d-b6c3-3257ecd27d86/20250324_154704/sct-c853738d.log.tar.zst
- loader-set-c853738d.tar.zst - https://cloudius-jenkins-test.s3.amazonaws.com/c853738d-b379-456d-b6c3-3257ecd27d86/20250324_154704/loader-set-c853738d.tar.zst
- monitor-set-c853738d.tar.zst - https://cloudius-jenkins-test.s3.amazonaws.com/c853738d-b379-456d-b6c3-3257ecd27d86/20250324_154704/monitor-set-c853738d.tar.zst
- parallel-timelines-report-c853738d.tar.zst - https://cloudius-jenkins-test.s3.amazonaws.com/c853738d-b379-456d-b6c3-3257ecd27d86/20250324_154704/parallel-timelines-report-c853738d.tar.zst
- builder-c853738d.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/c853738d-b379-456d-b6c3-3257ecd27d86/upload_20250324_154828/builder-c853738d.log.tar.gz