
scylla-bench hangs when a node is restarted, after connection refused #170

Open
@cezarmoise

Description

During the disrupt_rolling_config_change_internode_compression nemesis, all nodes were restarted in turn. When the target node 10.12.2.4 was restarted, scylla-bench hit a connection refused error on its control connection and never resumed; it stayed hung until SCT killed it.

2025-03-24 15:42:51.456: (ScyllaBenchEvent Severity.CRITICAL) period_type=end event_id=c638422a-fc92-4f10-9956-b99ff7686265 duration=3h5m0s: node=Node longevity-twcs-3h-2025-1-loader-node-c853738d-1 [3.236.202.116 | 10.12.2.120]
stress_cmd=scylla-bench -workload=timeseries -mode=write -replication-factor=3 -partition-count=4000 -clustering-row-count=10000 -clustering-row-size=200 -concurrency=100 -rows-per-request=1 -start-timestamp=1742819871003026357 -connection-count 100 -max-rate 30000 --timeout 120s -retry-number=30 -retry-interval=80ms,1s -duration=170m -nodes 10.12.0.6,10.12.2.4,10.12.1.251,10.12.3.2,10.12.3.81
errors:
Stress command execution failed with: Command did not complete within 11100 seconds!
Command: "sudo docker exec 01af5379cc1a9972304859727be047b8862f15198d5b1333a485236169c2db75 /bin/sh -c 'scylla-bench -workload=timeseries -mode=write -replication-factor=3 -partition-count=4000 -clustering-row-count=10000 -clustering-row-size=200 -concurrency=100 -rows-per-request=1 -start-timestamp=1742819871003026357 -connection-count 100 -max-rate 30000 --timeout 120s -retry-number=30 -retry-interval=80ms,1s -duration=170m -nodes 10.12.0.6,10.12.2.4,10.12.1.251,10.12.3.2,10.12.3.81'"
Stdout:
1h21m39s 30001 30001 0 8.6ms 3.6ms 2.6ms 1.3ms 1.1ms 786µs 829µs
1h21m40s 30000 30000 0 13ms 8.6ms 2.4ms 1.2ms 1ms 754µs 827µs
1h21m41s 30000 30000 0 11ms 4.5ms 2.7ms 2ms 1.1ms 786µs 871µs
1h21m42s 30000 30000 0 6.4ms 4.1ms 2.8ms 1.2ms 1ms 754µs 832µs
1h21m43s 30000 30000 0 6.3ms 2.3ms 1.3ms 1ms 983µs 754µs 761µs
1h21m44s 30000 30000 0 12ms 3.9ms 1.4ms 1.1ms 983µs 754µs 768µs
1h21m45s 30000 30000 0 9.2ms 3.8ms 2.4ms 1.4ms 1.1ms 754µs 834µs
1h21m46s 30000 30000 0 4.7ms 3.5ms 2.7ms 2.1ms 1.1ms 754µs 872µs
1h21m47s 30000 30000 0 8.3ms 4.6ms 2.8ms 1.1ms 1ms 754µs 810µs
1h21m48s 30000 30000 0 9.2ms 4.2ms 1.4ms 1.1ms 983µs 754µs 773µs
Stderr:
2025/03/24 13:59:40 gocql: unable to dial control conn 10.12.2.4:9042: dial tcp 10.12.2.4:9042: connect: connection refused

11100 seconds is the 170-minute stress duration plus the 15 minutes of extra timeout that SCT adds: (170 + 15) min × 60 s/min = 11100 s.
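For context, the watchdog behaviour SCT applies can be illustrated with a minimal Go sketch (SCT itself is Python; the command and durations below mirror this run, everything else is illustrative):

```go
package main

import (
	"context"
	"log"
	"os/exec"
	"time"
)

func main() {
	// Bound the stress run at its planned duration (170m) plus SCT's 15m
	// grace period; a hung scylla-bench is then killed at 11100s.
	ctx, cancel := context.WithTimeout(context.Background(), (170+15)*time.Minute)
	defer cancel()

	// Illustrative subset of the real command line from this run.
	cmd := exec.CommandContext(ctx, "scylla-bench",
		"-workload=timeseries", "-mode=write", "-duration=170m",
		"-nodes", "10.12.0.6,10.12.2.4,10.12.1.251,10.12.3.2,10.12.3.81")
	if err := cmd.Run(); err != nil {
		if ctx.Err() == context.DeadlineExceeded {
			log.Fatal("stress command did not complete within 11100 seconds")
		}
		log.Fatalf("scylla-bench failed: %v", err)
	}
}
```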

The scylla-bench log ends with:

1h21m48s   30000   30000      0 9.2ms  4.2ms  1.4ms  1.1ms  983µs  754µs  773µs  
2025/03/24 13:59:40 gocql: unable to dial control conn 10.12.2.4:9042: dial tcp 10.12.2.4:9042: connect: connection refused
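The error suggests the gocql control connection to 10.12.2.4 was never re-established after the restart. For illustration, here is a minimal gocql sketch (not scylla-bench's actual code; the keyspace and retry values are assumptions) of the driver-side knobs that determine whether a downed host is retried:

```go
package main

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	// Contact points from this run; any of them can end up hosting
	// the control connection.
	cluster := gocql.NewCluster("10.12.0.6", "10.12.2.4", "10.12.1.251",
		"10.12.3.2", "10.12.3.81")
	cluster.Keyspace = "scylla_bench" // assumption: scylla-bench's default keyspace
	cluster.Timeout = 120 * time.Second

	// Keep retrying hosts marked DOWN instead of giving up on them.
	cluster.ReconnectInterval = 5 * time.Second
	cluster.ReconnectionPolicy = &gocql.ConstantReconnectionPolicy{
		MaxRetries: 30,
		Interval:   5 * time.Second,
	}

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("create session: %v", err)
	}
	defer session.Close()

	// ... run the workload; when 10.12.2.4 restarts, the driver should move
	// the control connection to another live host rather than hang on dial.
}
```

Whether scylla-bench configures anything along these lines is worth checking; the observed hang is consistent with the control connection being dialed only against the restarted host.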

The node was restarted at that exact moment.
Node log:

Mar 24 13:59:40.745246 longevity-twcs-3h-2025-1-db-node-c853738d-2 systemd[1]: Stopping scylla-server.service - Scylla Server...

SCT log: multiple nodes were restarted, but only the restart of that node (10.12.2.4) caused issues:

Line 188318: < t:2025-03-24 13:59:13,627 f:nemesis.py      l:1056 c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: Restarting node Node longevity-twcs-3h-2025-1-db-node-c853738d-1 [3.92.87.153 | 10.12.0.6]
Line 193269: < t:2025-03-24 13:59:40,223 f:nemesis.py      l:1056 c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: Restarting node Node longevity-twcs-3h-2025-1-db-node-c853738d-2 [34.200.253.220 | 10.12.2.4]
Line 197904: < t:2025-03-24 14:00:06,770 f:nemesis.py      l:1056 c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: Restarting node Node longevity-twcs-3h-2025-1-db-node-c853738d-3 [100.27.15.197 | 10.12.1.251]
Line 202367: < t:2025-03-24 14:00:33,382 f:nemesis.py      l:1056 c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: Restarting node Node longevity-twcs-3h-2025-1-db-node-c853738d-4 [3.238.105.201 | 10.12.3.2]
Line 205738: < t:2025-03-24 14:00:54,434 f:nemesis.py      l:1056 c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: Restarting node Node longevity-twcs-3h-2025-1-db-node-c853738d-5 [44.223.76.189 | 10.12.3.81]

Packages

Scylla version: 2025.1.0~rc4-20250323.bc983017832c with build-id 088ceb686f4b2d57120be368a4b86d6ac0a04cd5
Kernel Version: 6.8.0-1024-aws

Installation details

Cluster size: 5 nodes (i4i.2xlarge)

Scylla Nodes used in this run:

  • longevity-twcs-3h-2025-1-db-node-c853738d-6 (3.218.143.226 | 10.12.1.229) (shards: -1)
  • longevity-twcs-3h-2025-1-db-node-c853738d-5 (44.223.76.189 | 10.12.3.81) (shards: 7)
  • longevity-twcs-3h-2025-1-db-node-c853738d-4 (3.238.105.201 | 10.12.3.2) (shards: 7)
  • longevity-twcs-3h-2025-1-db-node-c853738d-3 (100.27.15.197 | 10.12.1.251) (shards: 7)
  • longevity-twcs-3h-2025-1-db-node-c853738d-2 (34.200.253.220 | 10.12.2.4) (shards: 7)
  • longevity-twcs-3h-2025-1-db-node-c853738d-1 (3.92.87.153 | 10.12.0.6) (shards: 7)

OS / Image: ami-0cb84a63946021a33 (aws: undefined_region)

Test: longevity-twcs-3h-test
Test id: c853738d-b379-456d-b6c3-3257ecd27d86
Test name: scylla-2025.1/longevity/longevity-twcs-3h-test
Test method: longevity_twcs_test.TWCSLongevityTest.test_twcs_longevity
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor c853738d-b379-456d-b6c3-3257ecd27d86
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs c853738d-b379-456d-b6c3-3257ecd27d86

Logs:

Jenkins job URL
Argus
