SB raised critical event while attempt to run load on the removed node #171

Open
Description

@timtimb0t

Packages

Scylla version: 2025.1.0~rc4-20250323.bc983017832c with build-id 088ceb686f4b2d57120be368a4b86d6ac0a04cd5

Kernel Version: 6.8.0-1024-aws

Issue description

scylla-bench (SB) generated the following critical event:

2025-03-25 00:46:27.003: (ScyllaBenchEvent Severity.CRITICAL) period_type=end event_id=fbbe6055-b1f7-485e-8bc2-94ec977b651f duration=12h10m0s: node=Node longevity-twcs-48h-2025-1-loader-node-4b3cfc69-3 [18.207.92.193 | 10.12.9.109]
stress_cmd=scylla-bench -workload=timeseries -mode=read -partition-count=4000 -concurrency=150 -replication-factor=3 -clustering-row-count=10000 -clustering-row-size=200  -rows-per-request=1 -start-timestamp=1742819763459321842 -write-rate 200 -distribution uniform --connection-count 100 -duration=720m -timeout=30s -retry-number=30 -retry-interval=80ms,1s -nodes 10.12.8.142,10.12.11.100,10.12.11.39,10.12.8.35
errors:
Stress command execution failed with: Command did not complete within 43800 seconds!
Command: "sudo  docker exec dc379d85867fcaf9adb599284c98705c92aca0b7dde4ddef6255b6ab67841c27 /bin/sh -c 'scylla-bench -workload=timeseries -mode=read -partition-count=4000 -concurrency=150 -replication-factor=3 -clustering-row-count=10000 -clustering-row-size=200  -rows-per-request=1 -start-timestamp=1742819763459321842 -write-rate 200 -distribution uniform --connection-count 100 -duration=720m -timeout=30s -retry-number=30 -retry-interval=80ms,1s -nodes 10.12.8.142,10.12.11.100,10.12.11.39,10.12.8.35'"
Stdout:
3h19m52s   74920       0      0 19ms   11ms   5.9ms  4.4ms  3.7ms  1.7ms  2ms
3h19m53s   70237       0      0 22ms   11ms   5.9ms  4.4ms  3.9ms  1.9ms  2.1ms
3h19m54s   73280       0      0 24ms   9.2ms  5.8ms  4.3ms  3.6ms  1.8ms  2ms
3h19m55s   83978       0      0 17ms   9.7ms  5.7ms  3.9ms  3.2ms  1.4ms  1.8ms
3h19m56s   84773       0      0 20ms   10ms   5.9ms  4ms    3.1ms  1.4ms  1.8ms
3h19m57s   84805       0      0 24ms   11ms   5.7ms  3.8ms  3.1ms  1.4ms  1.8ms
3h19m58s   83197       0      0 24ms   12ms   6.1ms  4ms    3.2ms  1.4ms  1.8ms
3h19m59s   84793       0      0 23ms   11ms   6ms    3.9ms  3.1ms  1.4ms  1.8ms
3h20m0s   83973       0      0 21ms   10ms   6ms    3.9ms  3.2ms  1.4ms  1.8ms
3h20m1s   87626       0      0 14ms   8.3ms  4.5ms  3.3ms  2.9ms  1.4ms  1.7ms
Stderr:
2025/03/24 13:55:34 gocql: unable to dial control conn 10.12.11.39:9042: dial tcp 10.12.11.39:9042: connect: connection refused
2025/03/24 15:56:29 gocql: unable to dial control conn 10.12.11.100:9042: dial tcp 10.12.11.100:9042: connect: connection refused
2025/03/24 17:55:02 error: failed to connect to "[HostInfo hostname=\"10.12.10.15\" connectAddress=\"10.12.10.15\" peer=\"10.12.10.15\" rpc_address=\"10.12.10.15\" broadcast_address=\"<nil>\" preferred_ip=\"<nil>\" connect_addr=\"10.12.10.15\" connect_addr_source=\"connect_address\" port=9042 data_centre=\"us-east\" rack=\"1c\" host_id=\"95ead47a-461b-4a5b-9e42-d730eb16b0b8\" version=\"v3.0.8\" state=DOWN num_tokens=256]" due to error: gocql: no response received from cassandra within timeout period (potentially executed: true)
2025/03/24 17:56:05 error: failed to connect to "[HostInfo hostname=\"10.12.10.15\" connectAddress=\"10.12.10.15\" peer=\"10.12.10.15\" rpc_address=\"10.12.10.15\" broadcast_address=\"<nil>\" preferred_ip=\"<nil>\" connect_addr=\"10.12.10.15\" connect_addr_source=\"connect_address\" port=9042 data_centre=\"us-east\" rack=\"1c\" host_id=\"95ead47a-461b-4a5b-9e42-d730eb16b0b8\" version=\"v3.0.8\" state=DOWN num_tokens=256]" due to error: gocql: no response received from cassandra within timeout period (potentially executed: true)
2025/03/24 17:57:05 error: failed to connect to "[HostInfo hostname=\"10.12.10.15\" connectAddress=\"10.12.10.15\" peer=\"10.12.10.15\" rpc_address=\"10.12.10.15\" broadcast_address=\"<nil>\" preferred_ip=\"<nil>\" connect_addr=\"10.12.10.15\" connect_addr_source=\"connect_address\" port=9042 data_centre=\"us-east\" rack=\"1c\" host_id=\"95ead47a-461b-4a5b-9e42-d730eb16b0b8\" version=\"v3.0.8\" state=DOWN num_tokens=256]" due to error: gocql: no response received from cassandra within timeout period (potentially executed: true)
2025/03/24 17:58:02 error: failed to connect to "[HostInfo hostname=\"10.12.10.15\" connectAddress=\"10.12.10.15\" peer=\"10.12.10.15\" rpc_address=\"10.12.10.15\" broadcast_address=\"<nil>\" preferred_ip=\"<nil>\" connect_addr=\"10.12.10.15\" connect_addr_source=\"connect_address\" port=9042 data_centre=\"us-east\" rack=\"1c\" host_id=\"95ead47a-461b-4a5b-9e42-d730eb16b0b8\" version=\"v3.0.8\" state=DOWN num_tokens=256]" due to error: gocql: no response received from cassandra within timeout period (potentially executed: true)
2025/03/24 17:59:05 error: failed to connect to "[HostInfo hostname=\"10.12.10.15\" connectAddress=\"10.12.10.15\" peer=\"10.12.10.15\" rpc_address=\"10.12.10.15\" broadcast_address=\"<nil>\" preferred_ip=\"<nil>\" connect_addr=\"10.12.10.15\" connect_addr_source=\"connect_address\" port=9042 data_centre=\"us-east\" rack=\"1c\" host_id=\"95ead47a-461b-4a5b-9e42-d730eb16b0b8\" version=\"v3.0.8\" state=DOWN num_tokens=256]" due to error: gocql: no response received from cassandra within timeout period (potentially executed: true)
2025/03/24 18:00:05 error: failed to connect to "[HostInfo hostname=\"10.12.10.15\" connectAddress=\"10.12.10.15\" peer=\"10.12.10.15\" rpc_address=\"10.12.10.15\" broadcast_address=\"<nil>\" preferred_ip=\"<nil>\" connect_addr=\"10.12.10.15\" connect_addr_source=\"connect_address\" port=9042 data_centre=\"us-east\" rack=\"1c\" host_id=\"95ead47a-461b-4a5b-9e42-d730eb16b0b8\" version=\"v3.0.8\" state=DOWN num_tokens=256]" due to error: gocql: no response received from cassandra within timeout period (potentially executed: true)
2025/03/24 18:01:05 error: failed to connect to "[HostInfo hostname=\"10.12.10.15\" connectAddress=\"10.12.10.15\" peer=\"10.12.10.15\" rpc_address=\"10.12.10.15\" broadcast_address=\"<nil>\" preferred_ip=\"<nil>\" connect_addr=\"10.12.10.15\" connect_addr_source=\"connect_address\" port=9042 data_centre=\"us-east\" rack=\"1c\" host_id=\"95ead47a-461b-4a5b-9e42-d730eb16b0b8\" version=\"v3.0.8\" state=DOWN num_tokens=256]" due to error: gocql: no response received from cassandra within timeout period (potentially executed: true)
2025/03/24 18:02:05 error: failed to connect to "[HostInfo hostname=\"10.12.10.15\" connectAddress=\"10.12.10.15\" peer=\"10.12.10.15\" rpc_address=\"10.12.10.15\" broadcast_address=\"<nil>\" preferred_ip=\"<nil>\" connect_addr=\"10.12.10.15\" connect_addr_source=\"connect_address\" port=9042 data_centre=\"us-east\" rack=\"1c\" host_id=\"95ead47a-461b-4a5b-9e42-d730eb16b0b8\" version=\"v3.0.8\" state=DOWN num_tokens=256]" due to error: gocql: no response received from cassandra within timeout period (potentially executed: true)

Node 10.12.10.15 was the target of one of the previously executed nemeses and had been removed from the cluster, but scylla-bench still tried to execute CQL commands against it.
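The stderr above shows the driver repeatedly dialing a host whose state it already reports as `state=DOWN`, which suggests the client's host list was not pruned after the node's removal. As an illustration only (this is not scylla-bench's or gocql's actual code; the `Host` type and `filterLive` helper are hypothetical), a client-side guard could skip hosts marked DOWN before attempting a connection:

```go
package main

import "fmt"

// Host is a hypothetical, simplified stand-in for the driver's
// host metadata (gocql's HostInfo carries similar state fields).
type Host struct {
	Addr  string
	State string // "UP" or "DOWN"
}

// filterLive returns only the hosts currently marked UP, so that
// connection attempts skip nodes removed from the cluster.
func filterLive(hosts []Host) []Host {
	live := make([]Host, 0, len(hosts))
	for _, h := range hosts {
		if h.State == "UP" {
			live = append(live, h)
		}
	}
	return live
}

func main() {
	hosts := []Host{
		{Addr: "10.12.8.142", State: "UP"},
		{Addr: "10.12.10.15", State: "DOWN"}, // the removed node from the log
	}
	for _, h := range filterLive(hosts) {
		fmt.Println(h.Addr)
	}
}
```

For reference, gocql does expose a `HostFilter` hook on `ClusterConfig` that can exclude hosts from the pool; whether that would help here depends on whether the driver's topology metadata was actually refreshed after the removal, which the repeated `state=DOWN` dial attempts put in doubt.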

Impact

No explicit impact

Installation details

Cluster size: 4 nodes (i3en.2xlarge)

Scylla Nodes used in this run:

  • longevity-twcs-48h-2025-1-db-node-4b3cfc69-9 (54.237.128.252 | 10.12.8.252) (shards: 7)
  • longevity-twcs-48h-2025-1-db-node-4b3cfc69-8 (18.234.45.247 | 10.12.11.84) (shards: 7)
  • longevity-twcs-48h-2025-1-db-node-4b3cfc69-7 (34.227.86.176 | 10.12.11.59) (shards: 7)
  • longevity-twcs-48h-2025-1-db-node-4b3cfc69-6 (18.212.213.116 | 10.12.10.15) (shards: 7)
  • longevity-twcs-48h-2025-1-db-node-4b3cfc69-5 (54.92.197.116 | 10.12.9.151) (shards: 7)
  • longevity-twcs-48h-2025-1-db-node-4b3cfc69-4 (3.89.162.251 | 10.12.8.35) (shards: 7)
  • longevity-twcs-48h-2025-1-db-node-4b3cfc69-3 (54.144.251.188 | 10.12.11.39) (shards: 7)
  • longevity-twcs-48h-2025-1-db-node-4b3cfc69-2 (54.160.168.112 | 10.12.11.100) (shards: 7)
  • longevity-twcs-48h-2025-1-db-node-4b3cfc69-1 (98.84.173.76 | 10.12.8.142) (shards: 7)

OS / Image: ami-0cb84a63946021a33 (aws: undefined_region)

Test: longevity-twcs-48h-test
Test id: 4b3cfc69-57fc-4295-83f6-2411eb2b33bd
Test name: scylla-2025.1/tier1/longevity-twcs-48h-test
Test method: longevity_twcs_test.TWCSLongevityTest.test_custom_time
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 4b3cfc69-57fc-4295-83f6-2411eb2b33bd
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 4b3cfc69-57fc-4295-83f6-2411eb2b33bd

Logs:

Jenkins job URL
Argus
