full-scan read timeout issues #9994
-
cc: @pehala, @temichus, @aleksbykov, @fruch, @roydahan
-
This also looks wrong: full-scan reports a successful run on a decommissioned node.
-
@dkropachev, following the resolution of scylladb/scylladb#22911, could this issue also be related to scylladb/cassandra-stress#30?
-
C-S is not using the retry mechanism of the driver (neither are scylla-bench and latte). It might be a driver issue with the retry policy being used, i.e. the exponential backoff one. Also, why are we discussing this here and not in an issue?
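For context, here is a minimal sketch (in Python, since scylla-cluster-tests is Python-based) of how a read-timeout retry policy can be attached to a driver session. The 3-retry budget and the policy behavior are assumptions for illustration only, not what any of these tools actually configure:

```python
# Minimal sketch of a driver-side retry policy (Python cassandra/scylla driver).
# The retry budget of 3 is a hypothetical value, not taken from any real tool.
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import RetryPolicy

class RetryReadTimeouts(RetryPolicy):
    """Retry read timeouts a few times at the same consistency level."""

    def on_read_timeout(self, query, consistency, required_responses,
                        received_responses, data_retrieved, retry_num):
        if retry_num < 3:  # hypothetical retry budget
            return self.RETRY, consistency
        return self.RETHROW, None

profile = ExecutionProfile(retry_policy=RetryReadTimeouts())
cluster = Cluster(execution_profiles={EXEC_PROFILE_DEFAULT: profile})
session = cluster.connect()
```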
-
This discussion is related to #9284
The problem: there are read timeouts for the full-scan thread during the rolling-restart nemesis; other tools like c-s and s-b don't experience such issues.
There could be a few directions to follow:
The following scenario was tested in order to verify that the CQL patient connection works as expected:
master...yarongilor:scylla-cluster-tests:check_qcl_connection
This test essentially passed.
So we can tentatively conclude, for example, that the "node" parameter of the CQL patient connection is unneeded and confusing.
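To illustrate why that parameter can be confusing, here is a minimal sketch of how the Python driver uses a given node only as an initial contact point and then load-balances across all live nodes (the address is hypothetical):

```python
# Sketch: the driver uses contact_points only for initial discovery; once
# connected, it talks to all live nodes, so queries can keep succeeding even
# after the original contact node is restarted or decommissioned.
from cassandra.cluster import Cluster

cluster = Cluster(contact_points=["10.0.0.1"])  # hypothetical address of node-1
session = cluster.connect()
# ... node-1 goes down or is decommissioned here ...
session.execute("SELECT key FROM system.local")  # routed to another live node
```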
The unexpected "bug" found in the test code is that I expected it to remove the three original cluster nodes (1, 2, and 3), but it removed nodes 1, 3, and 5 instead.
Argus link
As a next step, we can think of a minimal reproducer for a rolling restart nemesis + background full-scan queries.
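A rough sketch of what the background full-scan part of such a reproducer could look like, assuming a hypothetical keyspace1.standard1 table; the contact points, fetch size, and pacing are placeholders:

```python
# Background full-scan loop to run while the rolling-restart nemesis cycles
# the nodes; all addresses and names below are hypothetical placeholders.
import time

from cassandra import ReadTimeout
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(contact_points=["10.0.0.1", "10.0.0.2", "10.0.0.3"])
session = cluster.connect()

# Page through the scan instead of issuing one huge read.
stmt = SimpleStatement("SELECT * FROM keyspace1.standard1", fetch_size=1000)

while True:
    try:
        rows_read = sum(1 for _ in session.execute(stmt))  # drain all pages
        print(f"full scan ok, {rows_read} rows")
    except ReadTimeout as exc:
        print(f"read timeout during full scan: {exc}")
    time.sleep(10)  # placeholder pacing between scans
```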