Description
Packages
Scylla version: 2025.2.0~dev-20250301.0343235aa269
with build-id cdacab4180fe76c82a6aa1b48154d56318c29c93
Kernel Version: 6.8.0-1023-aws
Issue description
- This issue is a regression.
- It is unknown if this issue is a regression.
The Build HDR histogram summary step intermittently freezes; in this run it hung after the second node was decommissioned. It looks like it is stuck in sdcm.utils.hdrhistogram._HdrRangeHistogramBuilder._build_histogram_from_dir during file iteration:
for hdr_file in hdr_files:
    if os.stat(hdr_file).st_size == 0:
        LOGGER.error("File %s is empty", hdr_file)
        continue
    file_range_histogram = self._build_histogram_from_file(hdr_file, hdr_tag)
    if file_range_histogram:
        collected_histograms.append(file_range_histogram)
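As a first step, the loop above could log progress per file, so the file (or call) it hangs on is visible in the test log. This is only a sketch of a possible change to _build_histogram_from_dir, not the current SCT code; the names used (hdr_files, hdr_tag, LOGGER, self._build_histogram_from_file) are taken from the snippet above:

for idx, hdr_file in enumerate(hdr_files, start=1):
    # Hypothetical extra logging: record which file is being processed and its
    # size before any call that might block, and log again once it finishes.
    file_size = os.stat(hdr_file).st_size
    LOGGER.debug("Building histogram from file %d: %s (%d bytes)", idx, hdr_file, file_size)
    if file_size == 0:
        LOGGER.error("File %s is empty", hdr_file)
        continue
    file_range_histogram = self._build_histogram_from_file(hdr_file, hdr_tag)
    LOGGER.debug("Finished %s, got histogram: %s", hdr_file, bool(file_range_histogram))
    if file_range_histogram:
        collected_histograms.append(file_range_histogram)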
This is from the log:
< t:2025-03-02 12:24:09,801 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > 2025-03-02 12:24:09.799: (InfoEvent Severity.NORMAL) period_type=not-set event_id=0b16e1a2-2c65-440f-93e0-fab5de58c84e: message=FinishEvent - ShrinkCluster has done decommissioning 1 nodes
< t:2025-03-02 12:24:41,360 f:decorators.py l:267 c:sdcm.utils.decorators p:DEBUG > hdr: [{'WRITE': {'start_time': 1740915920999.0, 'end_time': 1740916521000.0, 'stddev': 'percentile_99_9': 9.51, 'percentile_99_99': 36.77, 'percentile_99_999': 40.86, 'throughput': 17528}}]
< t:2025-03-02 12:24:41,370 f:tester.py l:3816 c:PerformanceRegressionTest p:INFO > Build HDR histogram (tags: ['WRITE-rt', 'READ-rt']) with start time: 1740915916.0573575, end time: 1740918249.8044257; for operation: mixed
< t:2025-03-02 12:24:51,824 f:hdrhistogram.py l:243 c:/home/ubuntu/scylla-cluster-tests/sdcm/utils/hdrhistogram.py p:DEBUG > The file '/home/ubuntu/sct-results/20250302-040951-961039/perf-latency-nemesis-ubuntu-loader-set-d492be8f/perf-latency-nemesis-ubuntu-loader-node-d492be8f-1/hdrh-cs-write-l1-c0-k1-b0d9fe7d-1a6c-4c1e-82fa-cce0605690ae.hdr' does not include the time interval from `1740915916.0573575` to `1740918249.8044257`
< t:2025-03-02 12:25:15,127 f:commit_log_check_thread.py l:205 c:CommitLogCheckThread p:DEBUG > overflow_commit_log_directory: []
< t:2025-03-02 12:25:15,132 f:commit_log_check_thread.py l:214 c:CommitLogCheckThread p:DEBUG > zero_free_segments: []
< t:2025-03-02 12:35:15,145 f:commit_log_check_thread.py l:205 c:CommitLogCheckThread p:DEBUG > overflow_commit_log_directory: []
< t:2025-03-02 12:35:15,155 f:commit_log_check_thread.py l:214 c:CommitLogCheckThread p:DEBUG > zero_free_segments: []
< t:2025-03-02 12:45:15,160 f:commit_log_check_thread.py l:205 c:CommitLogCheckThread p:DEBUG > overflow_commit_log_directory: []
We need more debug logging around this code to pinpoint where the iteration gets stuck; one possible approach is sketched below.
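Independently of extra log lines, a watchdog that periodically dumps all Python thread tracebacks would show exactly which frame the build is blocked in. A minimal sketch using the standard-library faulthandler module, assuming it wraps the histogram build call in the tester (build_histogram_step() below is a placeholder, not an existing SCT function):

import faulthandler
import sys

# Hypothetical watchdog: if the HDR build step runs longer than 10 minutes,
# dump the tracebacks of all threads to stderr (and repeat every 10 minutes),
# so the stuck frame inside _build_histogram_from_dir shows up in the job log.
faulthandler.dump_traceback_later(timeout=600, repeat=True, file=sys.stderr)
try:
    build_histogram_step()  # placeholder for the actual HDR summary build call
finally:
    faulthandler.cancel_dump_traceback_later()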
Impact
The test gets stuck and times out after 2 days.
How frequently does it reproduce?
I have seen this a few times.
Installation details
Cluster size: 3 nodes (i3en.2xlarge)
Scylla Nodes used in this run:
- perf-latency-nemesis-ubuntu-db-node-d492be8f-7 (34.205.246.108 | 10.12.1.104) (shards: 7)
- perf-latency-nemesis-ubuntu-db-node-d492be8f-6 (44.199.201.208 | 10.12.1.75) (shards: 7)
- perf-latency-nemesis-ubuntu-db-node-d492be8f-5 (44.195.1.81 | 10.12.1.41) (shards: 7)
- perf-latency-nemesis-ubuntu-db-node-d492be8f-4 (3.236.155.65 | 10.12.2.157) (shards: 7)
- perf-latency-nemesis-ubuntu-db-node-d492be8f-3 (18.215.186.105 | 10.12.2.113) (shards: 7)
- perf-latency-nemesis-ubuntu-db-node-d492be8f-2 (18.213.193.244 | 10.12.0.225) (shards: -1)
- perf-latency-nemesis-ubuntu-db-node-d492be8f-1 (18.209.237.195 | 10.12.0.160) (shards: 7)
OS / Image: ami-09240a6402ddbb8ce
(aws: undefined_region)
Test: scylla-enterprise-perf-regression-latency-650gb-with-nemesis-rbno-disabled
Test id: d492be8f-f802-416d-bf31-dc236166c832
Test name: scylla-enterprise/perf-regression/scylla-enterprise-perf-regression-latency-650gb-with-nemesis-rbno-disabled
Test method: performance_regression_test.PerformanceRegressionTest.test_latency_mixed_with_nemesis
Test config file(s):
Logs and commands
- Restore Monitor Stack command:
$ hydra investigate show-monitor d492be8f-f802-416d-bf31-dc236166c832
- Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs d492be8f-f802-416d-bf31-dc236166c832
Logs:
- perf-latency-nemesis-ubuntu-db-node-d492be8f-6 - https://cloudius-jenkins-test.s3.amazonaws.com/d492be8f-f802-416d-bf31-dc236166c832/20250302_040950/perf-latency-nemesis-ubuntu-db-node-d492be8f-6-d492be8f.tar.gz
- perf-latency-nemesis-ubuntu-db-node-d492be8f-2 - https://cloudius-jenkins-test.s3.amazonaws.com/d492be8f-f802-416d-bf31-dc236166c832/20250302_040950/perf-latency-nemesis-ubuntu-db-node-d492be8f-2-d492be8f.tar.gz
- perf-latency-nemesis-ubuntu-db-node-d492be8f-1 - https://cloudius-jenkins-test.s3.amazonaws.com/d492be8f-f802-416d-bf31-dc236166c832/20250302_040950/perf-latency-nemesis-ubuntu-db-node-d492be8f-1-d492be8f.tar.gz
- perf-latency-nemesis-ubuntu-db-node-d492be8f-4 - https://cloudius-jenkins-test.s3.amazonaws.com/d492be8f-f802-416d-bf31-dc236166c832/20250302_040950/perf-latency-nemesis-ubuntu-db-node-d492be8f-4-d492be8f.tar.gz