
The Build HDR histogram summary is intermittently freezing #10262

Open
@juliayakovlev

Description


Packages

Scylla version: 2025.2.0~dev-20250301.0343235aa269 with build-id cdacab4180fe76c82a6aa1b48154d56318c29c93
Kernel Version: 6.8.0-1023-aws

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

The Build HDR histogram summary intermittently freezes, as seen in a recent test after the second node's decommission. It appears to be stuck in sdcm.utils.hdrhistogram._HdrRangeHistogramBuilder._build_histogram_from_dir during file iteration:

        for hdr_file in hdr_files:
            if os.stat(hdr_file).st_size == 0:
                LOGGER.error("File %s is empty", hdr_file)
                continue

            file_range_histogram = self._build_histogram_from_file(hdr_file, hdr_tag)
            if file_range_histogram:
                collected_histograms.append(file_range_histogram)
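To make the stuck file identifiable from the log, the loop could log progress and elapsed time around each parse. A minimal sketch (iterate_hdr_files and its signature are hypothetical; LOGGER and the empty-file check mirror the snippet above):

```python
import logging
import os
import time

LOGGER = logging.getLogger(__name__)


def iterate_hdr_files(hdr_files, build_histogram_from_file, hdr_tag):
    """Hypothetical variant of the loop above with per-file progress logging.

    Logging before and after each parse, with elapsed time, makes the
    offending file visible in the log if the loop freezes mid-iteration.
    """
    collected_histograms = []
    for idx, hdr_file in enumerate(hdr_files, start=1):
        if os.stat(hdr_file).st_size == 0:
            LOGGER.error("File %s is empty", hdr_file)
            continue
        LOGGER.debug("Parsing HDR file %d/%d: %s", idx, len(hdr_files), hdr_file)
        started = time.perf_counter()
        file_range_histogram = build_histogram_from_file(hdr_file, hdr_tag)
        LOGGER.debug("Parsed %s in %.1fs", hdr_file, time.perf_counter() - started)
        if file_range_histogram:
            collected_histograms.append(file_range_histogram)
    return collected_histograms
```

With DEBUG logging enabled, the last "Parsing HDR file …" line without a matching "Parsed …" line points directly at the file that hangs.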

From the log:

< t:2025-03-02 12:24:09,801 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2025-03-02 12:24:09.799: (InfoEvent Severity.NORMAL) period_type=not-set event_id=0b16e1a2-2c65-440f-93e0-fab5de58c84e: message=FinishEvent - ShrinkCluster has done decommissioning 1 nodes

< t:2025-03-02 12:24:41,360 f:decorators.py   l:267  c:sdcm.utils.decorators p:DEBUG > hdr: [{'WRITE': {'start_time': 1740915920999.0, 'end_time': 1740916521000.0, 'stddev': 'percentile_99_9': 9.51, 'percentile_99_99': 36.77, 'percentile_99_999': 40.86, 'throughput': 17528}}]
< t:2025-03-02 12:24:41,370 f:tester.py       l:3816 c:PerformanceRegressionTest p:INFO  > Build HDR histogram (tags: ['WRITE-rt', 'READ-rt']) with start time: 1740915916.0573575, end time: 1740918249.8044257; for operation: mixed

< t:2025-03-02 12:24:51,824 f:hdrhistogram.py l:243  c:/home/ubuntu/scylla-cluster-tests/sdcm/utils/hdrhistogram.py p:DEBUG > The file '/home/ubuntu/sct-results/20250302-040951-961039/perf-latency-nemesis-ubuntu-loader-set-d492be8f/perf-latency-nemesis-ubuntu-loader-node-d492be8f-1/hdrh-cs-write-l1-c0-k1-b0d9fe7d-1a6c-4c1e-82fa-cce0605690ae.hdr' does not include the time interval from `1740915916.0573575` to `1740918249.8044257`
< t:2025-03-02 12:25:15,127 f:commit_log_check_thread.py l:205  c:CommitLogCheckThread p:DEBUG > overflow_commit_log_directory: []
< t:2025-03-02 12:25:15,132 f:commit_log_check_thread.py l:214  c:CommitLogCheckThread p:DEBUG > zero_free_segments: []
< t:2025-03-02 12:35:15,145 f:commit_log_check_thread.py l:205  c:CommitLogCheckThread p:DEBUG > overflow_commit_log_directory: []
< t:2025-03-02 12:35:15,155 f:commit_log_check_thread.py l:214  c:CommitLogCheckThread p:DEBUG > zero_free_segments: []
< t:2025-03-02 12:45:15,160 f:commit_log_check_thread.py l:205  c:CommitLogCheckThread p:DEBUG > overflow_commit_log_directory: []

We need more debug logging here to pinpoint where the iteration hangs.
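Beyond extra logging, a hard per-file deadline would turn a two-day hang into a fast, attributable failure. A sketch using only the standard library (parse_with_timeout and the 300 s default are assumptions, not an existing SCT API):

```python
import concurrent.futures


def parse_with_timeout(parse_fn, hdr_file, hdr_tag, timeout_s=300.0):
    """Run a single-file parse in a worker thread with a hard deadline.

    Returns the parse result, or None if it exceeds timeout_s. A genuinely
    hung thread cannot be killed in Python, so this only unblocks the
    caller, letting the summary build log the offending file and move on.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(parse_fn, hdr_file, hdr_tag)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return None
    finally:
        # Do not block waiting for a possibly-hung worker to finish.
        pool.shutdown(wait=False)
```

The loop in _build_histogram_from_dir could call this instead of _build_histogram_from_file directly and treat a None result like an empty file: log an error and continue.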

Impact

The test gets stuck and times out after 2 days.

How frequently does it reproduce?

I have seen it a few times.

Installation details

Cluster size: 3 nodes (i3en.2xlarge)

Scylla Nodes used in this run:

  • perf-latency-nemesis-ubuntu-db-node-d492be8f-7 (34.205.246.108 | 10.12.1.104) (shards: 7)
  • perf-latency-nemesis-ubuntu-db-node-d492be8f-6 (44.199.201.208 | 10.12.1.75) (shards: 7)
  • perf-latency-nemesis-ubuntu-db-node-d492be8f-5 (44.195.1.81 | 10.12.1.41) (shards: 7)
  • perf-latency-nemesis-ubuntu-db-node-d492be8f-4 (3.236.155.65 | 10.12.2.157) (shards: 7)
  • perf-latency-nemesis-ubuntu-db-node-d492be8f-3 (18.215.186.105 | 10.12.2.113) (shards: 7)
  • perf-latency-nemesis-ubuntu-db-node-d492be8f-2 (18.213.193.244 | 10.12.0.225) (shards: -1)
  • perf-latency-nemesis-ubuntu-db-node-d492be8f-1 (18.209.237.195 | 10.12.0.160) (shards: 7)

OS / Image: ami-09240a6402ddbb8ce (aws: undefined_region)

Test: scylla-enterprise-perf-regression-latency-650gb-with-nemesis-rbno-disabled
Test id: d492be8f-f802-416d-bf31-dc236166c832
Test name: scylla-enterprise/perf-regression/scylla-enterprise-perf-regression-latency-650gb-with-nemesis-rbno-disabled
Test method: performance_regression_test.PerformanceRegressionTest.test_latency_mixed_with_nemesis
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor d492be8f-f802-416d-bf31-dc236166c832
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs d492be8f-f802-416d-bf31-dc236166c832

Logs:

Jenkins job URL
Argus
