improvement(health-check): add plan for optimized health checks on large clusters #13182

Copilot · 2026-01-11T14:45:59Z

SCT Health Check Optimization Plan

TL;DR: Comprehensive plan to reduce health check time from 2+ hours to ~2-15 minutes on 60-node clusters through configurable modes, smart node sampling, and parallel execution.

Problem

Health checks take 2+ hours on 60-node clusters after each nemesis
Checks run even for skipped nemesis operations
Sequential execution with expensive API calls on all nodes

Solution Overview

5 New Configuration Parameters:

cluster_health_check_mode - Control check depth (full/status_only/majority_status/raft_only/sampled)
cluster_health_check_skip_if_nemesis_skipped - Skip redundant checks
cluster_health_check_sample_nodes - Topology-aware sampling (all/cluster_quorum/dc_quorum/rack_quorum/dc_one/rack_one)
cluster_health_check_exclude_running_nemesis - Enable parallel nemesis scenarios
cluster_health_check_parallel_workers - Concurrent execution

Expected Results:

Balanced approach: 120 min → 7-8 min (93% reduction) with good split-brain detection
Fast mode: 120 min → 2 min (98% reduction) for very large clusters

Key Innovations:

Smart node selection: prioritize target node, same rack/region, rotate to avoid blind spots
Configurable trade-offs between speed and validation depth
Support for parallel nemesis testing
Metrics collection for auto-tuning and effectiveness validation

What's in This PR

Detailed optimization plan in /docs/plans/health-check-optimization.md
5 configuration parameters with clear trade-offs
Smart selection heuristics for balanced speed/coverage
Reference performance test plan (7 test cases)
Raft check investigation plan
Configuration suggestion tool proposal
Metrics tracking guidance with issue labeling strategy

Implementation

Implementation will be done in separate PRs following the plan. This PR establishes the technical approach and configuration design.

Testing

N/A - This PR contains only the planning document. Testing will occur in implementation PRs.

PR pre-checks (self review)

I added the relevant backport labels
I didn't leave commented-out/debugging code

Reminders

Add New configuration option and document them (in sdcm/sct_config.py)
Add unit tests to cover my changes (under unit-test/ folder)
Update the Readme/doc folder relevant to this change (if needed)

Original prompt

This section details on the original issue you should resolve

<issue_title>Optimized SCT Health checks (Health checks take more than 2 hours running on 45 DB nodes setup after each nemesis, even skipped ones)</issue_title>
<issue_description>Find ways to lower the time it take to do healthchecks

skip them if nemesis was skipped
have option to only do nodetools status checks, or only raft checks, so it can be quicker

Packages

Scylla version: 2024.3.0~dev-20241209.b5f1d87f3e83 with build-id a322e4f0d7b174dd5052eb3992c8e459d1a03b7a

Kernel Version: 6.8.0-1019-aws

Issue description

Running 5dc, 60 DB nodes test the healch checks take more than 2 hours after each of the nemesis:

2024-12-09 21:19:27,050 ... c:sdcm.utils.health_checker p:DEBUG > Status for regular node long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-1
...
2024-12-09 22:21:44,490 ... c:sdcm.utils.health_checker p:DEBUG > Status for regular node long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-30
...
2024-12-09 23:27:39,576 ... c:sdcm.utils.health_checker p:DEBUG > Status for regular node long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-60
...
...
2024-12-09 23:29:50,603 ... c:sdcm.utils.health_checker p:DEBUG > Status for regular node long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-1
...
2024-12-10 01:38:26,928 ... c:sdcm.utils.health_checker p:DEBUG > Status for regular node long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-60
...
...
2024-12-10 01:40:40,444 ... c:sdcm.utils.health_checker p:DEBUG > Status for regular node long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-1
...
2024-12-10 02:43:18,741 ... c:sdcm.utils.health_checker p:DEBUG > Status for regular node long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-30
...
2024-12-10 03:49:01,819 ... c:sdcm.utils.health_checker p:DEBUG > Status for regular node long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-60

Moreover, redundant health checks cycle gets runs even if nemesis was skipped.

See Argus screenshot:

Impact

Significant waste of a test run time.

How frequently does it reproduce?

1/1

Installation details

Cluster size: 60 nodes (i3en.large)

Scylla Nodes used in this run:

long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-9 (3.78.242.164 | 10.11.4.26) (shards: -1)
long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-8 (18.184.154.54 | 10.11.4.150) (shards: 2)
long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-7 (18.195.241.190 | 10.11.6.161) (shards: 2)
long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-60 (18.204.213.63 | 10.12.4.51) (shards: 2)
long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-6 (18.156.122.174 | 10.11.7.240) (shards: 2)
long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-59 (52.202.210.200 | 10.12.5.121) (shards: 2)
long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-58 (18.212.49.124 | 10.12.5.222) (shards: 2)
long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-57 (18.205.27.190 | 10.12.4.178) (shards: 2)
long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-56 (3.86.145.120 | 10.12.4.208) (shards: 2)
long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-55 (3.86.180.35 | 10.12.4.253) (shards: 2)
long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-54 (54.172.31.83 | 10.12.4.133) (shards: 2)
long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-53 (54.152.61.101 | 10.12.7.65) (shards: 2)
long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-52 (3.84.51.77 | 10.12.7.208) (shards: 2)
long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-51 (3.80.131.177 | 10.12.5.152) (shards: 2)
long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-50 (44.203.148.120 | 10.12.7.175) (shards: 2)
long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-5 (3.69.165.13 | 10.11.4.30) (shards: 2)
long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-49 (54.83.154.123 | 10.12.4.235) (shards: 2)
long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-48 (35.178.205.245 | 10.3.7.244) (shards: 2)
long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-47 (13.41.185.36 | 10.3.4.203) (shards: 2)
long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-46 (18.133.242.167 | 10.3.6.76) (shards: 2)
long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-45 (13.40.198.57 | 10.3.4.39) (shards: 2)
long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-44 (3.10.208.246 | 10.3.7.212) (shards: 2)
long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-43 (18.175.190.20 | 10.3.6.14) (shards: 2)
long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-42 (35.179.106.42 | 10.3.7.1) (shards: 2)
long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-41 (13.41.163.90 | 10.3.7.29) (shards: 2)
long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-40 (13.41.188.217 | 10.3.5.48) (shards: 2)
long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-4 (3.122.228.171 | 10.11.4.86) (shards: 2)
long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-39 (13.41.200.137 | 10.3.6.155) (shards: 2)
long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f-38 (35.179.176.200 | 10.3.6.185) (shards: 2)
long-custom-d2-wrkld1-5dc-dev-db-node-54f56c8f...

Fixes Optimized SCT Health checks (Health checks take more than 2 hours running on 45 DB nodes setup after each nemesis, even skipped ones) #9547

💬 We'd love your input! Share your thoughts on Copilot coding agent in our 2 minute survey.

fruch · 2026-01-11T17:09:17Z

@copilot

First put the plan into /doc/plans/ as a markdown file

The fact that nodetool status is the quickest isn't accurate since we are cross checking for each node report against each other, to be able to detect splitbrain situations.

Maybe we can apply some heuristics, and check majority of the nodes only, would that be enough.

Also I find it weird that the raft check is taking more time, we. should plan building an integration test to help evaluate it and improve each of the methods

Copilot · 2026-01-11T17:15:42Z

@copilot

First put the plan into /doc/plans/ as a markdown file...

Created detailed plan in /docs/plans/health-check-optimization.md (commit a69e73d).

Key corrections made:

Fixed nodetool status analysis: it requires cross-checking multiple nodes for split-brain detection, not just a single call
Added majority_status mode that balances speed with split-brain detection by checking consensus among >50% of nodes
Trade-off clearly documented: status_only is fast but may miss split-brain, majority_status provides good detection with 87% time reduction

Raft investigation plan added:

TC4 in integration test plan specifically measures Raft check timing
Instrumentation to identify bottleneck (query time, parsing, connection overhead)
Plan to compare with nodetool alternatives
Will test at 10, 30, and 60 node scales to identify if it scales linearly

The integration test plan includes 7 test cases to evaluate and improve each method with specific expected outcomes and metrics to collect.

fruch · 2026-01-11T21:28:07Z

@copilot

fix pre-commit formatting errors

docs/plans/health-check-optimization.md

Copilot · 2026-01-11T21:30:55Z

@copilot

fix pre-commit formatting errors

Fixed trailing whitespace formatting errors in commit 184d6f6.

Removed trailing whitespace from 11 lines in the plan document to comply with pre-commit checks.

fruch · 2026-01-11T21:36:18Z

@vponomaryov can you take review this plan ?

fruch · 2026-01-14T11:15:51Z

@scylladb/qa-maintainers @temichus @aleksbykov comments abut this plan are welcomed (see the markdown it's introducing

soyacz

I like the approach for having plans and review like this before implementing.

At this stage I would say we don't need sections:

Recommended Configurations
Implementation Phases & Timeline
Integration Test Plan (commented separately)

docs/plans/health-check-optimization.md

vponomaryov

LGTM, just need to squash commits into 1

- Address review feedback: moved plan from PR description to docs/plans/ - Corrected nodetool status analysis: requires cross-checking for split-brain - Added investigation plan for unexpectedly slow Raft checks - Added integration test plan to measure and optimize each method - Documented trade-offs between speed and split-brain detection - Added majority validation heuristic for balanced approach - Included 7 integration test cases with expected outcomes Co-authored-by: fruch <[email protected]>

fruch · 2026-01-20T18:41:02Z

@copilot

start the first phase of the plan

Copilot · 2026-01-20T18:43:03Z

@copilot

start the first phase of the plan

Phase 1 implementation will be done in a separate PR focused on code changes. This PR contains the optimization plan document that has been reviewed and approved. The plan serves as the specification for the implementation work.

To begin Phase 1 implementation, I'll need to create a new PR that implements the cluster_health_check_skip_if_nemesis_skipped feature as described in the plan.

fruch · 2026-01-20T22:22:37Z

docs/plans/health-check-optimization.md

+   - `status_only`: ~20% miss rate (acceptable for quick checks?)
+   - `majority_status`: ~5% miss rate (acceptable for normal checks?)
+
+2. **Should we cache Raft group0 state?**


no, caching status check is counter productive, we care about current status, not status from last time we checked

Removed the caching question in commit ef39d28. Health checks now always fetch current status, not stale cached data.

fruch · 2026-01-20T22:23:40Z

docs/plans/health-check-optimization.md

+   - Pro: Faster checks
+   - Con: May show stale data if topology changes rapidly
+
+3. **Should sampling be deterministic or random?**


something in the middle, we should have some nodes that should be favored, but still we should aim to randomize the ones aren't favored

Updated sampling question in commit ef39d28 to reflect balanced approach: favor certain nodes (target, recently modified, same rack/region) while randomizing selection among remaining nodes for better coverage.

docs/plans/health-check-optimization.md

@fruch

Three key updates to the Open Questions section: 1. Remove caching question (comment 2710278999) - Removed question about caching Raft group0 state - Caching is counter-productive as we need current status, not stale data - Health checks must reflect real-time cluster state 2. Update sampling strategy question (comment 2710282161) - Changed from deterministic-vs-random binary choice - Now describes balanced approach: favor certain nodes + randomize others - Favored nodes: target node, recently modified, same rack/region - Randomize among remaining nodes for better coverage - Combines deterministic priority with random coverage 3. Add configuration suggestion tool question (comment 2710284993) - New question about providing a script/tool for config suggestions - Tool would suggest optimal config based on cluster size, test type, risk tolerance - Example command: `sct.py suggest-health-check-config --nodes 60 --dcs 5` - Helps users choose appropriate settings without trial-and-error Renumbered remaining questions for consistency. Addresses feedback from @fruch in comments 2710278999, 2710282161, 2710284993. Co-authored-by: fruch <[email protected]>

docs/plans/health-check-optimization.md

fruch · 2026-01-20T22:29:37Z

@copilot update the PR description with short executive summary of the plan, and drop all the extra details

@fruch

…trics tracking Two key changes based on PR feedback: 1. Remove "Implementation Phases & Timeline" section (comment 2710294293) - Removed detailed week-by-week implementation schedule - Implementation will be tracked through separate PRs - Keeps plan focused on technical approach, not schedule 2. Enhance metrics tracking question (comment 2710292646) - Add specific guidance on collecting metrics into Adaptive Timeout system - Define issue labeling approach: use `detected-by:health-check` label - Clarify tracking challenges: false positives require manual validation - Explain how metrics enable auto-tuning and effectiveness validation These changes keep the plan focused on the technical solution while adding practical guidance on metrics collection and issue tracking. Addresses feedback from @fruch in comments 2710294293, 2710292646. Co-authored-by: fruch <[email protected]>

Copilot · 2026-01-20T22:33:28Z

@copilot update the PR description with short executive summary of the plan, and drop all the extra details

Updated PR description with executive summary in commit 9835e90.

New format:

TL;DR with problem/solution/results
5 configuration parameters overview
Expected results: 93-98% time reduction
Key innovations highlighted
Removed verbose implementation details and checklists

The description now provides a concise overview while linking to the full plan document for details.

Copilot AI assigned Copilot and fruch Jan 11, 2026

Copilot started work on behalf of fruch January 11, 2026 14:46 View session

Copilot AI changed the title ~~[WIP] Optimize SCT health checks to reduce execution time~~ improvement(health-check): add plan for optimized health checks on large clusters Jan 11, 2026

Copilot AI requested a review from fruch January 11, 2026 14:50

Copilot finished work on behalf of fruch January 11, 2026 14:50

Copilot started work on behalf of fruch January 11, 2026 17:09 View session

Copilot finished work on behalf of fruch January 11, 2026 17:17

Copilot started work on behalf of fruch January 11, 2026 21:28 View session

fruch reviewed Jan 11, 2026

View reviewed changes

docs/plans/health-check-optimization.md Outdated Show resolved Hide resolved

Copilot AI requested a review from fruch January 11, 2026 21:31

Copilot finished work on behalf of fruch January 11, 2026 21:31

Copilot started work on behalf of fruch January 11, 2026 21:32 View session

fruch requested a review from vponomaryov January 11, 2026 21:35

Copilot finished work on behalf of fruch January 11, 2026 21:35

soyacz reviewed Jan 14, 2026

View reviewed changes

docs/plans/health-check-optimization.md Outdated Show resolved Hide resolved

docs/plans/health-check-optimization.md Show resolved Hide resolved

docs/plans/health-check-optimization.md Outdated Show resolved Hide resolved

vponomaryov reviewed Jan 14, 2026

View reviewed changes

fruch added migrate_to_jira size/large P3 Medium Priority labels Jan 18, 2026

Copilot started work on behalf of fruch January 20, 2026 08:04 View session

Copilot finished work on behalf of fruch January 20, 2026 08:09

Copilot started work on behalf of fruch January 20, 2026 08:10 View session

Copilot finished work on behalf of fruch January 20, 2026 08:16

Copilot started work on behalf of fruch January 20, 2026 08:17 View session

Copilot finished work on behalf of fruch January 20, 2026 08:22

vponomaryov reviewed Jan 20, 2026

View reviewed changes

fruch force-pushed the copilot/optimize-health-checks-duration branch from f78501b to cec6233 Compare January 20, 2026 18:39

fruch requested a review from soyacz January 20, 2026 18:40

Copilot started work on behalf of fruch January 20, 2026 18:41 View session

Copilot finished work on behalf of fruch January 20, 2026 18:44

fruch marked this pull request as ready for review January 20, 2026 22:20

fruch reviewed Jan 20, 2026

View reviewed changes

docs/plans/health-check-optimization.md Show resolved Hide resolved

Copilot started work on behalf of fruch January 20, 2026 22:25 View session

fruch reviewed Jan 20, 2026

View reviewed changes

docs/plans/health-check-optimization.md Outdated Show resolved Hide resolved

fruch reviewed Jan 20, 2026

View reviewed changes

docs/plans/health-check-optimization.md Outdated Show resolved Hide resolved

Copilot finished work on behalf of fruch January 20, 2026 22:31

Copilot AI requested a review from fruch January 20, 2026 22:31

Copilot started work on behalf of fruch January 20, 2026 22:31 View session

Copilot finished work on behalf of fruch January 20, 2026 22:34

improvement(health-check): add plan for optimized health checks on large clusters #13182

Are you sure you want to change the base?

improvement(health-check): add plan for optimized health checks on large clusters #13182

Uh oh!

Conversation

Copilot AI commented Jan 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

SCT Health Check Optimization Plan

Problem

Solution Overview

What's in This PR

Implementation

Testing

PR pre-checks (self review)

Reminders

Packages

Issue description

Impact

How frequently does it reproduce?

Installation details

Uh oh!

fruch commented Jan 11, 2026

Uh oh!

Copilot AI commented Jan 11, 2026

Uh oh!

fruch commented Jan 11, 2026

Uh oh!

Uh oh!

Copilot AI commented Jan 11, 2026

Uh oh!

fruch commented Jan 11, 2026

Uh oh!

fruch commented Jan 14, 2026

Uh oh!

soyacz left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

vponomaryov left a comment

Choose a reason for hiding this comment

Uh oh!

fruch commented Jan 20, 2026

Uh oh!

Copilot AI commented Jan 20, 2026

Uh oh!

fruch Jan 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 20, 2026

Choose a reason for hiding this comment

Uh oh!

fruch Jan 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 20, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

fruch commented Jan 20, 2026

Uh oh!

Copilot AI commented Jan 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Copilot AI commented Jan 11, 2026 •

edited

Loading