improvement(health-check): add plan for optimized health checks on large clusters #13182
base: master
Conversation
First, put the plan into /doc/plans/ as a markdown file. The claim that nodetool status is the quickest isn't accurate, since we cross-check each node's report against the others to be able to detect split-brain situations. Maybe we can apply some heuristics and check only a majority of the nodes; would that be enough? Also, I find it weird that the Raft check takes more time; we should plan to build an integration test to help evaluate and improve each of the methods.
Created detailed plan in `docs/plans/health-check-optimization.md`. Key corrections made:
Raft investigation plan added:
The integration test plan includes 7 test cases to evaluate and improve each method, with specific expected outcomes and metrics to collect.
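To make the majority-check heuristic discussed above concrete, here is a minimal sketch of what it could look like; `get_nodetool_status` is a hypothetical helper (not an existing SCT function) assumed to return each peer's state as seen from one node.

```python
import random
from collections import Counter


def majority_status_check(nodes, get_nodetool_status):
    """Cross-check `nodetool status` views from a simple majority of nodes.

    Instead of comparing every node's report against every other node's
    (quadratic on a 60-node cluster), query just over half of the nodes and
    flag peers whose state the sampled views do not agree on -- disagreement
    between views is the split-brain symptom the full check looks for.
    """
    sample_size = min(len(nodes), len(nodes) // 2 + 1)
    sampled = random.sample(nodes, sample_size)

    # Each view maps peer address -> state string ("UN", "DN", ...).
    views = [get_nodetool_status(node) for node in sampled]

    suspects = {}
    all_peers = {peer for view in views for peer in view}
    for peer in all_peers:
        states = Counter(view.get(peer, "missing") for view in views)
        state, votes = states.most_common(1)[0]
        if state != "UN" or votes < len(views):
            suspects[peer] = dict(states)
    return suspects
```

Whether a simple majority is enough is exactly what the miss-rate estimates quoted further down in the review try to quantify.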
fix pre-commit formatting errors
@vponomaryov can you review this plan?
@scylladb/qa-maintainers @temichus @aleksbykov comments about this plan are welcome (see the markdown file it's introducing).
soyacz
left a comment
I like the approach of having plans and reviews like this before implementing.
At this stage I would say we don't need sections:
- Recommended Configurations
- Implementation Phases & Timeline
- Integration Test Plan (commented separately)
vponomaryov
left a comment
LGTM, just need to squash commits into 1
- Address review feedback: moved plan from PR description to docs/plans/
- Corrected nodetool status analysis: requires cross-checking for split-brain
- Added investigation plan for unexpectedly slow Raft checks
- Added integration test plan to measure and optimize each method
- Documented trade-offs between speed and split-brain detection
- Added majority validation heuristic for balanced approach
- Included 7 integration test cases with expected outcomes

Co-authored-by: fruch <[email protected]>
Force-pushed from f78501b to cec6233.
start the first phase of the plan
Phase 1 implementation will be done in a separate PR focused on code changes. This PR contains the optimization plan document that has been reviewed and approved; the plan serves as the specification for the implementation work. To begin Phase 1, I'll need to create a new PR that implements the plan.
- `status_only`: ~20% miss rate (acceptable for quick checks?)
- `majority_status`: ~5% miss rate (acceptable for normal checks?)

2. **Should we cache Raft group0 state?**
No, caching the status check is counter-productive; we care about the current status, not the status from the last time we checked.
Removed the caching question in commit ef39d28. Health checks now always fetch current status, not stale cached data.
- Pro: Faster checks
- Con: May show stale data if topology changes rapidly

3. **Should sampling be deterministic or random?**
Something in the middle: we should have some nodes that are favored, but we should still aim to randomize the ones that aren't favored.
Updated the sampling question in commit ef39d28 to reflect a balanced approach: favor certain nodes (target, recently modified, same rack/region) while randomizing selection among the remaining nodes for better coverage.
Three key updates to the Open Questions section:

1. Remove caching question (comment 2710278999)
   - Removed question about caching Raft group0 state
   - Caching is counter-productive as we need current status, not stale data
   - Health checks must reflect real-time cluster state

2. Update sampling strategy question (comment 2710282161)
   - Changed from deterministic-vs-random binary choice
   - Now describes balanced approach: favor certain nodes + randomize others
   - Favored nodes: target node, recently modified, same rack/region
   - Randomize among remaining nodes for better coverage
   - Combines deterministic priority with random coverage

3. Add configuration suggestion tool question (comment 2710284993)
   - New question about providing a script/tool for config suggestions
   - Tool would suggest optimal config based on cluster size, test type, risk tolerance
   - Example command: `sct.py suggest-health-check-config --nodes 60 --dcs 5`
   - Helps users choose appropriate settings without trial-and-error

Renumbered remaining questions for consistency.

Addresses feedback from @fruch in comments 2710278999, 2710282161, 2710284993.

Co-authored-by: fruch <[email protected]>
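As a rough illustration of the balanced sampling described above (favored nodes always included, the rest chosen at random), a minimal sketch follows; `is_favored` stands in for whatever predicate would mark the nemesis target, recently modified nodes, or same-rack/region neighbours, and is purely hypothetical.

```python
import random


def sample_nodes_for_health_check(nodes, sample_size, is_favored):
    """Pick nodes to health-check: favored nodes first, random fill after.

    Favored nodes (e.g. the nemesis target, recently modified nodes, nodes in
    the same rack/region) are always included; the remaining quota is drawn
    randomly so that repeated checks eventually cover the whole cluster.
    """
    favored = [node for node in nodes if is_favored(node)]
    others = [node for node in nodes if not is_favored(node)]

    remaining_slots = max(0, sample_size - len(favored))
    return favored + random.sample(others, min(remaining_slots, len(others)))
```

This keeps the deterministic priority the reviewer asked for while still giving random coverage of the rest of the cluster over repeated checks.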
@copilot update the PR description with a short executive summary of the plan, and drop all the extra details
…trics tracking

Two key changes based on PR feedback:

1. Remove "Implementation Phases & Timeline" section (comment 2710294293)
   - Removed detailed week-by-week implementation schedule
   - Implementation will be tracked through separate PRs
   - Keeps plan focused on technical approach, not schedule

2. Enhance metrics tracking question (comment 2710292646)
   - Add specific guidance on collecting metrics into Adaptive Timeout system
   - Define issue labeling approach: use `detected-by:health-check` label
   - Clarify tracking challenges: false positives require manual validation
   - Explain how metrics enable auto-tuning and effectiveness validation

These changes keep the plan focused on the technical solution while adding practical guidance on metrics collection and issue tracking.

Addresses feedback from @fruch in comments 2710294293, 2710292646.

Co-authored-by: fruch <[email protected]>
Updated PR description with executive summary in commit 9835e90. New format:
The description now provides a concise overview while linking to the full plan document for details.
SCT Health Check Optimization Plan
TL;DR: Comprehensive plan to reduce health check time from 2+ hours to ~2-15 minutes on 60-node clusters through configurable modes, smart node sampling, and parallel execution.
Problem
Solution Overview
5 New Configuration Parameters:
- `cluster_health_check_mode` - Control check depth (full/status_only/majority_status/raft_only/sampled)
- `cluster_health_check_skip_if_nemesis_skipped` - Skip redundant checks
- `cluster_health_check_sample_nodes` - Topology-aware sampling (all/cluster_quorum/dc_quorum/rack_quorum/dc_one/rack_one)
- `cluster_health_check_exclude_running_nemesis` - Enable parallel nemesis scenarios
- `cluster_health_check_parallel_workers` - Concurrent execution

Expected Results:
Key Innovations:
What's in This PR
- `/docs/plans/health-check-optimization.md`

Implementation
Implementation will be done in separate PRs following the plan. This PR establishes the technical approach and configuration design.
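For illustration only (the actual implementation lands in later PRs), here is a sketch of how the proposed parameters might drive a sampled, parallel check. The parameter names come from the list above; the example values and the `check_node_health` helper are assumptions, not existing SCT code.

```python
from concurrent.futures import ThreadPoolExecutor

# Example values only -- real defaults and validation would live in sdcm/sct_config.py.
health_check_config = {
    "cluster_health_check_mode": "majority_status",        # full/status_only/majority_status/raft_only/sampled
    "cluster_health_check_skip_if_nemesis_skipped": True,  # don't re-check after a skipped nemesis
    "cluster_health_check_sample_nodes": "dc_quorum",      # all/cluster_quorum/dc_quorum/rack_quorum/dc_one/rack_one
    "cluster_health_check_exclude_running_nemesis": True,  # allow parallel nemesis scenarios
    "cluster_health_check_parallel_workers": 10,           # concurrent per-node checks
}


def run_health_checks(nodes, config, check_node_health):
    """Run per-node checks concurrently instead of one node at a time."""
    workers = config["cluster_health_check_parallel_workers"]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # check_node_health(node) is a stand-in for whichever check the
        # configured mode selects (status, Raft group0, schema, ...).
        return list(pool.map(check_node_health, nodes))
```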
Testing
PR pre-checks (self review)
- Relevant `backport` labels

Reminders
- New configuration options and defaults (`sdcm/sct_config.py`)
- Unit tests to cover the changes (under the `unit-test/` folder)

Original prompt
This section details the original issue you should resolve
<issue_title>Optimized SCT Health checks (Health checks take more than 2 hours running on 45 DB nodes setup after each nemesis, even skipped ones)</issue_title>
<issue_description>Find ways to lower the time it takes to do health checks
Packages
Scylla version:
`2024.3.0~dev-20241209.b5f1d87f3e83` with build-id `a322e4f0d7b174dd5052eb3992c8e459d1a03b7a`
Kernel Version:
`6.8.0-1019-aws`
Issue description
Running a 5-DC, 60 DB node test, the health checks take more than 2 hours after each of the nemeses:
Moreover, a redundant health check cycle runs even if the nemesis was skipped.
See Argus screenshot:

Impact
Significant waste of test run time.
How frequently does it reproduce?
1/1
Installation details
Cluster size: 60 nodes (i3en.large)
Scylla Nodes used in this run: