fix(seeds): change default seeds_selector from 'all' to 'random' 3 nodes#15180
roydahan wants to merge 3 commits into
According to ScyllaDB official docs (https://opensource.docs.scylladb.com/stable/kb/seed-nodes.html), since ScyllaDB OSS 4.3 / Enterprise 2021.1 seeds are only used as the initial contact point for nodes joining the cluster. Once a node has joined, the seed designation has no ongoing function. Marking every node as a seed ('all') is therefore unnecessary, and was explicitly forbidden in older ScyllaDB versions where at least one non-seed node was required. Changing the default to 'first' (with seeds_num=1) aligns with the minimal-seed recommendation from the official docs while remaining safe for all supported ScyllaDB versions.
❌ Test Summary: FAILED
❌ Precommit: FAILED
✅ Tests: PASSED
We still need to verify that if the seed node is terminated, removed, or decommissioned by a nemesis, we replace the seed. In the past we had special handling for seed nodes; I don't know if that's still the case.
With a single-seed default, generic decommission nemesis can target the only seed node. Transfer seed responsibility to another live node before decommissioning the last seed so adding a replacement node still has a valid seed_provider to use during bootstrap.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@sdcm/nemesis/__init__.py`:
- Around line 1295-1305: The seed node update logic in this block (calling
set_seed_flag() and update_seed_provider()) does not work effectively on
Kubernetes where seed_nodes is empty and these operations are no-ops, which
could leave the cluster without a seed node. Add a check to prevent this branch
from executing when running on Kubernetes or when seed providers are not
available, ensuring that the decommission/terminate operation does not proceed
if it would eliminate the last seed node and the seed updates would not actually
take effect due to the cluster backend limitations.
- Around line 1296-1306: The seed candidate selection in the block starting with
the seed_candidates list comprehension uses raw cluster.nodes without validating
the node's availability or allocation status, which risks selecting a node that
is down or already under another nemesis operation. Replace the direct
cluster.nodes filtering with NemesisNodeAllocator to properly reserve the new
seed node, ensuring you filter for up-normal nodes only, and skip the seed
promotion operation if no suitable candidates are available through the
allocator. This prevents the decommissioning logic from accidentally eliminating
the last seed by promoting an unavailable node.
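A minimal sketch of the guard the two comments above ask for. Everything here is hypothetical and for illustration only: `Node`, `plan_seed_handoff`, and the `supports_seed_updates` flag are not SCT's API, and a real fix would reserve the new seed through `NemesisNodeAllocator`, whose interface is not shown in this review.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    # Hypothetical stand-in for a cluster node; not the real SCT class.
    name: str
    is_seed: bool = False
    is_up_normal: bool = True
    running_nemesis: Optional[str] = None


def plan_seed_handoff(nodes, target, supports_seed_updates=True, rng=random):
    """Decide what to do about seeds before `target` is removed.

    Returns ("not_needed", None) if removing `target` leaves a seed behind,
    ("promote", node) if `node` should become a seed first, or
    ("abort", None) if the last seed would be lost with no safe replacement.
    """
    other_seeds = [n for n in nodes if n is not target and n.is_seed]
    if not target.is_seed or other_seeds:
        return ("not_needed", None)

    # Assumption: on backends where seed updates are no-ops (the review
    # mentions Kubernetes), promoting a node would not actually take effect.
    if not supports_seed_updates:
        return ("abort", None)

    # Only consider nodes that are up-normal and not held by another nemesis.
    candidates = [
        n for n in nodes
        if n is not target and n.is_up_normal and n.running_nemesis is None
    ]
    if not candidates:
        return ("abort", None)
    return ("promote", rng.choice(candidates))
```

The three-way result keeps "nothing to do" separate from "unsafe to continue", so a caller can skip the nemesis instead of silently removing the last seed.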
📒 Files selected for processing (1)
sdcm/nemesis/__init__.py
Use up to three randomly selected seed nodes by default so nemesis operations that remove or replace one seed still leave other seed contact points available. Cap the requested seed count by the current cluster size so small clusters remain valid, and replace stale seed flags when reselection runs.
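As a rough illustration of "replace stale seed flags when reselection runs", the sketch below clears every flag that is not part of the new selection. The `Node` record and `replace_seed_flags` helper are hypothetical names used only for this example; they are not the actual `set_seeds()` implementation from the PR.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical stand-in for a cluster node; not the real SCT class.
    name: str
    is_seed: bool = False


def replace_seed_flags(nodes, new_seeds):
    """Mark exactly `new_seeds` as seeds and clear flags from earlier runs."""
    chosen = {id(node) for node in new_seeds}
    for node in nodes:
        # Assigning the flag on every node (not only the new seeds) is what
        # prevents stale seeds from accumulating across reselections.
        node.is_seed = id(node) in chosen
    return [node for node in nodes if node.is_seed]
```

Setting the flag only on the newly chosen nodes would leave the old seeds marked, which is the accumulation the commit message describes.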
We don't. We should add one more option to select 2 or 3 seeds, since we can't guarantee that the selected one will be up (the parallel-nemesis case, even though we don't do topology changes in parallel). The safe, recommended option is to have a couple of nodes as contact points for fallback if needed, even if only for a short while.
Summary
- Cap `seeds_num` by the current cluster size so 1-node/2-node tests remain valid with the default `seeds_num: 3`
- Make `set_seeds()` replace the seed set instead of accumulating stale seed flags across reselection

Motivation
According to the ScyllaDB official docs on seed nodes:

The previous default of `seeds_selector: "all"` marks every node in the cluster as a seed. This is unnecessary for modern ScyllaDB and was explicitly forbidden in older versions, where at least one non-seed node was required.

Using only one seed is minimal, but it is fragile in SCT: several nemesis operations can terminate, decommission, remove, or replace the seed node. If that happens before a new node is added, SCT may no longer have a valid seed node configured for the new node's bootstrap.

Change
- `seeds_selector` default: changes from `"all"` to `"random"`
- `seeds_num` default: changes from `1` (ignored when selector is `"all"`) to `3`

With `seeds_selector: "random"` and `seeds_num: 3`, SCT keeps a bounded set of seed nodes that is more resilient to nemesis operations than a single seed, while avoiding the unnecessary all-nodes-as-seeds configuration.

Small clusters
`seeds_num` now behaves as a maximum for `first`/`random` selection. If a cluster has fewer than 3 nodes, SCT selects all available nodes instead of failing config validation or `random.sample()`.

Testing
`uv run sct.py unit-tests -t unit/test_seed_selector.py`
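The small-cluster behaviour described above, where `seeds_num` acts as a maximum, can be sketched as follows. `select_seeds` is a hypothetical helper written for this example, not the function changed in the PR; only the selector names (`all`, `first`, `random`) and the default of 3 come from the description.

```python
import random


def select_seeds(nodes, selector="random", seeds_num=3, rng=random):
    """Pick seed nodes, treating `seeds_num` as an upper bound."""
    nodes = list(nodes)
    if selector == "all":
        return nodes

    # Cap the request by the cluster size: random.sample() raises ValueError
    # when asked for more items than the population holds.
    count = min(seeds_num, len(nodes))

    if selector == "first":
        return nodes[:count]
    if selector == "random":
        return rng.sample(nodes, count)
    raise ValueError(f"unknown seeds_selector: {selector!r}")
```

With a two-node cluster and the default `seeds_num` of 3, this returns both nodes instead of raising; passing a seeded `random.Random` instance as `rng` makes the choice reproducible in tests.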