Skip to content

fix(seeds): change default seeds_selector from 'all' to 'random' 3 nodes#15180

Draft
roydahan wants to merge 3 commits into
masterfrom
fix-seeds-selector-default
Draft

fix(seeds): change default seeds_selector from 'all' to 'random' 3 nodes#15180
roydahan wants to merge 3 commits into
masterfrom
fix-seeds-selector-default

Conversation

@roydahan

@roydahan roydahan commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Changes the default seed selection from all nodes to up to 3 random seed nodes
  • Caps seeds_num by the current cluster size so 1-node/2-node tests remain valid with the default seeds_num: 3
  • Makes set_seeds() replace the seed set instead of accumulating stale seed flags across reselection
  • Adds unit coverage for capped random seed selection and seed flag replacement

Motivation

According to the ScyllaDB official docs on seed nodes:

Since ScyllaDB OSS 4.3 / Enterprise 2021.1, a seed node is only used as the initial contact point for a new node joining the cluster to discover ring topology. Once the nodes have joined the cluster, the seed node has no function.

The previous default of seeds_selector: "all" marks every node in the cluster as a seed. This is unnecessary for modern ScyllaDB and was explicitly forbidden in older versions, where at least one non-seed node was required.

Using only one seed is minimal, but it is fragile in SCT: several nemesis operations can terminate, decommission, remove, or replace the seed node. If that happens before a new node is added, SCT may no longer have a valid seed node configured for the new node bootstrap.

Change

Before After
seeds_selector default "all" "random"
seeds_num default 1 (ignored when selector is "all") 3

With seeds_selector: "random" and seeds_num: 3, SCT keeps a bounded set of seed nodes that is more resilient to nemesis operations than a single seed, while avoiding the unnecessary all-nodes-as-seeds configuration.

Small clusters

seeds_num now behaves as a maximum for first/random selection. If a cluster has fewer than 3 nodes, SCT selects all available nodes instead of failing config validation or random.sample().

Testing

  • uv run sct.py unit-tests -t unit/test_seed_selector.py

According to ScyllaDB official docs (https://opensource.docs.scylladb.com/stable/kb/seed-nodes.html),
since ScyllaDB OSS 4.3 / Enterprise 2021.1 seeds are only used as the initial contact point
for nodes joining the cluster. Once a node has joined, the seed designation has no ongoing
function. Marking every node as a seed ('all') is therefore unnecessary, and was explicitly
forbidden in older ScyllaDB versions where at least one non-seed node was required.

Changing the default to 'first' (with seeds_num=1) aligns with the minimal-seed recommendation
from the official docs while remaining safe for all supported ScyllaDB versions.
@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

Updates the default test configuration to use seeds_selector: "first" instead of "all". Expands SCTConfiguration documentation for seeds_selector and seeds_num to clarify seed-node semantics. Adds decommissioning logic that prevents invalid seed states when the target node is the last seed by selecting an alternate seed candidate and adjusting the replacement node's seed flag accordingly.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

Bug

Suggested reviewers

  • fruch
  • cezarmoise
🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ⚠️ Warning The title states 'change default seeds_selector from 'all' to 'random' 3 nodes' but the actual change is to 'first', not 'random'. This is misleading and does not match the implementation. Update the title to accurately reflect the change: 'fix(seeds): change default seeds_selector from 'all' to 'first' with seeds_num: 1'
✅ Passed checks (3 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed PR description is complete with summary, motivation, changes, and testing details.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@scylladb-promoter

scylladb-promoter commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator

❌ Test Summary: FAILED

❌ Precommit: FAILED

Total Passed Failed Skipped
50 16 1 33

Failed Hooks

update-conf-docs

Output
- hook id: update-conf-docs
- files were modified by this hook

/usr/local/lib/python3.14/site-packages/azure/mgmt/compute/models/_models.py:12585: SyntaxWarning: "\W" is an invalid escape sequence. Such sequences will not work in the future. Did you mean "\\W"? A raw string is also an option.
  <br>Has upper characters <br> Has a digit <br> Has a special character (Regex match [\W_])
/usr/local/lib/python3.14/site-packages/azure/mgmt/compute/models/_models.py:12695: SyntaxWarning: "\W" is an invalid escape sequence. Such sequences will not work in the future. Did you mean "\\W"? A raw string is also an option.
  <br>Has upper characters <br> Has a digit <br> Has a special character (Regex match [\W_])
/usr/local/lib/python3.14/site-packages/azure/mgmt/compute/models/_models.py:23850: SyntaxWarning: "\W" is an invalid escape sequence. Such sequences will not work in the future. Did you mean "\\W"? A raw string is also an option.
  <br>Has upper characters <br> Has a digit <br> Has a special character (Regex match [\W_])
docs written into docs/configuration_options.md

✅ Tests: PASSED

Total Passed Failed Errors Skipped
3656 3625 0 0 31

Full build log

@roydahan

Copy link
Copy Markdown
Contributor Author

Still need to verify that in case the seed node was terminated/removed/decommissioned by a nemesis, we replace the seed.

In the past we had special handling for seed nodes, I don't know if it's still the case.

With a single-seed default, generic decommission nemesis can target the only seed node.
Transfer seed responsibility to another live node before decommissioning the last seed
so adding a replacement node still has a valid seed_provider to use during bootstrap.
@roydahan roydahan marked this pull request as draft June 23, 2026 16:38

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@sdcm/nemesis/__init__.py`:
- Around line 1295-1305: The seed node update logic in this block (calling
set_seed_flag() and update_seed_provider()) does not work effectively on
Kubernetes where seed_nodes is empty and these operations are no-ops, which
could leave the cluster without a seed node. Add a check to prevent this branch
from executing when running on Kubernetes or when seed providers are not
available, ensuring that the decommission/terminate operation does not proceed
if it would eliminate the last seed node and the seed updates would not actually
take effect due to the cluster backend limitations.
- Around line 1296-1306: The seed candidate selection in the block starting with
the seed_candidates list comprehension uses raw cluster.nodes without validating
the node's availability or allocation status, which risks selecting a node that
is down or already under another nemesis operation. Replace the direct
cluster.nodes filtering with NemesisNodeAllocator to properly reserve the new
seed node, ensuring you filter for up-normal nodes only, and skip the seed
promotion operation if no suitable candidates are available through the
allocator. This prevents the decommissioning logic from accidentally eliminating
the last seed by promoting an unavailable node.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: a69696f8-b25d-4523-b0f3-91ad97c267d9

📥 Commits

Reviewing files that changed from the base of the PR and between b45985c and 1e1390d.

📒 Files selected for processing (1)
  • sdcm/nemesis/__init__.py

Comment thread sdcm/nemesis/__init__.py Outdated
Comment thread sdcm/nemesis/__init__.py Outdated
Use up to three randomly selected seed nodes by default so nemesis operations that remove or replace one seed still leave other seed contact points available. Cap the requested seed count by the current cluster size so small clusters remain valid, and replace stale seed flags when reselection runs.
@roydahan roydahan changed the title fix(seeds): change default seeds_selector from 'all' to 'first' fix(seeds): change default seeds_selector from 'all' to 'random' 3 nodes Jun 23, 2026
@fruch

fruch commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Still need to verify that in case the seed node was terminated/removed/decommissioned by a nemesis, we replace the seed.

In the past we had special handling for seed nodes, I don't know if it's still the case.

We don't, but this first needs to be reviewed and tested before enabling it by default.

We should add one more option to select 2 or 3, since we can't greentree the one selected will be up (parallel nemesis case, even that we don't do topology changes in parallel), since the safe recommended option would be to have couple of nodes as contact points for fallback if needed (even if it's a short while)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants