
scylla-node: start: do not preserve smp and memory passed via env or cmd_line #718

Open
bhalevy wants to merge 1 commit into scylladb:master from bhalevy:make-smp-and-memory-cmd-line-args-transient

Conversation

@bhalevy
Member

@bhalevy bhalevy commented Feb 10, 2026

If a test wants to preserve memory and/or smp across node restarts, it should use the methods that were intended for this: set_smp and set_mem_mb_per_cpu.

Preserving those implicitly when passed to start was not intended.
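
The intended split between one-off and persistent settings can be sketched as follows. `ToyNode` is a hypothetical stand-in, not ccm's `ScyllaNode`; only the method names `set_smp` and `set_mem_mb_per_cpu` come from the description above, and the default values are made up for illustration.

```python
class ToyNode:
    """Hypothetical stand-in for a node; not the real ccmlib ScyllaNode."""

    def __init__(self, smp=2, mem_mb_per_cpu=512):
        # Persistent settings: they survive restarts until changed explicitly.
        self._smp = smp
        self._mem_mb_per_cpu = mem_mb_per_cpu

    def set_smp(self, smp):
        self._smp = smp

    def set_mem_mb_per_cpu(self, mem_mb):
        self._mem_mb_per_cpu = mem_mb

    def start(self, jvm_args=()):
        """Return the effective (smp, memory_mb) for this start only."""
        args = list(jvm_args)
        smp = self._smp
        if "--smp" in args:
            # One-off override: used for this call, never stored on the node.
            smp = int(args[args.index("--smp") + 1])
        return smp, smp * self._mem_mb_per_cpu


node = ToyNode()
first = node.start(jvm_args=["--smp", "4"])  # transient override
second = node.start()                        # back to the persistent value
node.set_smp(3)                              # explicit, persistent change
third = node.start()
```

In this model a restart without arguments falls back to whatever was last set through the setter methods, which is the behaviour the description asks tests to rely on.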

Fixes: QATOOLS-138

@bhalevy bhalevy requested review from Copilot, fruch and paszkow February 10, 2026 10:48
@bhalevy bhalevy changed the base branch from master to next February 10, 2026 10:48
Copilot AI left a comment

Pull request overview

This PR changes how Scylla nodes build their startup arguments so --smp/--memory passed via SCYLLA_EXT_OPTS or jvm_args no longer implicitly persist across restarts (tests should use set_smp() / set_mem_mb_per_cpu() instead). In addition, it includes a broad set of refactors and tooling/test additions across CCM (packaging/CI, log parsing limits, config merge semantics, topology defaults, etc.).

Changes:

  • Adjust Scylla node startup option handling (normalize short opts; ignore --smp/--memory from env/cmdline for persistence; always add computed values).
  • Add/extend utilities and tests (config deep-merge, capped log grep, cluster cleanup behavior, build mode persistence/extraction, improved version error messages).
  • Migrate build/test tooling (move to pyproject.toml/uv, update CI and Nix configuration, remove SNI proxy support/files).

Reviewed changes

Copilot reviewed 47 out of 50 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| tests/test_version_error_messages.py | Adds tests for clearer version-detection error paths. |
| tests/test_scylla_repository.py | Adjusts repository tests for new caching behavior. |
| tests/test_scylla_ext_opts.py | Adds tests for short-form option normalization in SCYLLA_EXT_OPTS. |
| tests/test_scylla_cmds.py | Removes SNI proxy-related CLI tests. |
| tests/test_max_log_matches.py | Adds tests for limiting log-match collection. |
| tests/test_help.py | Adds a basic CLI help/usage smoke test. |
| tests/test_common.py | Adds tests for config merging and scylla mode persistence. |
| tests/test_cluster_cleanup.py | Adds tests for new cluster_cleanup() behavior. |
| tests/test_cluster_add_cmd.py | Adds tests for JMX port conflict logic when adding nodes. |
| tests/test_build_mode_extraction.py | Adds tests for extracting build mode from relocatable installs. |
| tests/test_add_node_without_datacenter.py | Adds tests for inferring DC/rack when adding nodes. |
| tests/ccmcluster.py | Removes SNI proxy start helper from test harness. |
| ssl/ccm_node.pem | Adds SSL cert material for tests/SSL features. |
| ssl/ccm_node.key | Adds SSL private key material for tests/SSL features. |
| ssl/ccm_node.crl | Adds CRL file alongside SSL test artifacts. |
| ssl/ccm_node.cer | Adds certificate file alongside SSL test artifacts. |
| setup.py | Removes legacy setuptools installer script. |
| requirements-test.txt | Removes legacy test requirements file. |
| pytest.ini | Removes legacy pytest configuration file (migrated to pyproject.toml). |
| pyproject.toml | Introduces PEP 621 packaging + pytest config + uv dev deps. |
| flake.nix | Updates Nix build/dev setup (pyproject format, Python version updates). |
| flake.lock | Updates pinned Nix inputs. |
| ccmlib/utils/sni_proxy.py | Removes SNI proxy implementation. |
| ccmlib/scylla_repository.py | Adds caching + tweaks manager package handling. |
| ccmlib/scylla_node.py | Updates start arg handling, adds option normalization, adds repair/sstable changes. |
| ccmlib/scylla_docker_cluster.py | Adjusts create_node signature; warns on deprecated thrift interface. |
| ccmlib/scylla_cluster.py | Persists scylla mode, adds SSL helper, adjusts repair/timeout behavior, removes SNI proxy integration. |
| ccmlib/resources/docker/sniproxy/Dockerfile | Removes SNI proxy Dockerfile. |
| ccmlib/repository.py | Adds caching to Cassandra repository setup. |
| ccmlib/node.py | Adds deep-merge config behavior; adds log-match limits; host-id parsing improvements. |
| ccmlib/dse_node.py | Adjusts constructor signature consistent with Node changes. |
| ccmlib/dse_cluster.py | Adjusts create_node signature; warns on deprecated thrift interface. |
| ccmlib/common.py | Adds config deep-merge helper; improves version error messages; custom JAVA_HOME support. |
| ccmlib/cmds/node_cmds.py | Removes SNI proxy CLI wiring. |
| ccmlib/cmds/cluster_cmds.py | Updates add/start options (rack option, JMX conflict logic, removes SNI proxy flags). |
| ccmlib/cluster_factory.py | Restores persisted scylla mode/ipprefix; makes seed loading more robust. |
| ccmlib/cluster.py | Adds rack/DC inference on add; adds cluster_cleanup; uses deep-merge for config options. |
| ccmlib/bin/init.py | Adds a Python entrypoint module for ccm. |
| ccm | Simplifies wrapper script to call ccmlib.bin.main. |
| README.md | Reworks docs for Scylla fork, uv usage, docker/reloc workflows. |
| MANIFEST.in | Removes SNI proxy docker resource inclusion. |
| .python-version | Pins Python version for local dev tooling. |
| .gitignore | Ignores uv lockfile and adjusts patterns. |
| .github/workflows/trigger_jenkins.yaml | Removes Jenkins trigger workflow. |
| .github/workflows/nix.yml | Updates Nix CI flow and Python version used. |
| .github/workflows/integration-tests.yml | Migrates CI install to uv and updates Python/Java matrix. |
| .github/workflows/close_issue_for_scylla_employee.yml | Adds automation to comment/close issues under certain conditions. |
| .github/workflows/ci-tests.yml | Tweaks PR branch triggers. |
| .github/workflows/call_jira_sync.yml | Adds Jira sync workflow hooks. |
| .github/copilot-instructions.md | Adds repo-specific Copilot guidance doc. |

Comment on lines 374 to 376
if max_matches is None:
    max_matches = int(os.environ.get('DTEST_MAX_LOG_MATCHES', '1000'))

Copilot AI Feb 10, 2026

DTEST_MAX_LOG_MATCHES is parsed with int(...) without validation. If the env var is set to a non-integer value, log parsing will raise ValueError and potentially break unrelated code paths. Consider handling invalid values gracefully (e.g., defaulting to 1000 with a warning).
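
A minimal sketch of the graceful handling suggested here, assuming a fallback of 1000 as in the excerpt. The helper name `max_log_matches_from_env` and the optional `env` parameter are hypothetical and exist only to keep the example self-contained.

```python
import logging
import os

logger = logging.getLogger(__name__)
DEFAULT_MAX_LOG_MATCHES = 1000


def max_log_matches_from_env(env=None):
    """Read DTEST_MAX_LOG_MATCHES, falling back to the default on bad input."""
    env = os.environ if env is None else env
    raw = env.get("DTEST_MAX_LOG_MATCHES")
    if raw is None:
        return DEFAULT_MAX_LOG_MATCHES
    try:
        return int(raw)
    except ValueError:
        # A malformed value should not break unrelated log parsing.
        logger.warning("Ignoring invalid DTEST_MAX_LOG_MATCHES=%r", raw)
        return DEFAULT_MAX_LOG_MATCHES
```

Passing a mapping as `env` is only a convenience for testing; a real caller would rely on `os.environ`.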

Comment on lines 2013 to 2015
url = f"http://{self.address()}:{self.api_port}/storage_service/keyspaces"
resp = requests.get(url=url, params={"replication": "vnodes"})
resp.raise_for_status()
Copilot AI Feb 10, 2026

The HTTP call in repair() uses requests.get(...) without a timeout, which can hang indefinitely if the API endpoint is slow/unresponsive. Add an explicit timeout (and consider handling requests.RequestException) to keep repair operations bounded.
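
One way to bound the call is sketched below. The function name `fetch_vnode_keyspaces` and the 30-second default are assumptions, and `http_get` is injected so the sketch does not depend on a particular HTTP library; in the PR's code it would presumably be `requests.get`, which accepts a `timeout` keyword.

```python
def fetch_vnode_keyspaces(http_get, address, api_port, timeout=30.0):
    """Query the keyspaces endpoint with a bounded wait.

    `http_get` is expected to behave like `requests.get`; any exception it
    raises (for example a timeout) propagates to the caller.
    """
    url = f"http://{address}:{api_port}/storage_service/keyspaces"
    resp = http_get(url=url, params={"replication": "vnodes"}, timeout=timeout)
    resp.raise_for_status()
    return resp
```

Whether a timeout should abort the repair or trigger a retry is a separate design decision that this sketch leaves to the caller.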

Comment on lines 14 to 16
dependencies = [
    "ruamel-yaml",
    "psutil",
Copilot AI Feb 10, 2026

The dependency is listed as "ruamel-yaml", but the project imports ruamel.yaml. The PyPI package name used elsewhere in this repo (previous setup.py) is ruamel.yaml; using the hyphenated name is likely to break installs/uv sync. Update the dependency entry to the correct distribution name.
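
One way to sanity-check this concern is against the name normalization rule from PEP 503, which treats runs of `-`, `_`, and `.` as equivalent when comparing project names. The sketch below applies that rule; the function name is made up for illustration, and it says nothing about how uv or any other installer resolves names in practice, which is worth verifying separately.

```python
import re


def normalize_project_name(name):
    """Normalize a distribution name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


same = normalize_project_name("ruamel-yaml") == normalize_project_name("ruamel.yaml")
```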

Comment on lines 226 to 227
@lru_cache(maxsize=None)
def setup(version, verbose=True, skip_downloads=False):
Copilot AI Feb 10, 2026

setup() is now @lru_cached, but its behavior depends on process environment (e.g., SCYLLA_MANAGER_PACKAGE, SCYLLA_EXT_OPTS) and it mutates os.environ. Caching only on (version, verbose, skip_downloads) can return stale results and skip required env updates if those env vars change within the same process. Consider removing caching here or moving env-dependent side effects out of the cached function (or include the relevant env inputs in the cache key).

Comment on lines 643 to 650
# Try cluster cleanup on the first running node
try:
    nodes[0].nodetool("cluster cleanup")
except NodetoolError:
    # Fallback: run regular cleanup on all nodes except the last (command doesn't exist)
    # The last node added to the cluster doesn't need cleanup
    for node in nodes[:-1]:
        node.nodetool("cleanup")
Copilot AI Feb 10, 2026

cluster_cleanup() falls back to per-node cleanup on any NodetoolError. That can mask real failures of nodetool cluster cleanup (e.g., auth/network issues) and proceed with a different operation unexpectedly. Tighten the exception handling to only fall back when the error indicates the subcommand is unsupported (e.g., "Unknown command"), and re-raise otherwise.
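
A sketch of the tightened handling, under the assumption that an unsupported subcommand shows up as "Unknown command" in the error text (the example given above; the actual message may differ). `NodetoolError` is redefined here only to keep the example self-contained, and the `unsupported_marker` parameter is hypothetical.

```python
class NodetoolError(Exception):
    """Stand-in for ccm's NodetoolError, defined here for self-containment."""


def cluster_cleanup(nodes, unsupported_marker="Unknown command"):
    """Run cluster-wide cleanup, falling back only if the subcommand is missing."""
    try:
        nodes[0].nodetool("cluster cleanup")
    except NodetoolError as exc:
        if unsupported_marker not in str(exc):
            raise  # a real failure: do not hide it behind the fallback
        # The last node added to the cluster doesn't need cleanup.
        for node in nodes[:-1]:
            node.nodetool("cleanup")
```

Matching on message text is brittle; if the exception exposes an exit code or a structured reason, checking that instead would be more robust.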

Comment on lines 1966 to 1978
if end_token:
    options.append("--end-token")
    options.append(f"{end_token}")

if local:
    options.append("--in-local-dc")

if partitioner_range:
    options.append("--partitioner-range")

if start_token:
    options.append("--start-token")
    options.append(f"{start_token}")
Copilot AI Feb 10, 2026

end_token/start_token are checked with truthiness (if end_token: / if start_token:). Token values can legitimately be 0 (or other falsy values), which would incorrectly omit the corresponding CLI flags. Use explicit is not None checks for these optional integers.
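
The difference is easy to see with a token of 0. The helper below is a hypothetical, reduced version of the option building in the excerpt that handles only the two token flags.

```python
def build_token_options(start_token=None, end_token=None):
    """Build CLI options, treating 0 as a valid token value."""
    options = []
    if end_token is not None:
        options += ["--end-token", str(end_token)]
    if start_token is not None:
        options += ["--start-token", str(start_token)]
    return options


opts = build_token_options(start_token=0, end_token=10)
```

With a plain truthiness check, `start_token=0` would have been dropped silently; here both flags are emitted.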

Comment on lines 688 to 692
# Lets search for default overrides in SCYLLA_EXT_OPTS
env_args = process_opts(os.getenv('SCYLLA_EXT_OPTS', "").split())

smp = self._smp
memory_in_mb = None
Copilot AI Feb 10, 2026

The PR title/description focuses on not preserving --smp/--memory across restarts, but this PR also introduces a large set of unrelated changes (packaging migration to pyproject.toml, removal of SNI proxy support, new SSL helpers, new repair APIs, etc.). This scope mismatch makes the change set harder to review and increases risk; consider splitting into focused PRs.

Comment on lines 1 to 5
-----BEGIN PRIVATE KEY-----
MIIG/AIBADANBgkqhkiG9w0BAQEFAASCBuYwggbiAgEAAoIBgQCqpkPYb3OTv3xH
xAS4QLlzOTAmjtooXJs67XEQEh41rBU631PZa0fnVKHrYCpXmXgEJCFbWYky3tLT
vrN04ry2HNlGJxjSvXBlzpaHDYIWwga9D4PPdYvLDN2Dd/BNLG4/slyBAK6yowIu
raxgtHSEZqqE6qF2ZJep9hvV803lXK7JqckhGxwEUPldEDUzn7Fvs829ipczkGZN
Copilot AI Feb 10, 2026

This file adds an unencrypted private key to the repository. Even if intended only for local testing, committing private keys is risky and can encourage insecure defaults if reused by users. Prefer generating test certificates/keys at runtime (or storing them under a clearly-scoped test fixture directory with prominent documentation), and ensure they are never used as a default for real clusters.

@bhalevy bhalevy force-pushed the make-smp-and-memory-cmd-line-args-transient branch from f46aee0 to 97e313d Compare February 10, 2026 11:02
@bhalevy bhalevy changed the base branch from next to master February 10, 2026 11:04
@bhalevy
Member Author

bhalevy commented Feb 10, 2026

@fruch does the target branch need to be master or next?
It looks like master diverged from next after 5392dd6

@fruch
Contributor

fruch commented Feb 10, 2026

> @fruch does the target branch need to be master or next? It looks like master diverged from next after 5392dd6

ccm isn't part of the next workflow since it's now a submodule of dtest, so master is the right one.

@fruch
Contributor

fruch commented Feb 10, 2026

@bhalevy

Is this basically a revert of #714?

If so, can you explain what the issue with it is? I.e., are there tests breaking because of it? Why is it wrong?

@bhalevy bhalevy force-pushed the make-smp-and-memory-cmd-line-args-transient branch from 97e313d to a56a2c0 Compare February 10, 2026 11:15
@bhalevy
Member Author

bhalevy commented Feb 10, 2026

> @bhalevy
>
> Is this basically a revert of #714?
>
> If so, can you explain what the issue with it is? I.e., are there tests breaking because of it? Why is it wrong?

yes, it basically reverts #714.
First, I'm concerned that #714 might cause dtest failures for tests that do not expect smp to be preserved across restarts, when it is passed in jvm_args.

The root issue is that the jvm_args were not supposed to be preserved in the node state, but rather used only for the specific call to start(). The fact that _memory was preserved was a mistake on my part and it was never documented as such. The patch now only sets it for reporting purposes, so that ScyllaNode.memory() can return it when it is calculated as smp * mem_mb_per_cpu. I pretty much kept the logic for seeding those values from the SCYLLA_EXT_OPTS environment variable, which may provide defaults that the jvm_args command-line options can override.
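
The precedence described here (environment defaults first, command-line overrides second, and memory derived as smp * mem_mb_per_cpu when nothing else supplies it) can be illustrated with a simplified sketch. The function name and the dict-based arguments are hypothetical, sizes are plain integers in MB rather than parsed size strings, and the sketch ignores the case where a test has set the values explicitly through the setter methods.

```python
def resolve_smp_and_memory(node_smp, mem_mb_per_cpu, env_args, cmd_args):
    """Pick smp and memory (in MB) for a single start() call.

    env_args and cmd_args are plain dicts such as {"--smp": "2"}, a
    simplification of whatever the real option parser returns.
    """
    smp = node_smp
    memory_mb = None
    # Later sources win: environment defaults first, then the command line.
    for args in (env_args, cmd_args):
        if "--smp" in args:
            smp = int(args["--smp"])
        if "--memory" in args:
            memory_mb = int(args["--memory"])
    if memory_mb is None:
        # Nothing supplied memory explicitly, so derive it.
        memory_mb = smp * mem_mb_per_cpu
    return smp, memory_mb
```

Nothing is written back to the node in this sketch, which mirrors the point that values passed to start() apply to that call only.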

@bhalevy bhalevy force-pushed the make-smp-and-memory-cmd-line-args-transient branch 2 times, most recently from 07cd6c6 to 04dfa0d Compare February 10, 2026 11:35
@fruch
Contributor

fruch commented Feb 10, 2026

> > @bhalevy
> > Is this basically a revert of #714?
> > If so, can you explain what the issue with it is? I.e., are there tests breaking because of it? Why is it wrong?
>
> yes, it basically reverts #714. First, I'm concerned that #714 might cause dtest failures for tests that do not expect smp to be preserved across restarts, when it is passed in jvm_args.
>
> The root issue is that the jvm_args were not supposed to be preserved in the node state, but rather used only for the specific call to start(). The fact that _memory was preserved was a mistake on my part and it was never documented as such. The patch now only sets it for reporting purposes, so that ScyllaNode.memory() can return it when it is calculated as smp * mem_mb_per_cpu. I pretty much kept the logic for seeding those values from the SCYLLA_EXT_OPTS environment variable, which may provide defaults that the jvm_args command-line options can override.

The claim in #714 is exactly the opposite: some tests change smp during restarts without intending to. So if we revert, the tests should first be fixed to use the APIs as you suggest (i.e., every test that restarts nodes needs to be reviewed).

So I'm not sure which is better. Do you have a concrete example of #714 causing a problem for a test?

…cmd_line

If a test wants to preserve memory and/or smp across node restarts,
it should use the methods that were intended for this: set_smp and
set_mem_mb_per_cpu.

Preserving those implicitly when passed to start was not intended.

Fixes: QATOOLS-138

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
@bhalevy bhalevy force-pushed the make-smp-and-memory-cmd-line-args-transient branch from 04dfa0d to ef53107 Compare February 10, 2026 11:41
@bhalevy
Member Author

bhalevy commented Feb 10, 2026

Add Fixes: QATOOLS-138

@fruch
Contributor

fruch commented Feb 10, 2026

> Add Fixes: QATOOLS-138

  1. putting this in a comment doesn't help (it should be in the PR description, commit, headline, or branch name)
  2. the correct place to report it is DTEST, with qa component ccm

@bhalevy
Member Author

bhalevy commented Feb 10, 2026

> > Add Fixes: QATOOLS-138
>
> 1. putting this in a comment doesn't help (it should be in the PR description, commit, headline, or branch name)

It is both in the PR description and in the commit.

> 2. the correct place to report it is DTEST, with qa component ccm

Okay, can you move QATOOLS-138 there, or do you want me to open one in DTEST and close the one in QATOOLS? Or should we just leave it as is and learn for next time?

@bhalevy
Member Author

bhalevy commented Feb 10, 2026

Update BYO job: jenkins.scylladb.com/view/master/job/scylla-master/job/byo/job/dtest-byo/1871

https://jenkins.scylladb.com/view/master/job/scylla-master/job/byo/job/dtest-byo/1871/testReport/

  • All scylla dtests passed.
  • There are some scylla-manager failures unrelated to this PR

@bhalevy
Member Author

bhalevy commented Feb 10, 2026

Update BYO job: jenkins.scylladb.com/view/master/job/scylla-master/job/byo/job/dtest-byo/1871

jenkins.scylladb.com/view/master/job/scylla-master/job/byo/job/dtest-byo/1871/testReport

  • All scylla dtests passed.

  • There are some scylla-manager failures unrelated to this PR

Cc @Michal-Leszczynski

@Michal-Leszczynski

SM restore failure is fixed by scylladb/scylla-manager#4753 which I'm planning to merge today and release this week.

In terms of the repair failure, it looks similar to what was described in scylladb/scylla-manager#4529 (comment) because:

  • we are using tablet repair API
  • we have node down
  • we have logs indicating that repair finished successfully: `repair - repair[0d996146-16a0-45f9-a500-26b1eb02da10]: completed successfully`

From my POV SM behaves as expected, as it just waits for tablet repair task to finish.
The problem is that tablet repair task hangs on raft read barrier after the actual repair has already finished.
I don't think that we consider this to be a bug on scylla side, it's kind of expected behavior (cc: @Deexie),
so it's the test that needs to use vnodes or bring the node back up at some point, so that raft read barrier is unblocked.

@Deexie
Contributor

Deexie commented Feb 10, 2026

> SM restore failure is fixed by scylladb/scylla-manager#4753 which I'm planning to merge today and release this week.
>
> In terms of the repair failure, it looks similar to what was described in scylladb/scylla-manager#4529 (comment) because:
>
> * we are using tablet repair API
> * we have node down
> * we have logs indicating that repair finished successfully: `repair - repair[0d996146-16a0-45f9-a500-26b1eb02da10]: completed successfully`
>
> From my POV SM behaves as expected, as it just waits for tablet repair task to finish. The problem is that tablet repair task hangs on raft read barrier after the actual repair has already finished. I don't think that we consider this to be a bug on scylla side, it's kind of expected behavior (cc: @Deexie), so it's the test that needs to use vnodes or bring the node back up at some point, so that raft read barrier is unblocked.

Yes, tablet repair may never finish if the node is down

Contributor

@paszkow paszkow left a comment

I think we should first go over the dtest tests and call set_smp() and set_memory() wherever necessary, and then get this in.

# use '--memory' in jmv_args if mem_mb_per_cpu was not set by the test
if not self._mem_mb_set_during_test and '--memory' in cmd_args:
    self._memory = self.parse_size(cmd_args['--memory'][0])
    memory_in_mb = self.parse_size(cmd_args['--memory'][0]) // MB

Don't you have to compute self._conf_mem_mb_per_cpu = int(memory_in_mb / smp) the same way as above, when parsing env_args?

Member Author

@bhalevy bhalevy Feb 10, 2026

no, since memory_in_mb is derived from self._conf_mem_mb_per_cpu only when not provided in any other way.
We set self._conf_mem_mb_per_cpu above only when smp and memory are given by the SCYLLA_EXT_OPTS environment variable (in contrast to jvm_args) and it is not set by the test.

@bhalevy
Member Author

bhalevy commented Feb 10, 2026

> I think we should first go over the dtest tests and call set_smp() and set_memory() wherever necessary, and then get this in.

I'd appreciate it if you sent a patch to do that.

@fruch
Contributor

fruch commented Feb 10, 2026

> > > Add Fixes: QATOOLS-138
> >
> > 1. putting this in a comment doesn't help (it should be in the PR description, commit, headline, or branch name)
>
> It is both in the PR description and in the commit.
>
> > 2. the correct place to report it is DTEST, with qa component ccm
>
> Okay, can you move QATOOLS-138 there, or do you want me to open one in DTEST and close the one in QATOOLS? Or should we just leave it as is and learn for next time?

I'll move it in Jira.

@roydahan
Contributor

This should wait until after the merge freeze; it's not a CI-stability issue.
