scylla-node: start: do not preserve smp and memory passed via env or cmd_line by bhalevy · Pull Request #718 · scylladb/scylla-ccm

bhalevy · 2026-02-10T10:48:46Z

If a test wants to preserve memory and/or smp across node restarts, it should use the methods that were intended for this: set_smp and set_mem_mb_per_cpu.

Preserving those implicitly when passed to start was not intended.

Fixes: QATOOLS-138

Copilot

Pull request overview

This PR changes how Scylla nodes build their startup arguments so --smp/--memory passed via SCYLLA_EXT_OPTS or jvm_args no longer implicitly persist across restarts (tests should use set_smp() / set_mem_mb_per_cpu() instead). In addition, it includes a broad set of refactors and tooling/test additions across CCM (packaging/CI, log parsing limits, config merge semantics, topology defaults, etc.).

Changes:

Adjust Scylla node startup option handling (normalize short opts; ignore --smp/--memory from env/cmdline for persistence; always add computed values).
Add/extend utilities and tests (config deep-merge, capped log grep, cluster cleanup behavior, build mode persistence/extraction, improved version error messages).
Migrate build/test tooling (move to pyproject.toml/uv, update CI and Nix configuration, remove SNI proxy support/files).

Reviewed changes

Copilot reviewed 47 out of 50 changed files in this pull request and generated 10 comments.

Show a summary per file

File	Description
tests/test_version_error_messages.py	Adds tests for clearer version-detection error paths.
tests/test_scylla_repository.py	Adjusts repository tests for new caching behavior.
tests/test_scylla_ext_opts.py	Adds tests for short-form option normalization in `SCYLLA_EXT_OPTS`.
tests/test_scylla_cmds.py	Removes SNI proxy-related CLI tests.
tests/test_max_log_matches.py	Adds tests for limiting log-match collection.
tests/test_help.py	Adds a basic CLI help/usage smoke test.
tests/test_common.py	Adds tests for config merging and scylla mode persistence.
tests/test_cluster_cleanup.py	Adds tests for new `cluster_cleanup()` behavior.
tests/test_cluster_add_cmd.py	Adds tests for JMX port conflict logic when adding nodes.
tests/test_build_mode_extraction.py	Adds tests for extracting build mode from relocatable installs.
tests/test_add_node_without_datacenter.py	Adds tests for inferring DC/rack when adding nodes.
tests/ccmcluster.py	Removes SNI proxy start helper from test harness.
ssl/ccm_node.pem	Adds SSL cert material for tests/SSL features.
ssl/ccm_node.key	Adds SSL private key material for tests/SSL features.
ssl/ccm_node.crl	Adds CRL file alongside SSL test artifacts.
ssl/ccm_node.cer	Adds certificate file alongside SSL test artifacts.
setup.py	Removes legacy setuptools installer script.
requirements-test.txt	Removes legacy test requirements file.
pytest.ini	Removes legacy pytest configuration file (migrated to `pyproject.toml`).
pyproject.toml	Introduces PEP 621 packaging + pytest config + uv dev deps.
flake.nix	Updates Nix build/dev setup (pyproject format, Python version updates).
flake.lock	Updates pinned Nix inputs.
ccmlib/utils/sni_proxy.py	Removes SNI proxy implementation.
ccmlib/scylla_repository.py	Adds caching + tweaks manager package handling.
ccmlib/scylla_node.py	Updates start arg handling, adds option normalization, adds repair/sstable changes.
ccmlib/scylla_docker_cluster.py	Adjusts create_node signature; warns on deprecated thrift interface.
ccmlib/scylla_cluster.py	Persists scylla mode, adds SSL helper, adjusts repair/timeout behavior, removes SNI proxy integration.
ccmlib/resources/docker/sniproxy/Dockerfile	Removes SNI proxy Dockerfile.
ccmlib/repository.py	Adds caching to Cassandra repository setup.
ccmlib/node.py	Adds deep-merge config behavior; adds log-match limits; host-id parsing improvements.
ccmlib/dse_node.py	Adjusts constructor signature consistent with Node changes.
ccmlib/dse_cluster.py	Adjusts create_node signature; warns on deprecated thrift interface.
ccmlib/common.py	Adds config deep-merge helper; improves version error messages; custom JAVA_HOME support.
ccmlib/cmds/node_cmds.py	Removes SNI proxy CLI wiring.
ccmlib/cmds/cluster_cmds.py	Updates add/start options (rack option, JMX conflict logic, removes SNI proxy flags).
ccmlib/cluster_factory.py	Restores persisted scylla mode/ipprefix; makes seed loading more robust.
ccmlib/cluster.py	Adds rack/DC inference on add; adds `cluster_cleanup`; uses deep-merge for config options.
ccmlib/bin/init.py	Adds a Python entrypoint module for `ccm`.
ccm	Simplifies wrapper script to call `ccmlib.bin.main`.
README.md	Reworks docs for Scylla fork, uv usage, docker/reloc workflows.
MANIFEST.in	Removes SNI proxy docker resource inclusion.
.python-version	Pins Python version for local dev tooling.
.gitignore	Ignores uv lockfile and adjusts patterns.
.github/workflows/trigger_jenkins.yaml	Removes Jenkins trigger workflow.
.github/workflows/nix.yml	Updates Nix CI flow and Python version used.
.github/workflows/integration-tests.yml	Migrates CI install to uv and updates Python/Java matrix.
.github/workflows/close_issue_for_scylla_employee.yml	Adds automation to comment/close issues under certain conditions.
.github/workflows/ci-tests.yml	Tweaks PR branch triggers.
.github/workflows/call_jira_sync.yml	Adds Jira sync workflow hooks.
.github/copilot-instructions.md	Adds repo-specific Copilot guidance doc.

pyproject.toml

Copilot · 2026-02-10T10:58:07Z

ccmlib/node.py

+        if max_matches is None:
+            max_matches = int(os.environ.get('DTEST_MAX_LOG_MATCHES', '1000'))
+


DTEST_MAX_LOG_MATCHES is parsed with int(...) without validation. If the env var is set to a non-integer value, log parsing will raise ValueError and potentially break unrelated code paths. Consider handling invalid values gracefully (e.g., defaulting to 1000 with a warning).

ccmlib/node.py

Copilot · 2026-02-10T10:58:07Z

ccmlib/scylla_node.py

+            url = f"http://{self.address()}:{self.api_port}/storage_service/keyspaces"
+            resp = requests.get(url=url, params={"replication": "vnodes"})
+            resp.raise_for_status()


The HTTP call in repair() uses requests.get(...) without a timeout, which can hang indefinitely if the API endpoint is slow/unresponsive. Add an explicit timeout (and consider handling requests.RequestException) to keep repair operations bounded.

Copilot · 2026-02-10T10:58:08Z

pyproject.toml

+dependencies = [
+    "ruamel-yaml",
+    "psutil",


The dependency is listed as "ruamel-yaml", but the project imports ruamel.yaml. The PyPI package name used elsewhere in this repo (previous setup.py) is ruamel.yaml; using the hyphenated name is likely to break installs/uv sync. Update the dependency entry to the correct distribution name.

Copilot · 2026-02-10T10:58:08Z

ccmlib/scylla_repository.py

+@lru_cache(maxsize=None)
 def setup(version, verbose=True, skip_downloads=False):


setup() is now @lru_cached, but its behavior depends on process environment (e.g., SCYLLA_MANAGER_PACKAGE, SCYLLA_EXT_OPTS) and it mutates os.environ. Caching only on (version, verbose, skip_downloads) can return stale results and skip required env updates if those env vars change within the same process. Consider removing caching here or moving env-dependent side effects out of the cached function (or include the relevant env inputs in the cache key).

Copilot · 2026-02-10T10:58:08Z

ccmlib/cluster.py

+        # Try cluster cleanup on the first running node
+        try:
+            nodes[0].nodetool("cluster cleanup")
+        except NodetoolError:
+            # Fallback: run regular cleanup on all nodes except the last (command doesn't exist)
+            # The last node added to the cluster doesn't need cleanup
+            for node in nodes[:-1]:
+                node.nodetool("cleanup")


cluster_cleanup() falls back to per-node cleanup on any NodetoolError. That can mask real failures of nodetool cluster cleanup (e.g., auth/network issues) and proceed with a different operation unexpectedly. Tighten the exception handling to only fall back when the error indicates the subcommand is unsupported (e.g., "Unknown command"), and re-raise otherwise.

Copilot · 2026-02-10T10:58:08Z

ccmlib/scylla_node.py

+        if end_token:
+            options.append("--end-token")
+            options.append(f"{end_token}")
+
+        if local:
+            options.append("--in-local-dc")
+
+        if partitioner_range:
+            options.append("--partitioner-range")
+
+        if start_token:
+            options.append("--start-token")
+            options.append(f"{start_token}")


end_token/start_token are checked with truthiness (if end_token: / if start_token:). Token values can legitimately be 0 (or other falsy values), which would incorrectly omit the corresponding CLI flags. Use explicit is not None checks for these optional integers.

Copilot · 2026-02-10T10:58:09Z

ccmlib/scylla_node.py

        # Lets search for default overrides in SCYLLA_EXT_OPTS
        env_args = process_opts(os.getenv('SCYLLA_EXT_OPTS', "").split())

+        smp = self._smp
+        memory_in_mb = None


The PR title/description focuses on not preserving --smp/--memory across restarts, but this PR also introduces a large set of unrelated changes (packaging migration to pyproject.toml, removal of SNI proxy support, new SSL helpers, new repair APIs, etc.). This scope mismatch makes the change set harder to review and increases risk; consider splitting into focused PRs.

Copilot · 2026-02-10T10:58:09Z

ssl/ccm_node.key

+-----BEGIN PRIVATE KEY-----
+MIIG/AIBADANBgkqhkiG9w0BAQEFAASCBuYwggbiAgEAAoIBgQCqpkPYb3OTv3xH
+xAS4QLlzOTAmjtooXJs67XEQEh41rBU631PZa0fnVKHrYCpXmXgEJCFbWYky3tLT
+vrN04ry2HNlGJxjSvXBlzpaHDYIWwga9D4PPdYvLDN2Dd/BNLG4/slyBAK6yowIu
+raxgtHSEZqqE6qF2ZJep9hvV803lXK7JqckhGxwEUPldEDUzn7Fvs829ipczkGZN


This file adds an unencrypted private key to the repository. Even if intended only for local testing, committing private keys is risky and can encourage insecure defaults if reused by users. Prefer generating test certificates/keys at runtime (or storing them under a clearly-scoped test fixture directory with prominent documentation), and ensure they are never used as a default for real clusters.

bhalevy · 2026-02-10T11:05:35Z

@fruch does the target branch needs to be master or next?
It looks like master diverged from next after 5392dd6

bhalevy · 2026-02-10T11:07:57Z

BYO test in https://jenkins.scylladb.com/view/master/job/scylla-master/job/byo/job/dtest-byo/1867/

fruch · 2026-02-10T11:08:21Z

@fruch does the target branch needs to be master or next? It looks like master diverged from next after 5392dd6

ccm isn't part of the next workflow, since it's now a submodule of dtest, hence master is the right one

fruch · 2026-02-10T11:09:12Z

@bhalevy

this iis basically a revert of
#714 ?

if so, can you explain what the issue with that ? i.e. there are test breaking cause of that ? why it's wrong ?

bhalevy · 2026-02-10T11:19:37Z

@bhalevy

this iis basically a revert of #714 ?

if so, can you explain what the issue with that ? i.e. there are test breaking cause of that ? why it's wrong ?

yes, it basically reverts #714.
First, I'm concerned that #714 might cause dtest failures for tests that do not expect smp to be preserved across restarts, when it is passed in jvm_args.

The root issue is that the jvm_args were not supposed to be preserved in the node state, but rather used only for the specific call to start(). The fact that _memory was preserved was a mistake on my part and it was never documented as such. The patch now only sets it for reporting purposes so we can return in by ScyllaNode.memory(), when it is calculated by smp * mem_mb_per_cpu. I pretty much kept the logic for seeding those value from the SCYLLA_EXT_OPTS environment that may provide defaults, and those could be overridden by the jvm_args command line options.

bhalevy · 2026-02-10T11:19:59Z

Update BYO job: https://jenkins.scylladb.com/view/master/job/scylla-master/job/byo/job/dtest-byo/1871/

fruch · 2026-02-10T11:40:32Z

@bhalevy
this iis basically a revert of #714 ?
if so, can you explain what the issue with that ? i.e. there are test breaking cause of that ? why it's wrong ?

yes, it basically reverts #714. First, I'm concerned that #714 might cause dtest failures for tests that do not expect smp to be preserved across restarts, when it is passed in jvm_args.

The root issue is that the jvm_args were not supposed to be preserved in the node state, but rather used only for the specific call to start(). The fact that _memory was preserved was a mistake on my part and it was never documented as such. The patch now only sets it for reporting purposes so we can return in by ScyllaNode.memory(), when it is calculated by smp * mem_mb_per_cpu. I pretty much kept the logic for seeding those value from the SCYLLA_EXT_OPTS environment that may provide defaults, and those could be overridden by the jvm_args command line options.

the clime in #714 is exactly the opposite, that some tests are changing smp during restarts, and that wasn't their intention, so if reverting tests should be fixed first to use apis as you suggest (i.e. all test that are restarting nodes needs to be review)

so I'm not sure what's better, do you concrete example #714 is causing a problem for a test ?

…cmd_line If a test wants to preserve memory and/or smp across node restarts, it should use the methods that were intended for this: set_smp and set_mem_mb_per_cpu. Preserving those implicitly when passed to start was not intended. Fixes: QATOOLS-138 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

bhalevy · 2026-02-10T11:41:38Z

Add Fixes: QATOOLS-138

fruch · 2026-02-10T11:45:49Z

Add Fixes: QATOOLS-138

putting this in comment doesn't help (should be in PR description, commit, headline or branch name
the correct place to report it is DTEST, with qa component ccm

bhalevy · 2026-02-10T12:59:43Z

Add Fixes: QATOOLS-138

putting this in comment doesn't help (should be in PR description, commit, headline or branch name

It is both in the PR description and in the commit.

the correct place to report it is DTEST, with qa component ccm

Okay, can you move QATOOLS-138 there or do you want me to open one in DTEST and close the one in QATOOLS,
or should we just leave it as is and learn for next time.

bhalevy · 2026-02-10T13:10:38Z

Update BYO job: jenkins.scylladb.com/view/master/job/scylla-master/job/byo/job/dtest-byo/1871

https://jenkins.scylladb.com/view/master/job/scylla-master/job/byo/job/dtest-byo/1871/testReport/

All scylla dtests passed.
There are some scylla-manager failures unrelated to this PR
- https://jenkins.scylladb.com/view/master/job/scylla-master/job/byo/job/dtest-byo/1871/testReport/manager_repair_tests/TestScyllaMgmtRepair/FullDtest___full_split008___test_ignore_down_hosts_with_one_down_host/
  - dtest_class.WaitTimeoutExpiredError: Wait for: Waiting until task: repair/221755b1-176a-4e13-ad75-206e3621f7d6 reaches status of: [<TaskStatus.ERROR: 'ERROR'>, <TaskStatus.STOPPED: 'STOPPED'>, <TaskStatus.DONE: 'DONE'>, <TaskStatus.ABORTED: 'ABORTED'>]: timeout - 600 seconds - expired
  - Unclear why, node reports INFO 2026-02-10 12:13:47,420 [shard 0:strm] repair - repair[0d996146-16a0-45f9-a500-26b1eb02da10]: completed successfully
- All other SM restore tests failed due to:
  - Cause: restore data: create audit (audit) with CREATE KEYSPACE audit WITH replication = {'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'datacenter1': '3'} AND durable_writes = true AND tablets = {'enabled': false}: Cannot add existing keyspace "audit" (potentially executed: false)

bhalevy · 2026-02-10T13:11:38Z

Update BYO job: jenkins.scylladb.com/view/master/job/scylla-master/job/byo/job/dtest-byo/1871

jenkins.scylladb.com/view/master/job/scylla-master/job/byo/job/dtest-byo/1871/testReport

All scylla dtests passed.

There are some scylla-manager failures unrelated to this PR

jenkins.scylladb.com/view/master/job/scylla-master/job/byo/job/dtest-byo/1871/testReport/manager_repair_tests/TestScyllaMgmtRepair/FullDtest___full_split008___test_ignore_down_hosts_with_one_down_host

dtest_class.WaitTimeoutExpiredError: Wait for: Waiting until task: repair/221755b1-176a-4e13-ad75-206e3621f7d6 reaches status of: [<TaskStatus.ERROR: 'ERROR'>, <TaskStatus.STOPPED: 'STOPPED'>, <TaskStatus.DONE: 'DONE'>, <TaskStatus.ABORTED: 'ABORTED'>]: timeout - 600 seconds - expired

Unclear why, node reports INFO 2026-02-10 12:13:47,420 [shard 0:strm] repair - repair[0d996146-16a0-45f9-a500-26b1eb02da10]: completed successfully

All other SM restore tests failed due to:

Cause: restore data: create audit (audit) with CREATE KEYSPACE audit WITH replication = {'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'datacenter1': '3'} AND durable_writes = true AND tablets = {'enabled': false}: Cannot add existing keyspace "audit" (potentially executed: false)

Cc @Michal-Leszczynski

Michal-Leszczynski · 2026-02-10T13:56:48Z

SM restore failure is fixed by scylladb/scylla-manager#4753 which I'm planning to merge today and release this week.

In terms of the repair failure, it looks similar to what was described in scylladb/scylla-manager#4529 (comment) because:

we are using tablet repair API
we have node down
we have logs implicating that repair finished successfully repair - repair[0d996146-16a0-45f9-a500-26b1eb02da10]: completed successfully

From my POV SM behaves as expected, as it just waits for tablet repair task to finish.
The problem is that tablet repair task hangs on raft read barrier after the actual repair has already finished.
I don't think that we consider this to be a bug on scylla side, it's kind of expected behavior (cc: @Deexie),
so it's the test that needs to use vnodes or bring the node back up at some point, so that raft read barrier is unblocked.

Deexie · 2026-02-10T14:04:11Z

SM restore failure is fixed by scylladb/scylla-manager#4753 which I'm planning to merge today and release this week.

In terms of the repair failure, it looks similar to what was described in scylladb/scylla-manager#4529 (comment) because:
* we are using tablet repair API

* we have node down

* we have logs implicating that repair finished successfully `repair - repair[0d996146-16a0-45f9-a500-26b1eb02da10]: completed successfully`
From my POV SM behaves as expected, as it just waits for tablet repair task to finish. The problem is that tablet repair task hangs on raft read barrier after the actual repair has already finished. I don't think that we consider this to be a bug on scylla side, it's kind of expected behavior (cc: @Deexie), so it's the test that needs to use vnodes or bring the node back up at some point, so that raft read barrier is unblocked.

Yes, tablet repair may never finish if the node is down

paszkow

I think we should first go over the dtests tests and call set_smp() and set_memory() whenever necessary and then get this in.

paszkow · 2026-02-10T14:17:54Z

ccmlib/scylla_node.py

        # use '--memory' in jmv_args if mem_mb_per_cpu was not set by the test
        if not self._mem_mb_set_during_test and '--memory' in cmd_args:
-            self._memory = self.parse_size(cmd_args['--memory'][0])
+            memory_in_mb = self.parse_size(cmd_args['--memory'][0]) // MB


Don't you have to compute self._conf_mem_mb_per_cpu = int(memory_in_mb / smp) same way as above when parsing env_args?

no, since memory_in_mb is derived from self._conf_mem_mb_per_cpu only when not provided in any other way.
We set self._conf_mem_mb_per_cpu above only when smp and memory are given by the SCYLLA_EXT_OPTS environment variable (in contrast to jvm_args) and it is not set by the test.

bhalevy · 2026-02-10T14:44:00Z

I think we should first go over the dtests tests and call set_smp() and set_memory() whenever necessary and then get this in.

I'd appreciate if you send a patch to do that.

fruch · 2026-02-10T15:05:00Z

Add Fixes: QATOOLS-138

putting this in comment doesn't help (should be in PR description, commit, headline or branch name

It is both in the PR description and in the commit.

the correct place to report it is DTEST, with qa component ccm

Okay, can you move QATOOLS-138 there or do you want me to open one in DTEST and close the one in QATOOLS, or should we just leave it as is and learn for next time.

I'll move it in jira

roydahan · 2026-02-12T15:26:05Z

This should wait after the merge freeze, it's not a CI-stability issue.

bhalevy requested review from Copilot, fruch and paszkow February 10, 2026 10:48

bhalevy changed the base branch from master to next February 10, 2026 10:48

Copilot started reviewing on behalf of bhalevy February 10, 2026 10:49 View session

Copilot AI reviewed Feb 10, 2026

View reviewed changes

bhalevy force-pushed the make-smp-and-memory-cmd-line-args-transient branch from f46aee0 to 97e313d Compare February 10, 2026 11:02

bhalevy changed the base branch from next to master February 10, 2026 11:04

bhalevy force-pushed the make-smp-and-memory-cmd-line-args-transient branch from 97e313d to a56a2c0 Compare February 10, 2026 11:15

bhalevy force-pushed the make-smp-and-memory-cmd-line-args-transient branch 2 times, most recently from 07cd6c6 to 04dfa0d Compare February 10, 2026 11:35

bhalevy force-pushed the make-smp-and-memory-cmd-line-args-transient branch from 04dfa0d to ef53107 Compare February 10, 2026 11:41

paszkow reviewed Feb 10, 2026

View reviewed changes

		if max_matches is None:
		max_matches = int(os.environ.get('DTEST_MAX_LOG_MATCHES', '1000'))

		@lru_cache(maxsize=None)
		def setup(version, verbose=True, skip_downloads=False):

Conversation

bhalevy commented Feb 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Copilot AI Feb 10, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI Feb 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 10, 2026

Choose a reason for hiding this comment

Uh oh!

bhalevy commented Feb 10, 2026

Uh oh!

bhalevy commented Feb 10, 2026

Uh oh!

fruch commented Feb 10, 2026

Uh oh!

fruch commented Feb 10, 2026

Uh oh!

bhalevy commented Feb 10, 2026

Uh oh!

bhalevy commented Feb 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fruch commented Feb 10, 2026

Uh oh!

bhalevy commented Feb 10, 2026

Uh oh!

fruch commented Feb 10, 2026

Uh oh!

bhalevy commented Feb 10, 2026

Uh oh!

bhalevy commented Feb 10, 2026

Uh oh!

bhalevy commented Feb 10, 2026

Uh oh!

Michal-Leszczynski commented Feb 10, 2026

Uh oh!

Deexie commented Feb 10, 2026

Uh oh!

paszkow left a comment

Choose a reason for hiding this comment

Uh oh!

paszkow Feb 10, 2026

Choose a reason for hiding this comment

Uh oh!

bhalevy Feb 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bhalevy commented Feb 10, 2026

Uh oh!

fruch commented Feb 10, 2026

Uh oh!

roydahan commented Feb 12, 2026

Uh oh!

bhalevy commented Feb 10, 2026 •

edited

Loading

bhalevy commented Feb 10, 2026 •

edited

Loading

bhalevy Feb 10, 2026 •

edited

Loading