scylla-node: start: do not preserve smp and memory passed via env or cmd_line#718
scylla-node: start: do not preserve smp and memory passed via env or cmd_line#718bhalevy wants to merge 1 commit intoscylladb:masterfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR changes how Scylla nodes build their startup arguments so --smp/--memory passed via SCYLLA_EXT_OPTS or jvm_args no longer implicitly persist across restarts (tests should use set_smp() / set_mem_mb_per_cpu() instead). In addition, it includes a broad set of refactors and tooling/test additions across CCM (packaging/CI, log parsing limits, config merge semantics, topology defaults, etc.).
Changes:
- Adjust Scylla node startup option handling (normalize short opts; ignore
--smp/--memoryfrom env/cmdline for persistence; always add computed values). - Add/extend utilities and tests (config deep-merge, capped log grep, cluster cleanup behavior, build mode persistence/extraction, improved version error messages).
- Migrate build/test tooling (move to
pyproject.toml/uv, update CI and Nix configuration, remove SNI proxy support/files).
Reviewed changes
Copilot reviewed 47 out of 50 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_version_error_messages.py | Adds tests for clearer version-detection error paths. |
| tests/test_scylla_repository.py | Adjusts repository tests for new caching behavior. |
| tests/test_scylla_ext_opts.py | Adds tests for short-form option normalization in SCYLLA_EXT_OPTS. |
| tests/test_scylla_cmds.py | Removes SNI proxy-related CLI tests. |
| tests/test_max_log_matches.py | Adds tests for limiting log-match collection. |
| tests/test_help.py | Adds a basic CLI help/usage smoke test. |
| tests/test_common.py | Adds tests for config merging and scylla mode persistence. |
| tests/test_cluster_cleanup.py | Adds tests for new cluster_cleanup() behavior. |
| tests/test_cluster_add_cmd.py | Adds tests for JMX port conflict logic when adding nodes. |
| tests/test_build_mode_extraction.py | Adds tests for extracting build mode from relocatable installs. |
| tests/test_add_node_without_datacenter.py | Adds tests for inferring DC/rack when adding nodes. |
| tests/ccmcluster.py | Removes SNI proxy start helper from test harness. |
| ssl/ccm_node.pem | Adds SSL cert material for tests/SSL features. |
| ssl/ccm_node.key | Adds SSL private key material for tests/SSL features. |
| ssl/ccm_node.crl | Adds CRL file alongside SSL test artifacts. |
| ssl/ccm_node.cer | Adds certificate file alongside SSL test artifacts. |
| setup.py | Removes legacy setuptools installer script. |
| requirements-test.txt | Removes legacy test requirements file. |
| pytest.ini | Removes legacy pytest configuration file (migrated to pyproject.toml). |
| pyproject.toml | Introduces PEP 621 packaging + pytest config + uv dev deps. |
| flake.nix | Updates Nix build/dev setup (pyproject format, Python version updates). |
| flake.lock | Updates pinned Nix inputs. |
| ccmlib/utils/sni_proxy.py | Removes SNI proxy implementation. |
| ccmlib/scylla_repository.py | Adds caching + tweaks manager package handling. |
| ccmlib/scylla_node.py | Updates start arg handling, adds option normalization, adds repair/sstable changes. |
| ccmlib/scylla_docker_cluster.py | Adjusts create_node signature; warns on deprecated thrift interface. |
| ccmlib/scylla_cluster.py | Persists scylla mode, adds SSL helper, adjusts repair/timeout behavior, removes SNI proxy integration. |
| ccmlib/resources/docker/sniproxy/Dockerfile | Removes SNI proxy Dockerfile. |
| ccmlib/repository.py | Adds caching to Cassandra repository setup. |
| ccmlib/node.py | Adds deep-merge config behavior; adds log-match limits; host-id parsing improvements. |
| ccmlib/dse_node.py | Adjusts constructor signature consistent with Node changes. |
| ccmlib/dse_cluster.py | Adjusts create_node signature; warns on deprecated thrift interface. |
| ccmlib/common.py | Adds config deep-merge helper; improves version error messages; custom JAVA_HOME support. |
| ccmlib/cmds/node_cmds.py | Removes SNI proxy CLI wiring. |
| ccmlib/cmds/cluster_cmds.py | Updates add/start options (rack option, JMX conflict logic, removes SNI proxy flags). |
| ccmlib/cluster_factory.py | Restores persisted scylla mode/ipprefix; makes seed loading more robust. |
| ccmlib/cluster.py | Adds rack/DC inference on add; adds cluster_cleanup; uses deep-merge for config options. |
| ccmlib/bin/init.py | Adds a Python entrypoint module for ccm. |
| ccm | Simplifies wrapper script to call ccmlib.bin.main. |
| README.md | Reworks docs for Scylla fork, uv usage, docker/reloc workflows. |
| MANIFEST.in | Removes SNI proxy docker resource inclusion. |
| .python-version | Pins Python version for local dev tooling. |
| .gitignore | Ignores uv lockfile and adjusts patterns. |
| .github/workflows/trigger_jenkins.yaml | Removes Jenkins trigger workflow. |
| .github/workflows/nix.yml | Updates Nix CI flow and Python version used. |
| .github/workflows/integration-tests.yml | Migrates CI install to uv and updates Python/Java matrix. |
| .github/workflows/close_issue_for_scylla_employee.yml | Adds automation to comment/close issues under certain conditions. |
| .github/workflows/ci-tests.yml | Tweaks PR branch triggers. |
| .github/workflows/call_jira_sync.yml | Adds Jira sync workflow hooks. |
| .github/copilot-instructions.md | Adds repo-specific Copilot guidance doc. |
| if max_matches is None: | ||
| max_matches = int(os.environ.get('DTEST_MAX_LOG_MATCHES', '1000')) | ||
|
|
There was a problem hiding this comment.
DTEST_MAX_LOG_MATCHES is parsed with int(...) without validation. If the env var is set to a non-integer value, log parsing will raise ValueError and potentially break unrelated code paths. Consider handling invalid values gracefully (e.g., defaulting to 1000 with a warning).
| url = f"http://{self.address()}:{self.api_port}/storage_service/keyspaces" | ||
| resp = requests.get(url=url, params={"replication": "vnodes"}) | ||
| resp.raise_for_status() |
There was a problem hiding this comment.
The HTTP call in repair() uses requests.get(...) without a timeout, which can hang indefinitely if the API endpoint is slow/unresponsive. Add an explicit timeout (and consider handling requests.RequestException) to keep repair operations bounded.
| dependencies = [ | ||
| "ruamel-yaml", | ||
| "psutil", |
There was a problem hiding this comment.
The dependency is listed as "ruamel-yaml", but the project imports ruamel.yaml. The PyPI package name used elsewhere in this repo (previous setup.py) is ruamel.yaml; using the hyphenated name is likely to break installs/uv sync. Update the dependency entry to the correct distribution name.
| @lru_cache(maxsize=None) | ||
| def setup(version, verbose=True, skip_downloads=False): |
There was a problem hiding this comment.
setup() is now @lru_cached, but its behavior depends on process environment (e.g., SCYLLA_MANAGER_PACKAGE, SCYLLA_EXT_OPTS) and it mutates os.environ. Caching only on (version, verbose, skip_downloads) can return stale results and skip required env updates if those env vars change within the same process. Consider removing caching here or moving env-dependent side effects out of the cached function (or include the relevant env inputs in the cache key).
| # Try cluster cleanup on the first running node | ||
| try: | ||
| nodes[0].nodetool("cluster cleanup") | ||
| except NodetoolError: | ||
| # Fallback: run regular cleanup on all nodes except the last (command doesn't exist) | ||
| # The last node added to the cluster doesn't need cleanup | ||
| for node in nodes[:-1]: | ||
| node.nodetool("cleanup") |
There was a problem hiding this comment.
cluster_cleanup() falls back to per-node cleanup on any NodetoolError. That can mask real failures of nodetool cluster cleanup (e.g., auth/network issues) and proceed with a different operation unexpectedly. Tighten the exception handling to only fall back when the error indicates the subcommand is unsupported (e.g., "Unknown command"), and re-raise otherwise.
| if end_token: | ||
| options.append("--end-token") | ||
| options.append(f"{end_token}") | ||
|
|
||
| if local: | ||
| options.append("--in-local-dc") | ||
|
|
||
| if partitioner_range: | ||
| options.append("--partitioner-range") | ||
|
|
||
| if start_token: | ||
| options.append("--start-token") | ||
| options.append(f"{start_token}") |
There was a problem hiding this comment.
end_token/start_token are checked with truthiness (if end_token: / if start_token:). Token values can legitimately be 0 (or other falsy values), which would incorrectly omit the corresponding CLI flags. Use explicit is not None checks for these optional integers.
| # Lets search for default overrides in SCYLLA_EXT_OPTS | ||
| env_args = process_opts(os.getenv('SCYLLA_EXT_OPTS', "").split()) | ||
|
|
||
| smp = self._smp | ||
| memory_in_mb = None |
There was a problem hiding this comment.
The PR title/description focuses on not preserving --smp/--memory across restarts, but this PR also introduces a large set of unrelated changes (packaging migration to pyproject.toml, removal of SNI proxy support, new SSL helpers, new repair APIs, etc.). This scope mismatch makes the change set harder to review and increases risk; consider splitting into focused PRs.
| -----BEGIN PRIVATE KEY----- | ||
| MIIG/AIBADANBgkqhkiG9w0BAQEFAASCBuYwggbiAgEAAoIBgQCqpkPYb3OTv3xH | ||
| xAS4QLlzOTAmjtooXJs67XEQEh41rBU631PZa0fnVKHrYCpXmXgEJCFbWYky3tLT | ||
| vrN04ry2HNlGJxjSvXBlzpaHDYIWwga9D4PPdYvLDN2Dd/BNLG4/slyBAK6yowIu | ||
| raxgtHSEZqqE6qF2ZJep9hvV803lXK7JqckhGxwEUPldEDUzn7Fvs829ipczkGZN |
There was a problem hiding this comment.
This file adds an unencrypted private key to the repository. Even if intended only for local testing, committing private keys is risky and can encourage insecure defaults if reused by users. Prefer generating test certificates/keys at runtime (or storing them under a clearly-scoped test fixture directory with prominent documentation), and ensure they are never used as a default for real clusters.
f46aee0 to
97e313d
Compare
97e313d to
a56a2c0
Compare
yes, it basically reverts #714. The root issue is that the jvm_args were not supposed to be preserved in the node state, but rather used only for the specific call to |
07cd6c6 to
04dfa0d
Compare
the clime in #714 is exactly the opposite, that some tests are changing smp during restarts, and that wasn't their intention, so if reverting tests should be fixed first to use apis as you suggest (i.e. all test that are restarting nodes needs to be review) so I'm not sure what's better, do you concrete example #714 is causing a problem for a test ? |
…cmd_line If a test wants to preserve memory and/or smp across node restarts, it should use the methods that were intended for this: set_smp and set_mem_mb_per_cpu. Preserving those implicitly when passed to start was not intended. Fixes: QATOOLS-138 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
04dfa0d to
ef53107
Compare
|
Add |
|
It is both in the PR description and in the commit.
Okay, can you move QATOOLS-138 there or do you want me to open one in DTEST and close the one in QATOOLS, |
https://jenkins.scylladb.com/view/master/job/scylla-master/job/byo/job/dtest-byo/1871/testReport/
|
|
|
SM restore failure is fixed by scylladb/scylla-manager#4753 which I'm planning to merge today and release this week. In terms of the repair failure, it looks similar to what was described in scylladb/scylla-manager#4529 (comment) because:
From my POV SM behaves as expected, as it just waits for tablet repair task to finish. |
Yes, tablet repair may never finish if the node is down |
paszkow
left a comment
There was a problem hiding this comment.
I think we should first go over the dtests tests and call set_smp() and set_memory() whenever necessary and then get this in.
| # use '--memory' in jmv_args if mem_mb_per_cpu was not set by the test | ||
| if not self._mem_mb_set_during_test and '--memory' in cmd_args: | ||
| self._memory = self.parse_size(cmd_args['--memory'][0]) | ||
| memory_in_mb = self.parse_size(cmd_args['--memory'][0]) // MB |
There was a problem hiding this comment.
Don't you have to compute self._conf_mem_mb_per_cpu = int(memory_in_mb / smp) same way as above when parsing env_args?
There was a problem hiding this comment.
no, since memory_in_mb is derived from self._conf_mem_mb_per_cpu only when not provided in any other way.
We set self._conf_mem_mb_per_cpu above only when smp and memory are given by the SCYLLA_EXT_OPTS environment variable (in contrast to jvm_args) and it is not set by the test.
I'd appreciate if you send a patch to do that. |
I'll move it in jira |
|
This should wait after the merge freeze, it's not a CI-stability issue. |
If a test wants to preserve memory and/or smp across node restarts, it should use the methods that were intended for this: set_smp and set_mem_mb_per_cpu.
Preserving those implicitly when passed to start was not intended.
Fixes: QATOOLS-138