fix(nemesis): use dedicated keyspace for refresh/load-and-stream to avoid stress conflicts#15253
fix(nemesis): use dedicated keyspace for refresh/load-and-stream to avoid stress conflicts#15253fruch wants to merge 1 commit into
Conversation
📝 WalkthroughWalkthroughRenames the default keyspace from Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@claude review |
|
Claude finished @fruch's task in 3m 29s —— View job Code Review
The core approach is correct — isolating refresh/load-and-stream nemesis to a dedicated Critical Bug
# Before
def validate_data_count_after_upload(cls, node, keyspace_name: str = "keyspace1", table_name: str = "standard1"):
# After
def validate_data_count_after_upload(cls, node, keyspace_name: str = "keyspace_refresh", table_name: str = "standard2"):The Both callers in if int(SstableLoadUtils.validate_data_count_after_upload(node=node)) > 0: # line 3303
result = SstableLoadUtils.validate_data_count_after_upload(node=node) # line 3318After this PR, they query Potential Regression in K8s
|
| Call | Before PR | After PR |
|---|---|---|
upload_sstables(node, ...) |
keyspace1.standard1 |
keyspace_refresh.standard1 |
run_refresh(node, ...) |
keyspace1 standard1 (nodetool) |
keyspace_refresh standard1 (hardcoded) |
validate_data_count_after_upload(node=node) |
keyspace1.standard1 |
keyspace_refresh.standard2 |
But the surrounding code still checks for keyspace1:
test_keyspace_name = "keyspace1" # line 3295
# ...
create_schema = not (test_keyspace_name in test_keyspaces) # checks for keyspace1
if not create_schema:
if int(SstableLoadUtils.validate_data_count_after_upload(node=node)) > 0: # now checks keyspace_refresh.standard2!
return # early-exit logic is brokenThe early-exit guard (line 3303) and the assertion (line 3319) are now both checking the wrong keyspace/table. The callers should be updated to pass explicit keyspace_name="keyspace1" and table_name="standard1", or the function should not have its defaults changed if existing callers rely on them.
run_refresh should accept keyspace_name parameter
sdcm/utils/sstable/load_utils.py:163 now hardcodes the keyspace:
node.run_nodetool(sub_cmd="refresh", args="-- keyspace_refresh standard1")The method has no keyspace_name parameter. This breaks callers that previously used this method for other keyspaces (like the K8s path above). Consider:
@staticmethod
def run_refresh(node, test_data: namedtuple, keyspace_name: str = "keyspace_refresh", table_name: str = "standard1"):
node.run_nodetool(sub_cmd="refresh", args=f"-- {keyspace_name} {table_name}")Minor Issues
Integration test: silent pass if cfstats fails
unit_tests/integration/test_nemesis_refresh.py:56:
result = docker_scylla.run_nodetool(sub_cmd="cfstats", args="keyspace_refresh.standard1")
if result is not None and result.exit_status == 0:
# ... all assertions are in hereIf cfstats fails, the test passes without verifying anything. Add an assertion:
assert result is not None and result.exit_status == 0, "cfstats failed"Integration test docstring typos (test_nemesis_refresh.py:20):
"refrash"→"refresh""locall"→"locally"
What's Good
- The
_prepare_test_tablefix (addingkeyspace={ks}to the cassandra-stress-schema) is a necessary correction and would have causedkeyspace_refreshto be silently ignored before. - The
local_cmd_runner.pyuserparameter addition is a clean interface fix needed forupload_sstables'ssudo(..., user="scylla")call. - The integration test covers the actual nodetool refresh flow end-to-end, which is more reliable than unit mocks for this kind of I/O-heavy path.
fake_cluster.pyand the test data log are correctly updated to match the new keyspace name.
✅ Test Summary: PASSED✅ Precommit: PASSED
✅ Tests: PASSED
|
Add test_refresh_monkey_flow that exercises the full nodetool refresh path using a Docker-based Scylla node: writes data with cassandra-stress, uploads sstables, runs nodetool refresh, and verifies the data is loaded via CQL query. Includes related changes to nemesis, sstable load utilities, and test fixtures needed to support the refresh test flow.
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@sdcm/nemesis/__init__.py`:
- Around line 1690-1705: The cfstats precheck in the nemesis flow is silently
skipping the rest of the action when `self.target_node.run_nodetool(...)`
returns a non-zero exit status, so the disruptor can appear successful without
doing any work. Update the relevant nemesis methods in
`sdcm/nemesis/__init__.py` that use this precheck to fail fast instead of
returning normally, either by raising `UnsupportedNemesis` or an assertion, and
keep the existing load/refresh logic in the successful path (`run_nodetool`,
`SstableLoadUtils.run_load_and_stream`, and the surrounding action scope). Also
apply the same behavior to the other matching disruptor block referenced by the
review.
In `@sdcm/utils/sstable/load_utils.py`:
- Around line 253-255: The validate_data_count_after_upload helper is defaulting
to the wrong table for the refresh flow. Update the default table_name in
validate_data_count_after_upload so callers that omit it validate the same table
used by the rest of the keyspace_refresh flow, and check any nearby
callers/tests that rely on the default to ensure they still point to the
intended table.
In `@unit_tests/integration/test_nemesis_refresh.py`:
- Around line 48-73: The integration test in the refresh flow skips all
validation when docker_scylla.run_nodetool("cfstats", ...) fails, so a broken
setup can still pass; update the test around the cfstats precondition to fail
immediately instead of wrapping the rest of the logic in a conditional. Use the
existing result from run_nodetool and the surrounding refresh/assertion block in
test_nemesis_refresh to either assert exit_status == 0 or raise on failure
before running the cqlsh check, upload_sstables, and run_refresh steps.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: df6e74ce-2f61-4e4a-a2a6-7c25b6418d82
⛔ Files ignored due to path filters (1)
unit_tests/test_data/load_and_stream.logis excluded by!**/*.log
📒 Files selected for processing (5)
sdcm/nemesis/__init__.pysdcm/remote/local_cmd_runner.pysdcm/utils/sstable/load_utils.pyunit_tests/integration/test_nemesis_refresh.pyunit_tests/lib/fake_cluster.py
| result = self.target_node.run_nodetool(sub_cmd="cfstats", args="keyspace_refresh.standard1") | ||
|
|
||
| if result is not None and result.exit_status == 0: | ||
| map_files_to_node = SstableLoadUtils.distribute_test_files_to_cluster_nodes( | ||
| nodes=self.cluster.data_nodes, test_data=test_data | ||
| ) | ||
| for sstables_info, load_on_node in map_files_to_node: | ||
| self.actions_log.info(f"Uploading sstables to {load_on_node.name}") | ||
| SstableLoadUtils.upload_sstables(load_on_node, test_data=sstables_info, table_name="standard1") | ||
| SstableLoadUtils.upload_sstables( | ||
| load_on_node, test_data=sstables_info, keyspace_name="keyspace_refresh", table_name="standard1" | ||
| ) | ||
| # NOTE: on K8S logs may appear with a delay, so add a bigger timeout for it. | ||
| # See https://github.com/scylladb/scylla-cluster-tests/issues/6314 | ||
| kwargs = {"start_timeout": 1800, "end_timeout": 1800} if self._is_it_on_kubernetes() else {} | ||
| with self.action_log_scope(f"Loading and streaming sstables on {load_on_node.name} node"): | ||
| SstableLoadUtils.run_load_and_stream(load_on_node, **kwargs) |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Fail the nemesis when the cfstats precheck fails.
Both disruptors silently return if run_nodetool("cfstats") is non-zero, so the run can be recorded as a success without loading or refreshing anything. Treat this as UnsupportedNemesis or an assertion instead of a no-op.
Also applies to: 1724-1760
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@sdcm/nemesis/__init__.py` around lines 1690 - 1705, The cfstats precheck in
the nemesis flow is silently skipping the rest of the action when
`self.target_node.run_nodetool(...)` returns a non-zero exit status, so the
disruptor can appear successful without doing any work. Update the relevant
nemesis methods in `sdcm/nemesis/__init__.py` that use this precheck to fail
fast instead of returning normally, either by raising `UnsupportedNemesis` or an
assertion, and keep the existing load/refresh logic in the successful path
(`run_nodetool`, `SstableLoadUtils.run_load_and_stream`, and the surrounding
action scope). Also apply the same behavior to the other matching disruptor
block referenced by the review.
| def validate_data_count_after_upload( | ||
| cls, node, keyspace_name: str = "keyspace_refresh", table_name: str = "standard2" | ||
| ): |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
validate_data_count_after_upload() now defaults to the wrong table.
Everything else in this flow was moved to keyspace_refresh.standard1, but this helper now defaults to keyspace_refresh.standard2. Any caller that relies on defaults will validate a different table than the one loaded/refreshed and can report a false result.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@sdcm/utils/sstable/load_utils.py` around lines 253 - 255, The
validate_data_count_after_upload helper is defaulting to the wrong table for the
refresh flow. Update the default table_name in validate_data_count_after_upload
so callers that omit it validate the same table used by the rest of the
keyspace_refresh flow, and check any nearby callers/tests that rely on the
default to ensure they still point to the intended table.
| result = docker_scylla.run_nodetool(sub_cmd="cfstats", args="keyspace_refresh.standard1") | ||
|
|
||
| if result is not None and result.exit_status == 0: | ||
| key = "0x32373131364f334f3830" | ||
| # Check one special key before refresh, we will verify refresh by query in the end | ||
| # Note: we can't DELETE the key before refresh, otherwise the old sstable won't be loaded | ||
| # TRUNCATE can be used the clean the table, but we can't do it for keyspace_refresh.standard1 | ||
| query_verify = f"SELECT * FROM keyspace_refresh.standard1 WHERE key={key}" | ||
| result = docker_scylla.run_cqlsh(query_verify) | ||
| if "(0 rows)" in result.stdout: | ||
| logging.debug("Key %s does not exist before refresh", key) | ||
| else: | ||
| logging.debug("Key %s already exists before refresh", key) | ||
|
|
||
| # Executing rolling refresh one by one | ||
| for node in [docker_scylla]: | ||
| SstableLoadUtils.upload_sstables( | ||
| node, | ||
| test_data=test_data[0], | ||
| table_name="standard1", | ||
| is_cloud_cluster=False, | ||
| ) | ||
| SstableLoadUtils.run_refresh(node, test_data=test_data[0]) | ||
| # Verify that the special key is loaded by SELECT query | ||
| result = docker_scylla.run_cqlsh(query_verify) | ||
| assert "(1 rows)" in result.stdout, f"The key {key} is not loaded by `nodetool refresh`" |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
This test passes without testing anything when cfstats fails.
The refresh path and final assertion are skipped entirely unless cfstats returns zero. A broken setup therefore reports green CI instead of a failed test. Assert the precondition or fail immediately.
🧰 Tools
🪛 Ruff (0.15.18)
[error] 55-55: Possible SQL injection vector through string-based query construction
(S608)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@unit_tests/integration/test_nemesis_refresh.py` around lines 48 - 73, The
integration test in the refresh flow skips all validation when
docker_scylla.run_nodetool("cfstats", ...) fails, so a broken setup can still
pass; update the test around the cfstats precondition to fail immediately
instead of wrapping the rest of the logic in a conditional. Use the existing
result from run_nodetool and the surrounding refresh/assertion block in
test_nemesis_refresh to either assert exit_status == 0 or raise on failure
before running the cqlsh check, upload_sstables, and run_refresh steps.
Fixes SCT-555
Refresh and load-and-stream nemesis operations use
keyspace1.standard1which is the same table used by the main stress command (cassandra-stress). This causes conflicts — the nemesis operations interfere with ongoing stress writes, leading to failures.What
Move refresh/load-and-stream nemesis to use a dedicated
keyspace_refresh.standard1table instead ofkeyspace1.standard1, so these nemesis operations no longer conflict with the main stress workload.Changes
sdcm/nemesis/__init__.py— switchdisrupt_nodetool_refreshanddisrupt_load_and_streamfromkeyspace1tokeyspace_refresh; pass explicitkeyspaceto_prepare_test_tableandSstableLoadUtilscallssdcm/utils/sstable/load_utils.py— update defaultkeyspace_nameparameter fromkeyspace1tokeyspace_refreshacrosscalculate_columns_count_in_table,upload_sstables,load_and_stream, andrun_refreshsdcm/remote/local_cmd_runner.py— add missinguserparameter torunmethod signatureunit_tests/integration/test_nemesis_refresh.py— new integration test (test_refresh_monkey_flow) that validates the refresh flow end-to-end using Dockerunit_tests/lib/fake_cluster.py— test fixture updateunit_tests/test_data/load_and_stream.log— updated test data to match new keyspace nameTesting
uv run python -m pytest unit_tests/integration/test_nemesis_refresh.py -x -q --no-header --tb=short -n0 # 1 passed in 35.30sTesting