
fix(nemesis): use dedicated keyspace for refresh/load-and-stream to avoid stress conflicts#15253

Open
fruch wants to merge 1 commit into scylladb:master from fruch:fix_SCT-555

Conversation

@fruch

@fruch fruch commented Jun 29, 2026

Contributor

Fixes SCT-555

Refresh and load-and-stream nemesis operations use keyspace1.standard1 which is the same table used by the main stress command (cassandra-stress). This causes conflicts — the nemesis operations interfere with ongoing stress writes, leading to failures.

What

Move refresh/load-and-stream nemesis to use a dedicated keyspace_refresh.standard1 table instead of keyspace1.standard1, so these nemesis operations no longer conflict with the main stress workload.

Changes

  • sdcm/nemesis/__init__.py — switch disrupt_nodetool_refresh and disrupt_load_and_stream from keyspace1 to keyspace_refresh; pass explicit keyspace to _prepare_test_table and SstableLoadUtils calls
  • sdcm/utils/sstable/load_utils.py — update default keyspace_name parameter from keyspace1 to keyspace_refresh across calculate_columns_count_in_table, upload_sstables, run_load_and_stream, and run_refresh
  • sdcm/remote/local_cmd_runner.py — add missing user parameter to run method signature
  • unit_tests/integration/test_nemesis_refresh.py — new integration test (test_refresh_monkey_flow) that validates the refresh flow end-to-end using Docker
  • unit_tests/lib/fake_cluster.py — test fixture update
  • unit_tests/test_data/load_and_stream.log — updated test data to match new keyspace name

Testing

uv run python -m pytest unit_tests/integration/test_nemesis_refresh.py -x -q --no-header --tb=short -n0
# 1 passed in 35.30s

@coderabbitai

coderabbitai Bot commented Jun 29, 2026


📝 Walkthrough

Renames the default keyspace from keyspace1 to keyspace_refresh across SstableLoadUtils helper method signatures (calculate_columns_count_in_table, upload_sstables, run_load_and_stream, run_refresh, create_keyspace, validate_data_count_after_upload) and updates disrupt_load_and_stream, disrupt_nodetool_refresh, and _prepare_test_table in the nemesis module to target keyspace_refresh.standard1. A new integration test test_refresh_monkey_flow is added. LocalCmdRunner.run gains an optional user parameter. FakeSstableRemoter log line is updated to reflect keyspace_refresh.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested labels

Bug

Suggested reviewers

  • pehala
  • cezarmoise
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 29.41% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly summarizes the main change: moving refresh/load-and-stream nemesis to a dedicated keyspace to avoid stress conflicts. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The description follows the required template and includes summary, changes, and testing evidence; only optional self-review/reminder items are omitted. |

@fruch fruch changed the title from "test(nemesis-refresh): add integration test for refresh monkey flow" to "fix(nemesis): use dedicated keyspace for refresh/load-and-stream to avoid stress conflicts" Jun 29, 2026
@fruch

fruch commented Jun 29, 2026

Contributor Author

@claude review

@claude

claude Bot commented Jun 29, 2026


Claude finished @fruch's task in 3m 29s


Code Review

  • Gather context
  • Read changed files
  • Review sdcm/nemesis/__init__.py
  • Review sdcm/utils/sstable/load_utils.py
  • Review sdcm/remote/local_cmd_runner.py
  • Review unit_tests/integration/test_nemesis_refresh.py
  • Review unit_tests/lib/fake_cluster.py
  • Post review feedback

The core approach is correct — isolating refresh/load-and-stream nemesis to a dedicated keyspace_refresh prevents conflicts with the main cassandra-stress workload. The fix in _prepare_test_table to include keyspace={ks} in the -schema string is required and was missing before. The integration test adds real confidence. However, there are issues with how the default-value changes affect existing callers outside the nemesis path.


Critical Bug

validate_data_count_after_upload: default table_name changed to "standard2"

sdcm/utils/sstable/load_utils.py around line 254:

# Before
def validate_data_count_after_upload(cls, node, keyspace_name: str = "keyspace1", table_name: str = "standard1"):
# After
def validate_data_count_after_upload(cls, node, keyspace_name: str = "keyspace_refresh", table_name: str = "standard2"):

The table_name default changed from "standard1" to "standard2". No code in this PR creates a standard2 table, and the cassandra-stress commands generate standard1. This looks like a typo.

Both callers in sdcm/cluster_k8s/__init__.py:3303,3318 call this method without any arguments:

if int(SstableLoadUtils.validate_data_count_after_upload(node=node)) > 0:   # line 3303
result = SstableLoadUtils.validate_data_count_after_upload(node=node)        # line 3318

After this PR, they query keyspace_refresh.standard2, a table that is never created, instead of keyspace1.standard1. This will cause runtime errors or silent false positives in the K8s refresh path.


Potential Regression in K8s prefill_cluster

sdcm/cluster_k8s/__init__.py:3286–3319 uses SstableLoadUtils entirely through default arguments. After this PR:

| Call | Before PR | After PR |
|---|---|---|
| upload_sstables(node, ...) | keyspace1.standard1 | keyspace_refresh.standard1 |
| run_refresh(node, ...) | keyspace1 standard1 (nodetool) | keyspace_refresh standard1 (hardcoded) |
| validate_data_count_after_upload(node=node) | keyspace1.standard1 | keyspace_refresh.standard2 |

But the surrounding code still checks for keyspace1:

test_keyspace_name = "keyspace1"          # line 3295
# ...
create_schema = not (test_keyspace_name in test_keyspaces)  # checks for keyspace1
if not create_schema:
    if int(SstableLoadUtils.validate_data_count_after_upload(node=node)) > 0:  # now checks keyspace_refresh.standard2!
        return  # early-exit logic is broken

The early-exit guard (line 3303) and the assertion (line 3319) are now both checking the wrong keyspace/table. The callers should be updated to pass explicit keyspace_name="keyspace1" and table_name="standard1", or the function should not have its defaults changed if existing callers rely on them.
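
The explicit-argument option can be illustrated with a minimal sketch. It assumes a stand-in function that only mimics how validate_data_count_after_upload resolves its defaults (the real method queries the node); the constants TEST_KEYSPACE and TEST_TABLE are hypothetical names chosen for this example.

```python
# Stand-in for SstableLoadUtils.validate_data_count_after_upload (assumption):
# it only reports which table it would query, using the defaults from this PR.
def validate_data_count_after_upload(node, keyspace_name="keyspace_refresh", table_name="standard2"):
    return f"{keyspace_name}.{table_name}"


# Hypothetical constants for the K8s prefill path.
TEST_KEYSPACE = "keyspace1"
TEST_TABLE = "standard1"

# Relying on defaults: the target follows whatever the helper's defaults are.
implicit_target = validate_data_count_after_upload(node=None)

# Passing the target explicitly: the call site is unaffected by default changes.
explicit_target = validate_data_count_after_upload(
    node=None, keyspace_name=TEST_KEYSPACE, table_name=TEST_TABLE
)

print(implicit_target)
print(explicit_target)
```

With explicit arguments the K8s path would keep targeting keyspace1.standard1 regardless of which defaults the nemesis path prefers.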


run_refresh should accept keyspace_name parameter

sdcm/utils/sstable/load_utils.py:163 now hardcodes the keyspace:

node.run_nodetool(sub_cmd="refresh", args="-- keyspace_refresh standard1")

The method has no keyspace_name parameter. This breaks callers that previously used this method for other keyspaces (like the K8s path above). Consider:

@staticmethod
def run_refresh(node, test_data: namedtuple, keyspace_name: str = "keyspace_refresh", table_name: str = "standard1"):
    node.run_nodetool(sub_cmd="refresh", args=f"-- {keyspace_name} {table_name}")
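
A runnable sketch of the same idea, assuming a hypothetical FakeNode test double that records nodetool calls. The test_data parameter is left out here for brevity, so this is not the real method signature.

```python
from dataclasses import dataclass, field


@dataclass
class FakeNode:
    """Hypothetical test double that records nodetool invocations."""

    calls: list = field(default_factory=list)

    def run_nodetool(self, sub_cmd, args=""):
        self.calls.append((sub_cmd, args))


def run_refresh(node, keyspace_name="keyspace_refresh", table_name="standard1"):
    # Build the nodetool arguments from parameters instead of a hardcoded string.
    node.run_nodetool(sub_cmd="refresh", args=f"-- {keyspace_name} {table_name}")


node = FakeNode()
run_refresh(node)                             # nemesis path, uses the defaults
run_refresh(node, keyspace_name="keyspace1")  # a caller that targets keyspace1
print(node.calls)
```

Callers that need a different keyspace can then override it without touching the helper.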

Minor Issues

Integration test: silent pass if cfstats fails

unit_tests/integration/test_nemesis_refresh.py:56:

result = docker_scylla.run_nodetool(sub_cmd="cfstats", args="keyspace_refresh.standard1")
if result is not None and result.exit_status == 0:
    # ... all assertions are in here

If cfstats fails, the test passes without verifying anything. Add an assertion:

assert result is not None and result.exit_status == 0, "cfstats failed"
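
The difference between the two shapes can be shown in isolation. This is a minimal sketch, assuming a fake result object with only an exit_status attribute; check_with_guard and check_with_assert are hypothetical names, not functions from the test file.

```python
from types import SimpleNamespace


def check_with_guard(result):
    """Mirrors the current test shape: returns without checking on failure."""
    if result is not None and result.exit_status == 0:
        return "verified"
    return "skipped"


def check_with_assert(result):
    """Fails immediately when the precondition does not hold."""
    assert result is not None and result.exit_status == 0, "cfstats failed"
    return "verified"


failed = SimpleNamespace(exit_status=1)  # simulated failing cfstats result
print(check_with_guard(failed))

try:
    outcome = check_with_assert(failed)
except AssertionError as exc:
    outcome = f"failed: {exc}"
print(outcome)
```

The guarded version reports nothing to the test runner when the precondition fails, while the asserting version turns the same condition into a visible failure.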

Integration test docstring typos (test_nemesis_refresh.py:20):

  • "refrash" → "refresh"
  • "locall" → "locally"

What's Good

  • The _prepare_test_table fix (adding keyspace={ks} to the cassandra-stress -schema) is a necessary correction; without it, the keyspace_refresh setting would have been silently ignored.
  • The local_cmd_runner.py user parameter addition is a clean interface fix needed for upload_sstables's sudo(..., user="scylla") call.
  • The integration test covers the actual nodetool refresh flow end-to-end, which is more reliable than unit mocks for this kind of I/O-heavy path.
  • fake_cluster.py and the test data log are correctly updated to match the new keyspace name.
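
For the first point above, a rough sketch of why the -schema option matters. The build_stress_cmd helper, its parameters, and the exact option set are assumptions for illustration and do not reproduce the command that _prepare_test_table actually builds.

```python
def build_stress_cmd(keyspace, rows=1000, replication_factor=1):
    # Hypothetical helper. Without `keyspace=...` in -schema, cassandra-stress
    # falls back to its default keyspace (keyspace1), whatever the caller meant.
    schema = f"keyspace={keyspace} replication(factor={replication_factor})"
    return f"cassandra-stress write n={rows} cl=ONE -schema '{schema}' -rate threads=10"


cmd = build_stress_cmd("keyspace_refresh")
print(cmd)
```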

@scylladb-promoter

scylladb-promoter commented Jun 29, 2026

Collaborator

✅ Test Summary: PASSED

✅ Precommit: PASSED

| Total | Passed | Failed | Skipped |
|---|---|---|---|
| 26 | 15 | 0 | 11 |

✅ Tests: PASSED

| Total | Passed | Failed | Errors | Skipped |
|---|---|---|---|---|
| 3669 | 3638 | 0 | 0 | 31 |

Full build log

@fruch fruch requested a review from a team June 29, 2026 14:31
Add test_refresh_monkey_flow that exercises the full nodetool refresh
path using a Docker-based Scylla node: writes data with cassandra-stress,
uploads sstables, runs nodetool refresh, and verifies the data is loaded
via CQL query.

Includes related changes to nemesis, sstable load utilities, and test
fixtures needed to support the refresh test flow.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@sdcm/nemesis/__init__.py`:
- Around line 1690-1705: The cfstats precheck in the nemesis flow is silently
skipping the rest of the action when `self.target_node.run_nodetool(...)`
returns a non-zero exit status, so the disruptor can appear successful without
doing any work. Update the relevant nemesis methods in
`sdcm/nemesis/__init__.py` that use this precheck to fail fast instead of
returning normally, either by raising `UnsupportedNemesis` or an assertion, and
keep the existing load/refresh logic in the successful path (`run_nodetool`,
`SstableLoadUtils.run_load_and_stream`, and the surrounding action scope). Also
apply the same behavior to the other matching disruptor block referenced by the
review.

In `@sdcm/utils/sstable/load_utils.py`:
- Around line 253-255: The validate_data_count_after_upload helper is defaulting
to the wrong table for the refresh flow. Update the default table_name in
validate_data_count_after_upload so callers that omit it validate the same table
used by the rest of the keyspace_refresh flow, and check any nearby
callers/tests that rely on the default to ensure they still point to the
intended table.

In `@unit_tests/integration/test_nemesis_refresh.py`:
- Around line 48-73: The integration test in the refresh flow skips all
validation when docker_scylla.run_nodetool("cfstats", ...) fails, so a broken
setup can still pass; update the test around the cfstats precondition to fail
immediately instead of wrapping the rest of the logic in a conditional. Use the
existing result from run_nodetool and the surrounding refresh/assertion block in
test_nemesis_refresh to either assert exit_status == 0 or raise on failure
before running the cqlsh check, upload_sstables, and run_refresh steps.

ℹ️ Review info
📥 Commits

Reviewing files that changed from the base of the PR and between a114adb and 1cc9822.

⛔ Files ignored due to path filters (1)
  • unit_tests/test_data/load_and_stream.log is excluded by !**/*.log
📒 Files selected for processing (5)
  • sdcm/nemesis/__init__.py
  • sdcm/remote/local_cmd_runner.py
  • sdcm/utils/sstable/load_utils.py
  • unit_tests/integration/test_nemesis_refresh.py
  • unit_tests/lib/fake_cluster.py

Comment thread sdcm/nemesis/__init__.py
Comment on lines +1690 to 1705
+ result = self.target_node.run_nodetool(sub_cmd="cfstats", args="keyspace_refresh.standard1")

  if result is not None and result.exit_status == 0:
      map_files_to_node = SstableLoadUtils.distribute_test_files_to_cluster_nodes(
          nodes=self.cluster.data_nodes, test_data=test_data
      )
      for sstables_info, load_on_node in map_files_to_node:
          self.actions_log.info(f"Uploading sstables to {load_on_node.name}")
-         SstableLoadUtils.upload_sstables(load_on_node, test_data=sstables_info, table_name="standard1")
+         SstableLoadUtils.upload_sstables(
+             load_on_node, test_data=sstables_info, keyspace_name="keyspace_refresh", table_name="standard1"
+         )
          # NOTE: on K8S logs may appear with a delay, so add a bigger timeout for it.
          # See https://github.com/scylladb/scylla-cluster-tests/issues/6314
          kwargs = {"start_timeout": 1800, "end_timeout": 1800} if self._is_it_on_kubernetes() else {}
          with self.action_log_scope(f"Loading and streaming sstables on {load_on_node.name} node"):
              SstableLoadUtils.run_load_and_stream(load_on_node, **kwargs)


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Fail the nemesis when the cfstats precheck fails.

Both disruptors silently return if run_nodetool("cfstats") is non-zero, so the run can be recorded as a success without loading or refreshing anything. Treat this as UnsupportedNemesis or an assertion instead of a no-op.

Also applies to: 1724-1760
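
A fail-fast version of the precheck could look like the sketch below. It assumes a locally defined UnsupportedNemesis stand-in and a hypothetical disrupt_with_precheck function that takes the cfstats call and the load step as callables; the real disruptors are methods on the nemesis class and are not reproduced here.

```python
from types import SimpleNamespace


class UnsupportedNemesis(Exception):
    """Local stand-in for the exception named in the review comment."""


def disrupt_with_precheck(run_cfstats, do_load):
    result = run_cfstats()
    if result is None or result.exit_status != 0:
        # Surface the failed precheck instead of returning as if work was done.
        raise UnsupportedNemesis("cfstats precheck failed for keyspace_refresh.standard1")
    do_load()


loaded = []
disrupt_with_precheck(lambda: SimpleNamespace(exit_status=0), lambda: loaded.append("ok"))

try:
    disrupt_with_precheck(lambda: SimpleNamespace(exit_status=1), lambda: loaded.append("bad"))
    skipped = False
except UnsupportedNemesis:
    skipped = True

print(loaded, skipped)
```

Raising keeps the load logic on the successful path only and makes a failed precheck visible in the nemesis result instead of recording a no-op as success.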

Comment on lines +253 to +255
def validate_data_count_after_upload(
    cls, node, keyspace_name: str = "keyspace_refresh", table_name: str = "standard2"
):


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

validate_data_count_after_upload() now defaults to the wrong table.

Everything else in this flow was moved to keyspace_refresh.standard1, but this helper now defaults to keyspace_refresh.standard2. Any caller that relies on defaults will validate a different table than the one loaded/refreshed and can report a false result.
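
A mismatch like this could be caught by a small consistency check over the helper signatures. The sketch below assumes a hypothetical LoadUtilsSketch class that carries only the two signatures under discussion; it does not import or reproduce the real SstableLoadUtils.

```python
import inspect


class LoadUtilsSketch:
    """Hypothetical stand-in carrying only the signatures under discussion."""

    @staticmethod
    def upload_sstables(node, keyspace_name="keyspace_refresh", table_name="standard1"): ...

    @staticmethod
    def validate_data_count_after_upload(node, keyspace_name="keyspace_refresh", table_name="standard2"): ...


def default_targets(cls, names):
    # Collect the (keyspace, table) defaults declared by each helper.
    targets = {}
    for name in names:
        params = inspect.signature(getattr(cls, name)).parameters
        targets[name] = (params["keyspace_name"].default, params["table_name"].default)
    return targets


targets = default_targets(LoadUtilsSketch, ["upload_sstables", "validate_data_count_after_upload"])
mismatched = len(set(targets.values())) > 1  # True when helpers disagree on the target
print(targets, mismatched)
```

A unit test asserting that mismatched is False would flag the standard2 default before it reaches callers.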

Comment on lines +48 to +73
result = docker_scylla.run_nodetool(sub_cmd="cfstats", args="keyspace_refresh.standard1")

if result is not None and result.exit_status == 0:
    key = "0x32373131364f334f3830"
    # Check one special key before refresh, we will verify refresh by query in the end
    # Note: we can't DELETE the key before refresh, otherwise the old sstable won't be loaded
    # TRUNCATE can be used the clean the table, but we can't do it for keyspace_refresh.standard1
    query_verify = f"SELECT * FROM keyspace_refresh.standard1 WHERE key={key}"
    result = docker_scylla.run_cqlsh(query_verify)
    if "(0 rows)" in result.stdout:
        logging.debug("Key %s does not exist before refresh", key)
    else:
        logging.debug("Key %s already exists before refresh", key)

    # Executing rolling refresh one by one
    for node in [docker_scylla]:
        SstableLoadUtils.upload_sstables(
            node,
            test_data=test_data[0],
            table_name="standard1",
            is_cloud_cluster=False,
        )
        SstableLoadUtils.run_refresh(node, test_data=test_data[0])
    # Verify that the special key is loaded by SELECT query
    result = docker_scylla.run_cqlsh(query_verify)
    assert "(1 rows)" in result.stdout, f"The key {key} is not loaded by `nodetool refresh`"


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

This test passes without testing anything when cfstats fails.

The refresh path and final assertion are skipped entirely unless cfstats returns zero. A broken setup therefore reports green CI instead of a failed test. Assert the precondition or fail immediately.

🧰 Tools
🪛 Ruff (0.15.18)

[error] 55-55: Possible SQL injection vector through string-based query construction

(S608)

@fruch fruch added the test-integration Enable running the integration tests suite label Jun 29, 2026