fix(nemesis): increase CREATE_MV and CREATE_INDEX soft timeouts to 5 hours #15177

Open
yarongilor wants to merge 1 commit into scylladb:master from yarongilor:fix_create_index_timeout

Conversation

@yarongilor

@yarongilor yarongilor commented Jun 23, 2026

Contributor

Observed CREATE_MV taking ~15453s and CREATE_INDEX (wait for index build) taking ~15857s, both exceeding the previous 14400s (4h) soft timeout and triggering SoftTimeoutEvent followed by FailedResultEvent from Argus validation.

Increase the soft timeout from 14400s to 18000s (5h) for all three adaptive_timeout calls in disrupt_create_index, disrupt_add_remove_mv, and disrupt_add_drop_mv_with_node_restarts.

Fixes: https://scylladb.atlassian.net/browse/SCT-353
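
For reference, the figures above can be checked with a few lines of Python. This is only an illustrative sketch of the arithmetic: the helper names `headroom` and `as_hms` are made up for this note and are not part of SCT.

```python
# Durations and limits quoted in the description, in seconds.
OLD_SOFT_TIMEOUT = 14400  # 4h
NEW_SOFT_TIMEOUT = 18000  # 5h
OBSERVED = {"CREATE_MV": 15453, "CREATE_INDEX": 15857}


def headroom(duration, limit):
    """Seconds left before the limit; negative means the limit was exceeded."""
    return limit - duration


def as_hms(seconds):
    """Format a non-negative number of seconds as XhYYmZZs."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}h{minutes:02d}m{secs:02d}s"


for operation, duration in OBSERVED.items():
    print(
        operation,
        as_hms(duration),
        "old headroom:", headroom(duration, OLD_SOFT_TIMEOUT),
        "new headroom:", headroom(duration, NEW_SOFT_TIMEOUT),
    )
```

Both observed runs overshoot the 4h limit by under half an hour and would fit inside the proposed 5h limit with more than 35 minutes to spare.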

Testing

  • [ ]

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

Reminders

  • Add New configuration option and document them (in sdcm/sct_config.py)
  • Add unit tests to cover my changes (under unit-test/ folder)
  • Update the Readme/doc folder relevant to this change (if needed)

@coderabbitai

coderabbitai Bot commented Jun 23, 2026


Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: d8811763-d0b1-40ef-965a-2fd8434b7b68

📥 Commits

Reviewing files that changed from the base of the PR and between 069cddd and ae561de.

📒 Files selected for processing (1)
  • sdcm/nemesis/__init__.py
✅ Files skipped from review due to trivial changes (1)
  • sdcm/nemesis/__init__.py

📝 Walkthrough

Walkthrough

This change increases the adaptive_timeout wait limit from 14400 seconds to 18000 seconds in three nemesis flows: index creation, materialized view creation during add/remove MV disruption, and materialized view creation in the MV-building coordinator restart loop.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

Suggested labels

Bug

Suggested reviewers

  • pehala
  • jsmolar
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | Title accurately describes the main change: increasing soft timeouts for CREATE_MV and CREATE_INDEX operations from 4 to 5 hours. |
| Description check | ✅ Passed | Description includes problem statement, solution, affected functions, Jira reference, and completed self-review checkboxes, though testing section is empty. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@scylladb-promoter

scylladb-promoter commented Jun 23, 2026

Collaborator

✅ Test Summary: PASSED

✅ Precommit: PASSED

| Total | Passed | Failed | Skipped |
| --- | --- | --- | --- |
| 26 | 15 | 0 | 11 |

✅ Tests: PASSED

| Total | Passed | Failed | Errors | Skipped |
| --- | --- | --- | --- | --- |
| 3654 | 3623 | 0 | 0 | 31 |

Full build log

Comment thread unit_tests/unit/nemesis/monkey/test_network.py

@fruch fruch left a comment

Contributor


https://scylladb.atlassian.net/browse/SCT-353 is a collection of reports with all kinds of causes, from different releases.

I don't see in any of them an analysis suggesting we should wait longer. We should focus on what is slowing it down (disk/CPU/tombstones) and fix the source of the problem; waiting longer is a waste of our time and resources.

We are working on scaling those waits by a factor based on the feature, and these timeouts should be aligned with that as well. They shouldn't be increased on every slight issue.

@yarongilor yarongilor force-pushed the fix_create_index_timeout branch from 069cddd to ae561de Compare June 23, 2026 14:45
@yarongilor

Contributor Author

@fruch,

  1. There are some long-running, stalled discussions about MV performance, such as https://scylladb.atlassian.net/browse/SCYLLADB-2851.
  2. The issue reproduces across multiple Scylla versions, so it is not a new regression on master.
  3. For now, I see no point in letting these errors keep failing tests, since they add no value and affect multiple versions and pipelines.
  4. I can open a new task to investigate the add-remove-mv nemesis and its results more deeply, to find a clearer failure pattern.
  5. This PR is not dramatic: it only allows MV building to run a bit longer than the current 4 hours (up to 5 hours).
  6. Some reproduction reports on https://scylladb.atlassian.net/browse/SCT-353 are not relevant, so I added a comment there about that as well.
  7. We might want to discuss with Dev an agreed way to prove a real MV performance issue.
  8. So we have to decide whether to merge this one or put it on hold until we have more resolution.

@fruch

fruch commented Jun 23, 2026

Contributor

The operation already continues; you are only getting notifications about the soft limit, i.e. it tells you exactly how long the operation ran past the soft timeout.

Changing this would bury the discussions.

It's clear in some of the reproduction runs that the case doesn't have enough CPU, and the MV build probably doesn't have enough resources to finish in a timely fashion. We shouldn't run longevity at close to 100% CPU utilization; it's one of many things that wouldn't work.
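
To make the soft-limit behaviour concrete, here is a minimal sketch of a soft timeout that never interrupts the wrapped operation and only records by how much the limit was exceeded. It is not the SCT `adaptive_timeout` implementation: `soft_timeout`, the `events` list and the injectable `clock` are hypothetical names used only for illustration.

```python
import time
from contextlib import contextmanager


@contextmanager
def soft_timeout(operation, timeout, events, clock=time.monotonic):
    """Hypothetical soft timeout: the body always runs to completion.

    If the body took longer than `timeout` seconds, a tuple of
    (operation, elapsed, overrun) is appended to `events`.
    """
    start = clock()
    yield  # the wrapped operation is never cancelled here
    elapsed = clock() - start
    if elapsed > timeout:
        events.append((operation, elapsed, elapsed - timeout))


# Usage with a fake clock so the example finishes instantly:
# the "operation" appears to take 15453 seconds.
ticks = iter([0.0, 15453.0])
events = []
with soft_timeout("CREATE_MV", 14400, events, clock=lambda: next(ticks)):
    pass  # stand-in for waiting on the view build
print(events)
```

Under these assumptions, raising the limit from 14400 to 18000 does not change how long the operation runs; it only changes whether an overrun is reported.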
