fix(nemesis): increase CREATE_MV and CREATE_INDEX soft timeouts to 5 hours by yarongilor · Pull Request #15177 · scylladb/scylla-cluster-tests

yarongilor · 2026-06-23T13:08:01Z

Observed CREATE_MV taking ~15453s and CREATE_INDEX (wait for index build) taking ~15857s, both exceeding the previous 14400s (4h) soft timeout and triggering SoftTimeoutEvent followed by FailedResultEvent from Argus validation.

Increase the soft timeout from 14400s to 18000s (5h) for all three adaptive_timeout calls in disrupt_create_index, disrupt_add_remove_mv, and disrupt_add_drop_mv_with_node_restarts.

Fixes: https://scylladb.atlassian.net/browse/SCT-353
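
As a quick check of the numbers above, the sketch below compares the two observed durations against the old and new soft timeouts. The constant and function names are illustrative only and are not taken from sdcm/nemesis/__init__.py.

```python
# Illustrative arithmetic only; all names here are hypothetical, not SCT code.
OLD_SOFT_TIMEOUT_S = 14400  # 4 hours
NEW_SOFT_TIMEOUT_S = 18000  # 5 hours

# Approximate durations reported in the PR description, in seconds.
observed = {"CREATE_MV": 15453, "CREATE_INDEX": 15857}


def overshoot(duration_s: int, soft_timeout_s: int) -> int:
    """Return how many seconds a duration exceeds the soft timeout (0 if within it)."""
    return max(0, duration_s - soft_timeout_s)


for name, duration in observed.items():
    print(
        f"{name}: {duration / 3600:.2f}h, "
        f"over old limit by {overshoot(duration, OLD_SOFT_TIMEOUT_S)}s, "
        f"over new limit by {overshoot(duration, NEW_SOFT_TIMEOUT_S)}s"
    )
```

Both runs overshoot the 4h limit by 1053s and 1457s respectively (roughly 18 and 24 minutes) and would fit within the proposed 5h limit.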

Testing

[ ]

PR pre-checks (self review)

I added the relevant backport labels
I didn't leave commented-out/debugging code

Reminders

Add New configuration option and document them (in sdcm/sct_config.py)
Add unit tests to cover my changes (under unit-test/ folder)
Update the Readme/doc folder relevant to this change (if needed)

coderabbitai · 2026-06-23T13:12:03Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: d8811763-d0b1-40ef-965a-2fd8434b7b68

📥 Commits

Reviewing files that changed from the base of the PR and between 069cddd and ae561de.

📒 Files selected for processing (1)

sdcm/nemesis/__init__.py

✅ Files skipped from review due to trivial changes (1)

sdcm/nemesis/__init__.py

📝 Walkthrough

This change increases the adaptive_timeout wait limit from 14400 seconds to 18000 seconds in three nemesis flows: index creation, materialized view creation during add/remove MV disruption, and materialized view creation in the MV-building coordinator restart loop.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

Suggested labels

Bug

Suggested reviewers

pehala
jsmolar

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	Title accurately describes the main change: increasing soft timeouts for CREATE_MV and CREATE_INDEX operations from 4 to 5 hours.
Description check	✅ Passed	Description includes problem statement, solution, affected functions, Jira reference, and completed self-review checkboxes, though testing section is empty.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

scylladb-promoter · 2026-06-23T13:16:42Z

✅ Test Summary: PASSED

✅ Precommit: PASSED

Total	Passed	Failed	Skipped
26	15	0	11

✅ Tests: PASSED

Total	Passed	Failed	Errors	Skipped
3654	3623	0	0	31

Full build log

fruch

https://scylladb.atlassian.net/browse/SCT-353 is a bunch of reports from all kinds of causes and releases.

I don't see in any of them an analysis suggesting we should wait longer. We should focus on what is slowing it down (disk/CPU/tombstones) and fix the source of the problem; waiting longer is a waste of our time and resources.

We are doing work to scale those waits by some factor based on the feature, and these should be aligned with that as well; the timeout shouldn't be increased on every slight issue.

yarongilor · 2026-06-23T15:54:40Z

@fruch,

There are some long-running, stalled discussions about MV performance, such as https://scylladb.atlassian.net/browse/SCYLLADB-2851.
The issue reproduces across multiple Scylla versions, so it's not a new regression on master.
For now I see no point in continuing to get these errors failing tests, since they add no value and affect multiple versions/pipelines.
I can open a new task to investigate the add-remove-mv nemesis and its results more deeply, to see a clearer pattern in the failures.
This PR is not too dramatic; it only allows MV building to last a bit longer than the current 4 hours (up to 5 hours).
Some reproduction reports on https://scylladb.atlassian.net/browse/SCT-353 are not relevant, so I added a comment about that as well.
We might want to discuss with Dev what would be an agreed way to prove a real MV performance issue.
So we have to decide whether it's better to merge this one or put it on hold until we get more resolution.

fruch · 2026-06-23T16:19:17Z

It does continue now; you are just getting notifications about the soft limit.
I.e. it tells you exactly how long it ran past the soft timeout.

Changing this would bury the discussion.

It's clear in some of the reproducer runs that the test case doesn't have enough CPU, and the MV build probably doesn't have enough resources to finish in a timely fashion. We shouldn't run longevity tests at close to 100% CPU utilization; MV building is one of many things that wouldn't work under that load.
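
The soft-timeout behaviour described in this comment can be sketched roughly as follows. This is a minimal stand-in, not SCT's adaptive_timeout: the context manager `soft_timeout`, the `SoftTimeoutExceeded` record, and the `events` list are all hypothetical names. It only illustrates that the wrapped operation is never interrupted and that the overshoot is reported once it finishes.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class SoftTimeoutExceeded:
    """Hypothetical record standing in for an event such as SoftTimeoutEvent."""
    operation: str
    soft_timeout_s: float
    duration_s: float

    @property
    def overshoot_s(self) -> float:
        # How long the operation ran past the soft limit.
        return self.duration_s - self.soft_timeout_s


@contextmanager
def soft_timeout(operation, soft_timeout_s, events, clock=time.monotonic):
    """Let the wrapped block run to completion, then report if it ran too long."""
    start = clock()
    try:
        yield
    finally:
        duration = clock() - start
        if duration > soft_timeout_s:
            events.append(SoftTimeoutExceeded(operation, soft_timeout_s, duration))


# Usage with a fake clock so the example finishes instantly.
fake_times = iter([0.0, 15857.0])
events = []
with soft_timeout("CREATE_INDEX", 14400, events, clock=lambda: next(fake_times)):
    pass  # the index build would happen here; it is never interrupted

print(events[0].overshoot_s)  # 1457.0
```

With this shape, raising the limit from 14400s to 18000s does not change how long the operation runs; it only changes whether an event is recorded for it.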

yarongilor requested a review from pehala June 23, 2026 13:08

yarongilor added backport/2025.4 backport/2026.1 backport/2026.2 labels Jun 23, 2026

github-actions Bot assigned yarongilor Jun 23, 2026

github-actions Bot added the P3 Medium Priority label Jun 23, 2026

yarongilor requested review from cezarmoise and jsmolar June 23, 2026 13:10

fruch reviewed Jun 23, 2026

Comment thread unit_tests/unit/nemesis/monkey/test_network.py

fruch requested changes Jun 23, 2026

yarongilor force-pushed the fix_create_index_timeout branch from 069cddd to ae561de Compare June 23, 2026 14:45

yarongilor wants to merge 1 commit into scylladb:master from yarongilor:fix_create_index_timeout


3 participants
