Skip to content

fix(tier1): Restore tuned tier1 instances#15208

Open
pehala wants to merge 1 commit into
scylladb:masterfrom
pehala:revert_tier1_sizinh
Open

fix(tier1): Restore tuned tier1 instances#15208
pehala wants to merge 1 commit into
scylladb:masterfrom
pehala:revert_tier1_sizinh

Conversation

@pehala

@pehala pehala commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

For tests that were tuned to specific disk usage, restore the hardcoded instances so the tuning is adhered to now and in the future. I do not want these tests changing without us knowing. Leave the sizing info so it can be still used cross cloud, just the tuning probably wont be exact, but that is expected.

To make these cases use dynamic sizing, we need to parametrize command based on disk size, which is possible but will require additional effort (plan in progress)

Testing

  • none

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

For tests that were tuned to specific disk usage,
restore the hardcoded instances so it adhered to
@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Copy link
Copy Markdown

Warning

Review limit reached

@pehala, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 15 minutes and 37 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 298bb544-c637-431e-ba34-6a36551f8ac2

📥 Commits

Reviewing files that changed from the base of the PR and between dc81ff6 and fda7a34.

📒 Files selected for processing (4)
  • configurations/2TB_high_disk_used_capacity.yaml
  • test-cases/features/elasticity/elasticity-90-percent-with-nemesis.yaml
  • test-cases/longevity/longevity-multidc-parallel-topology-schema-changes-12h.yaml
  • test-cases/longevity/longevity-twcs-48h.yaml
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@scylladb-promoter

Copy link
Copy Markdown
Collaborator

✅ Test Summary: PASSED

✅ Precommit: PASSED

Total Passed Failed Skipped
26 15 0 11

✅ Tests: PASSED

Total Passed Failed Errors Skipped
3667 3636 0 0 31

Full build log

"cassandra-stress user profile=/tmp/cs_mv_profile.yaml ops'(insert=3,read1=1,read2=1,read3=1)' cl=QUORUM duration=6800m -mode cql3 native -rate threads=10",
"cassandra-stress read cl=QUORUM duration=6800m -mode cql3 native -rate threads=10 -pop seq=1..1840000000 -log interval=5 -col 'size=FIXED(200) n=FIXED(5)'" ]

instance_type_db: 'i8g.2xlarge'

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps the naming of 2TB_high_disk_used_capacity.yaml is not ideal, since it really just defines a 2TB dataset, that can possibly be used for multiple setups.

perhaps it would make more sense to hardcode the instance type in test-cases/longevity/longevity-1TB-5days-authorization-and-tls-ssl.yaml that is the base-configuration of this "extention" file.

and also , it's less meaningful to set instance where no n_db_nodes parameter is found and is taken from another file.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then separate this.


n_db_nodes: 6
n_loaders: 3
instance_type_db: 'i8ge.xlarge'

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it needed?
What's tuned in this PR?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cezarmoise As a owner of the testcase, can you elaborate more?


region_aware_loader: true
simulated_racks: 0
instance_type_db: 'i7i.xlarge'

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it should be set, it doesn't really significant if the size is bigger.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cezarmoise As a owner of the testcase, do you agree?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is this configuration being used?
What longevities? which tests?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it overriding a test that is called 1TB with a 2TB configuration?
Also, I don't understand why we need to pin it to a specific type.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. it was set to i7i as we include i7i support to test it more regularly
  2. someone scale up the test with more data to get to more disk utilization (new pipeline)
  3. we introduced the sizing option to all tier1 cases, base on the instance value it was (including x86)

@pehala suggest pinning it cause he's scared the sizing would end up selecting other types or sized that would break the assumption about disk usage

one can argue if it's important or not, but the main goal of the sizing is preserved, i.e. other backend would be picking good enough instances, even if disk usage pattern would be a different.

so it's o.k. to use sizing and fixed in tandam (it's per it's design), the only risk it might drift when doing changes to either one, and if it's critical enough to keep the disk usage ratio ontop of the flexibility of change defaults of the sizing feature.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I don't think it's critical to keep the disk ratio as long as it's not too small.
  2. I rather not mixing up the two approaches and stick only with the new approach, avoiding as much as possible from fixing a test to instance type.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's critical to keep the disk ratio as long as it's not too small.

I do, cluster has to work harder if it has more data and thus the testcase is better, besides we want to decrease TCO and thus we size customers clusters to have much more data -> so we should test more data as well.

I rather not mixing up the two approaches and stick only with the new approach, avoiding as much as possible from fixing a test to instance type.

There are fully compatible and use advantages of both and this is still only short term measure, we want to parametrize CS commands to fill it based on available disk size in the long term.

TLDR: There is no harm done with this PR, Tier 1 will be still working on ALL clouds

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be (and I believe there are) scenarios that are size specific.

We also aim for instance type agnostic, so we can switch instance types as we want and/or based on availability.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be (and I believe there are) scenarios that are size specific.

We also aim for instance type agnostic, so we can switch instance types as we want and/or based on availability.

Yes, but currently instance type agnostic is incompatible with size specific, which why this patch exists in the first place. I want to keep Tier 1 tests as size specific and add a feature which will make them type agnostic later. It is still agnostic on other clouds and it can run there, so again no harm done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants