fix(gce provision): detect partial provisioning and escalate to CRITICAL severity by juliayakovlev · Pull Request #15248 · scylladb/scylla-cluster-tests

juliayakovlev · 2026-06-29T08:19:05Z

When pre-provisioning partially fails (e.g. GCE zone exhaustion after creating some instances), the test setUp would silently continue with fewer nodes than requested. This happened because add_nodes found the partially-created instances via _get_instances but never validated their count against the requested count.

Root cause: the GCE instance provider uses parallel API inserts, so some instances can be created before a ZoneResourcesExhaustedError terminates the batch. These orphaned instances persist in GCE with matching TestId tags. When setUp later calls add_nodes, it discovers them via _get_instances and proceeds without checking the count.

Additionally, provisioning failures raised during setUp were published as TestFrameworkEvent with default ERROR severity, which does not trigger the EventsAnalyzer interrupt mechanism (only CRITICAL does).

Changes:

cluster_gce.py: validate len(instances) >= count when pre-provisioned instances are found; raise ProvisionError if fewer than expected
cluster_aws.py: add the same count validation on the non-REUSE pre-provisioned path (same bug pattern)
tester.py: escalate ProvisionError and ProvisionUnrecoverableError to CRITICAL severity in teardown_on_exception so the EventsAnalyzer can interrupt the test run
Add unit tests covering all three fixes

Fixes: https://scylladb.atlassian.net/browse/SCT-501

Testing

artifacts-gce-image-test
artifacts-ami-test
AWS provision
GCE provision
longevity with 9 nodes - all nodes were created, the issue has not been reproduced

PR pre-checks (self review)

I added the relevant backport labels
I didn't leave commented-out/debugging code

Reminders

Add New configuration option and document them (in sdcm/sct_config.py)
Add unit tests to cover my changes (under unit-test/ folder)
Update the Readme/doc folder relevant to this change (if needed)

Copilot

Pull request overview

This PR prevents SCT from silently continuing a test run with fewer nodes than requested when cloud pre-provisioning partially succeeds, and ensures provisioning-related failures are emitted as CRITICAL events so the EventsAnalyzer can interrupt the run.

Changes:

Add guards in GCE add_nodes() and AWS _create_or_find_instances() to validate discovered pre-provisioned instance count is >= requested count, otherwise raise ProvisionError.
Escalate ProvisionError / ProvisionUnrecoverableError to Severity.CRITICAL when publishing TestFrameworkEvent from teardown_on_exception.
Add unit tests covering the new count validation and severity escalation behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.

File	Description
`sdcm/cluster_gce.py`	Raises `ProvisionError` if fewer pre-provisioned instances are found than requested.
`sdcm/cluster_aws.py`	Adds the same partial pre-provisioning detection/guard for AWS.
`sdcm/tester.py`	Publishes provisioning failures with CRITICAL severity from `teardown_on_exception`.
`unit_tests/unit/test_partial_provision_guard.py`	New unit tests for GCE/AWS count validation and teardown severity escalation.

scylladb-promoter · 2026-06-29T08:32:39Z

✅ Test Summary: PASSED

✅ Precommit: PASSED

Total	Passed	Failed	Skipped
26	15	0	11

✅ Tests: PASSED

Total	Passed	Failed	Errors	Skipped
3679	3648	0	0	31

Full build log

…CT-501) When pre-provisioning partially fails (e.g. GCE zone exhaustion after creating some instances), the test setUp would silently continue with fewer nodes than requested. This happened because add_nodes found the partially-created instances via _get_instances but never validated their count against the requested count. Root cause: the GCE instance provider uses parallel API inserts, so some instances can be created before a ZoneResourcesExhaustedError terminates the batch. These orphaned instances persist in GCE with matching TestId tags. When setUp later calls add_nodes, it discovers them via _get_instances and proceeds without checking the count. Additionally, provisioning failures raised during setUp were published as TestFrameworkEvent with default ERROR severity, which does not trigger the EventsAnalyzer interrupt mechanism (only CRITICAL does). Changes: - cluster_gce.py: validate len(instances) >= count when pre-provisioned instances are found; raise ProvisionError if fewer than expected - cluster_aws.py: add the same count validation on the non-REUSE pre-provisioned path (same bug pattern) - tester.py: escalate ProvisionError and ProvisionUnrecoverableError to CRITICAL severity in teardown_on_exception so the EventsAnalyzer can interrupt the test run - Add unit tests covering all three fixes

cezarmoise

I see that oci/aws also have something like instances = self._create_instances(count.... Why not add the check there too?

juliayakovlev · 2026-06-29T16:20:54Z

I see that oci/aws also have something like instances = self._create_instances(count.... Why not add the check there too?

@cezarmoise
The count check is added already in the cluster_aws.
Other backends code flow is different than GCE/AWS

coderabbitai · 2026-06-29T16:27:12Z

📝 Walkthrough

B{isinstance ProvisionError or ProvisionUnrecoverableError?}

B -->|Yes| C[severity = Severity.CRITICAL]
B -->|No| D{exc has severity attr?}
D -->|Yes| E[severity = exc.severity]
D -->|No| F[severity = None]
C --> G[TestFrameworkEvent published]
E --> G
F --> G

</diagram>
</layer>
<layer id="layer_tests" title="Unit tests for provision guards and severity escalation" depends_on="layer_severity">
<summary>New test module covers GCE count-mismatch ProvisionError, GCE zero-instances fallback to _create_instances, AWS count-mismatch ProvisionError, and teardown_on_exception severity mapping for various exception types.</summary>
<ranges>
range_56d6f033a8c5
range_55bdf3f8538a
range_2d8b7c8ddc5c
range_0e0b36c90eab
range_9836d887c504
range_8f795e5acc40
</ranges>
</layer>
</cohort>
<unassigned_ranges>
</unassigned_ranges>
-->

## Walkthrough

Adds count-validation guards in `AWSCluster._create_or_find_instances` and `GCECluster.add_nodes` that raise `ProvisionError` when the number of discovered pre-provisioned instances is less than the requested count. Updates `teardown_on_exception` in `tester.py` to map provisioning exceptions (`ProvisionError`, `ProvisionUnrecoverableError`) to `Severity.CRITICAL` when emitting `TestFrameworkEvent`. A new unit test module covers all added behaviors.

## Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

## Possibly related PRs

- [scylladb/scylla-cluster-tests#15122](https://github.com/scylladb/scylla-cluster-tests/pull/15122): Directly modifies the same `teardown_on_exception` function in `sdcm/tester.py` to change provisioning failure reporting behavior.

## Suggested labels

`Bug`, `Infrastructure`, `GCE`

## Suggested reviewers

- fruch
- dimakr
- pehala

</details>

<!-- walkthrough_end -->
<!-- pre_merge_checks_walkthrough_start -->

<details>
<summary>🚥 Pre-merge checks | ✅ 4 | ❌ 1</summary>

### ❌ Failed checks (1 warning)

|     Check name     | Status     | Explanation                                                                           | Resolution                                                                         |
| :----------------: | :--------- | :------------------------------------------------------------------------------------ | :--------------------------------------------------------------------------------- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 65.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

<details>
<summary>✅ Passed checks (4 passed)</summary>

|         Check name         | Status   | Explanation                                                                                                      |
| :------------------------: | :------- | :--------------------------------------------------------------------------------------------------------------- |
|         Title check        | ✅ Passed | Title is concise and accurately captures the main fixes: partial provisioning detection and severity escalation. |
|      Description check     | ✅ Passed | Description follows the template and includes testing, self-review checklists, and reminders.                    |
|     Linked Issues check    | ✅ Passed | Check skipped because no linked issues were found for this pull request.                                         |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.                                         |

</details>

</details>

<!-- pre_merge_checks_walkthrough_end -->
<!-- finishing_touch_checkbox_start -->

<details>
<summary>✨ Finishing Touches</summary>

<details>
<summary>🧪 Generate unit tests (beta)</summary>

- [ ] <!-- {"checkboxId": "f47ac10b-58cc-4372-a567-0e02b2c3d479", "radioGroupId": "utg-output-choice-group-unknown_comment_id"} -->   Create PR with unit tests

</details>

</details>

<!-- finishing_touch_checkbox_end -->
<!-- tips_start -->

---

Thanks for using [CodeRabbit](https://coderabbit.ai?utm_source=oss&utm_medium=github&utm_campaign=scylladb/scylla-cluster-tests&utm_content=15248)! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

<details>
<summary>❤️ Share</summary>

- [X](https://twitter.com/intent/tweet?text=I%20just%20used%20%40coderabbitai%20for%20my%20code%20review%2C%20and%20it%27s%20fantastic%21%20It%27s%20free%20for%20OSS%20and%20offers%20a%20free%20trial%20for%20the%20proprietary%20code.%20Check%20it%20out%3A&url=https%3A//coderabbit.ai)
- [Mastodon](https://mastodon.social/share?text=I%20just%20used%20%40coderabbitai%20for%20my%20code%20review%2C%20and%20it%27s%20fantastic%21%20It%27s%20free%20for%20OSS%20and%20offers%20a%20free%20trial%20for%20the%20proprietary%20code.%20Check%20it%20out%3A%20https%3A%2F%2Fcoderabbit.ai)
- [Reddit](https://www.reddit.com/submit?title=Great%20tool%20for%20code%20review%20-%20CodeRabbit&text=I%20just%20used%20CodeRabbit%20for%20my%20code%20review%2C%20and%20it%27s%20fantastic%21%20It%27s%20free%20for%20OSS%20and%20offers%20a%20free%20trial%20for%20proprietary%20code.%20Check%20it%20out%3A%20https%3A//coderabbit.ai)
- [LinkedIn](https://www.linkedin.com/sharing/share-offsite/?url=https%3A%2F%2Fcoderabbit.ai&mini=true&title=Great%20tool%20for%20code%20review%20-%20CodeRabbit&summary=I%20just%20used%20CodeRabbit%20for%20my%20code%20review%2C%20and%20it%27s%20fantastic%21%20It%27s%20free%20for%20OSS%20and%20offers%20a%20free%20trial%20for%20proprietary%20code)

</details>


<sub>Comment `@coderabbitai help` to get the list of available commands.</sub>

<!-- tips_end -->

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

sdcm/cluster_gce.py (1)
659-668: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Enforce exact instance counts on both branches.

The new guard only runs for the non-REUSE_CLUSTER path and only for len(instances) < count. A reused cluster with too few instances still proceeds, and len(instances) > count also proceeds; in the latter case Lines 683-693 create too many nodes while _node_index is only incremented by count.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@sdcm/cluster_gce.py` around lines 659 - 668, The instance count validation in
the cluster provisioning flow is incomplete: the REUSE_CLUSTER branch in the
instance lookup/provisioning path allows too few nodes, and the non-reuse branch
only checks for fewer-than-requested instances. Update the logic around the
instance retrieval in the relevant cluster GCE method so both branches enforce
an exact match between the pre-provisioned instances and the requested count,
and fail when there are either fewer or more instances than expected before any
node creation or _node_index advancement happens.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@sdcm/cluster_aws.py`:
- Around line 515-522: The pre-provisioned instance selection in _get_instances
currently rejects only shortages and returns the full list on overages, which
lets extra instances leak into AWSCluster.add_nodes. Update the instance count
check in cluster_aws.py so it enforces an exact match against count, raising
ProvisionError when len(instances) is not equal to count, and keep the return
path only for the exact-count case.

---

Outside diff comments:
In `@sdcm/cluster_gce.py`:
- Around line 659-668: The instance count validation in the cluster provisioning
flow is incomplete: the REUSE_CLUSTER branch in the instance lookup/provisioning
path allows too few nodes, and the non-reuse branch only checks for
fewer-than-requested instances. Update the logic around the instance retrieval
in the relevant cluster GCE method so both branches enforce an exact match
between the pre-provisioned instances and the requested count, and fail when
there are either fewer or more instances than expected before any node creation
or _node_index advancement happens.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: d81799c8-6a3c-4470-8fdf-79413b881b6e

📥 Commits

Reviewing files that changed from the base of the PR and between 20a3a25 and 4b16bf4.

📒 Files selected for processing (4)

sdcm/cluster_aws.py
sdcm/cluster_gce.py
sdcm/tester.py
unit_tests/unit/test_partial_provision_guard.py

juliayakovlev self-assigned this Jun 29, 2026

juliayakovlev added test-provision-aws Run provision test on AWS test-provision-gce Run provision test on GCE backport/2025.1 backport/2025.4 backport/2026.1 backport/2026.2 labels Jun 29, 2026

juliayakovlev requested review from cezarmoise and Copilot June 29, 2026 08:19

github-actions Bot added the P2 High Priority label Jun 29, 2026

Copilot started reviewing on behalf of juliayakovlev June 29, 2026 08:20 View session

Copilot AI reviewed Jun 29, 2026

View reviewed changes

juliayakovlev force-pushed the fix/sct-501-partial-provision-continues-test branch 2 times, most recently from 2559dd6 to afed04d Compare June 29, 2026 08:30

juliayakovlev added the ai-assisted label Jun 29, 2026

cezarmoise reviewed Jun 29, 2026

View reviewed changes

juliayakovlev force-pushed the fix/sct-501-partial-provision-continues-test branch from afed04d to 4b16bf4 Compare June 29, 2026 09:06

juliayakovlev added test-provision-gce Run provision test on GCE and removed test-provision-gce Run provision test on GCE test-provision-aws Run provision test on AWS labels Jun 29, 2026

juliayakovlev marked this pull request as ready for review June 29, 2026 16:22

juliayakovlev requested a review from fruch as a code owner June 29, 2026 16:22

coderabbitai Bot reviewed Jun 29, 2026

View reviewed changes

Comment thread sdcm/cluster_aws.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(gce provision): detect partial provisioning and escalate to CRITICAL severity#15248

fix(gce provision): detect partial provisioning and escalate to CRITICAL severity#15248
juliayakovlev wants to merge 1 commit into
scylladb:masterfrom
juliayakovlev:fix/sct-501-partial-provision-continues-test

juliayakovlev commented Jun 29, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

scylladb-promoter commented Jun 29, 2026 •

edited

Loading

Uh oh!

cezarmoise left a comment

Uh oh!

juliayakovlev commented Jun 29, 2026

Uh oh!

coderabbitai Bot commented Jun 29, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

juliayakovlev commented Jun 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Testing

PR pre-checks (self review)

Reminders

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

scylladb-promoter commented Jun 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Test Summary: PASSED

✅ Precommit: PASSED

✅ Tests: PASSED

Uh oh!

cezarmoise left a comment

Choose a reason for hiding this comment

Uh oh!

juliayakovlev commented Jun 29, 2026

Uh oh!

coderabbitai Bot commented Jun 29, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

juliayakovlev commented Jun 29, 2026 •

edited

Loading

scylladb-promoter commented Jun 29, 2026 •

edited

Loading