Skip to content

fix(gce provision): detect partial provisioning and escalate to CRITICAL severity#15248

Open
juliayakovlev wants to merge 1 commit into
scylladb:masterfrom
juliayakovlev:fix/sct-501-partial-provision-continues-test
Open

fix(gce provision): detect partial provisioning and escalate to CRITICAL severity#15248
juliayakovlev wants to merge 1 commit into
scylladb:masterfrom
juliayakovlev:fix/sct-501-partial-provision-continues-test

Conversation

@juliayakovlev

@juliayakovlev juliayakovlev commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

When pre-provisioning partially fails (e.g. GCE zone exhaustion after creating some instances), the test setUp would silently continue with fewer nodes than requested. This happened because add_nodes found the partially-created instances via _get_instances but never validated their count against the requested count.

Root cause: the GCE instance provider uses parallel API inserts, so some instances can be created before a ZoneResourcesExhaustedError terminates the batch. These orphaned instances persist in GCE with matching TestId tags. When setUp later calls add_nodes, it discovers them via _get_instances and proceeds without checking the count.

Additionally, provisioning failures raised during setUp were published as TestFrameworkEvent with default ERROR severity, which does not trigger the EventsAnalyzer interrupt mechanism (only CRITICAL does).

Changes:

  • cluster_gce.py: validate len(instances) >= count when pre-provisioned instances are found; raise ProvisionError if fewer than expected
  • cluster_aws.py: add the same count validation on the non-REUSE pre-provisioned path (same bug pattern)
  • tester.py: escalate ProvisionError and ProvisionUnrecoverableError to CRITICAL severity in teardown_on_exception so the EventsAnalyzer can interrupt the test run
  • Add unit tests covering all three fixes

Fixes: https://scylladb.atlassian.net/browse/SCT-501

Testing

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

Reminders

  • Add New configuration option and document them (in sdcm/sct_config.py)
  • Add unit tests to cover my changes (under unit-test/ folder)
  • Update the Readme/doc folder relevant to this change (if needed)

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR prevents SCT from silently continuing a test run with fewer nodes than requested when cloud pre-provisioning partially succeeds, and ensures provisioning-related failures are emitted as CRITICAL events so the EventsAnalyzer can interrupt the run.

Changes:

  • Add guards in GCE add_nodes() and AWS _create_or_find_instances() to validate discovered pre-provisioned instance count is >= requested count, otherwise raise ProvisionError.
  • Escalate ProvisionError / ProvisionUnrecoverableError to Severity.CRITICAL when publishing TestFrameworkEvent from teardown_on_exception.
  • Add unit tests covering the new count validation and severity escalation behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.

File Description
sdcm/cluster_gce.py Raises ProvisionError if fewer pre-provisioned instances are found than requested.
sdcm/cluster_aws.py Adds the same partial pre-provisioning detection/guard for AWS.
sdcm/tester.py Publishes provisioning failures with CRITICAL severity from teardown_on_exception.
unit_tests/unit/test_partial_provision_guard.py New unit tests for GCE/AWS count validation and teardown severity escalation.

Comment thread unit_tests/unit/test_partial_provision_guard.py Outdated
Comment thread unit_tests/unit/test_partial_provision_guard.py Outdated
Comment thread unit_tests/unit/test_partial_provision_guard.py Outdated
Comment thread unit_tests/unit/test_partial_provision_guard.py Outdated
Comment thread unit_tests/unit/test_partial_provision_guard.py
Comment thread unit_tests/unit/test_partial_provision_guard.py
Comment thread unit_tests/unit/test_partial_provision_guard.py Outdated
Comment thread sdcm/tester.py Outdated
@juliayakovlev juliayakovlev force-pushed the fix/sct-501-partial-provision-continues-test branch 2 times, most recently from 2559dd6 to afed04d Compare June 29, 2026 08:30
@scylladb-promoter

scylladb-promoter commented Jun 29, 2026

Copy link
Copy Markdown
Collaborator

✅ Test Summary: PASSED

✅ Precommit: PASSED

Total Passed Failed Skipped
26 15 0 11

✅ Tests: PASSED

Total Passed Failed Errors Skipped
3679 3648 0 0 31

Full build log

…CT-501)

When pre-provisioning partially fails (e.g. GCE zone exhaustion after
creating some instances), the test setUp would silently continue with
fewer nodes than requested. This happened because add_nodes found the
partially-created instances via _get_instances but never validated
their count against the requested count.

Root cause: the GCE instance provider uses parallel API inserts, so
some instances can be created before a ZoneResourcesExhaustedError
terminates the batch. These orphaned instances persist in GCE with
matching TestId tags. When setUp later calls add_nodes, it discovers
them via _get_instances and proceeds without checking the count.

Additionally, provisioning failures raised during setUp were published
as TestFrameworkEvent with default ERROR severity, which does not
trigger the EventsAnalyzer interrupt mechanism (only CRITICAL does).

Changes:
- cluster_gce.py: validate len(instances) >= count when pre-provisioned
  instances are found; raise ProvisionError if fewer than expected
- cluster_aws.py: add the same count validation on the non-REUSE
  pre-provisioned path (same bug pattern)
- tester.py: escalate ProvisionError and ProvisionUnrecoverableError
  to CRITICAL severity in teardown_on_exception so the EventsAnalyzer
  can interrupt the test run
- Add unit tests covering all three fixes

@cezarmoise cezarmoise left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that oci/aws also have something like instances = self._create_instances(count.... Why not add the check there too?

@juliayakovlev juliayakovlev force-pushed the fix/sct-501-partial-provision-continues-test branch from afed04d to 4b16bf4 Compare June 29, 2026 09:06
@juliayakovlev juliayakovlev added test-provision-gce Run provision test on GCE and removed test-provision-gce Run provision test on GCE test-provision-aws Run provision test on AWS labels Jun 29, 2026
@juliayakovlev

Copy link
Copy Markdown
Contributor Author

I see that oci/aws also have something like instances = self._create_instances(count.... Why not add the check there too?

@cezarmoise
The count check is added already in the cluster_aws.
Other backends code flow is different than GCE/AWS

@juliayakovlev juliayakovlev marked this pull request as ready for review June 29, 2026 16:22
@juliayakovlev juliayakovlev requested a review from fruch as a code owner June 29, 2026 16:22
@coderabbitai

coderabbitai Bot commented Jun 29, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough B{isinstance ProvisionError or ProvisionUnrecoverableError?}
B -->|Yes| C[severity = Severity.CRITICAL]
B -->|No| D{exc has severity attr?}
D -->|Yes| E[severity = exc.severity]
D -->|No| F[severity = None]
C --> G[TestFrameworkEvent published]
E --> G
F --> G
</diagram>
</layer>
<layer id="layer_tests" title="Unit tests for provision guards and severity escalation" depends_on="layer_severity">
<summary>New test module covers GCE count-mismatch ProvisionError, GCE zero-instances fallback to _create_instances, AWS count-mismatch ProvisionError, and teardown_on_exception severity mapping for various exception types.</summary>
<ranges>
range_56d6f033a8c5
range_55bdf3f8538a
range_2d8b7c8ddc5c
range_0e0b36c90eab
range_9836d887c504
range_8f795e5acc40
</ranges>
</layer>
</cohort>
<unassigned_ranges>
</unassigned_ranges>
-->

## Walkthrough

Adds count-validation guards in `AWSCluster._create_or_find_instances` and `GCECluster.add_nodes` that raise `ProvisionError` when the number of discovered pre-provisioned instances is less than the requested count. Updates `teardown_on_exception` in `tester.py` to map provisioning exceptions (`ProvisionError`, `ProvisionUnrecoverableError`) to `Severity.CRITICAL` when emitting `TestFrameworkEvent`. A new unit test module covers all added behaviors.

## Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

## Possibly related PRs

- [scylladb/scylla-cluster-tests#15122](https://github.com/scylladb/scylla-cluster-tests/pull/15122): Directly modifies the same `teardown_on_exception` function in `sdcm/tester.py` to change provisioning failure reporting behavior.

## Suggested labels

`Bug`, `Infrastructure`, `GCE`

## Suggested reviewers

- fruch
- dimakr
- pehala

</details>

<!-- walkthrough_end -->
<!-- pre_merge_checks_walkthrough_start -->

<details>
<summary>🚥 Pre-merge checks | ✅ 4 | ❌ 1</summary>

### ❌ Failed checks (1 warning)

|     Check name     | Status     | Explanation                                                                           | Resolution                                                                         |
| :----------------: | :--------- | :------------------------------------------------------------------------------------ | :--------------------------------------------------------------------------------- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 65.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

<details>
<summary>✅ Passed checks (4 passed)</summary>

|         Check name         | Status   | Explanation                                                                                                      |
| :------------------------: | :------- | :--------------------------------------------------------------------------------------------------------------- |
|         Title check        | ✅ Passed | Title is concise and accurately captures the main fixes: partial provisioning detection and severity escalation. |
|      Description check     | ✅ Passed | Description follows the template and includes testing, self-review checklists, and reminders.                    |
|     Linked Issues check    | ✅ Passed | Check skipped because no linked issues were found for this pull request.                                         |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.                                         |

</details>

</details>

<!-- pre_merge_checks_walkthrough_end -->
<!-- finishing_touch_checkbox_start -->

<details>
<summary>✨ Finishing Touches</summary>

<details>
<summary>🧪 Generate unit tests (beta)</summary>

- [ ] <!-- {"checkboxId": "f47ac10b-58cc-4372-a567-0e02b2c3d479", "radioGroupId": "utg-output-choice-group-unknown_comment_id"} -->   Create PR with unit tests

</details>

</details>

<!-- finishing_touch_checkbox_end -->
<!-- tips_start -->

---

Thanks for using [CodeRabbit](https://coderabbit.ai?utm_source=oss&utm_medium=github&utm_campaign=scylladb/scylla-cluster-tests&utm_content=15248)! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

<details>
<summary>❤️ Share</summary>

- [X](https://twitter.com/intent/tweet?text=I%20just%20used%20%40coderabbitai%20for%20my%20code%20review%2C%20and%20it%27s%20fantastic%21%20It%27s%20free%20for%20OSS%20and%20offers%20a%20free%20trial%20for%20the%20proprietary%20code.%20Check%20it%20out%3A&url=https%3A//coderabbit.ai)
- [Mastodon](https://mastodon.social/share?text=I%20just%20used%20%40coderabbitai%20for%20my%20code%20review%2C%20and%20it%27s%20fantastic%21%20It%27s%20free%20for%20OSS%20and%20offers%20a%20free%20trial%20for%20the%20proprietary%20code.%20Check%20it%20out%3A%20https%3A%2F%2Fcoderabbit.ai)
- [Reddit](https://www.reddit.com/submit?title=Great%20tool%20for%20code%20review%20-%20CodeRabbit&text=I%20just%20used%20CodeRabbit%20for%20my%20code%20review%2C%20and%20it%27s%20fantastic%21%20It%27s%20free%20for%20OSS%20and%20offers%20a%20free%20trial%20for%20proprietary%20code.%20Check%20it%20out%3A%20https%3A//coderabbit.ai)
- [LinkedIn](https://www.linkedin.com/sharing/share-offsite/?url=https%3A%2F%2Fcoderabbit.ai&mini=true&title=Great%20tool%20for%20code%20review%20-%20CodeRabbit&summary=I%20just%20used%20CodeRabbit%20for%20my%20code%20review%2C%20and%20it%27s%20fantastic%21%20It%27s%20free%20for%20OSS%20and%20offers%20a%20free%20trial%20for%20proprietary%20code)

</details>


<sub>Comment `@coderabbitai help` to get the list of available commands.</sub>

<!-- tips_end -->

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
sdcm/cluster_gce.py (1)

659-668: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Enforce exact instance counts on both branches.

The new guard only runs for the non-REUSE_CLUSTER path and only for len(instances) < count. A reused cluster with too few instances still proceeds, and len(instances) > count also proceeds; in the latter case Lines 683-693 create too many nodes while _node_index is only incremented by count.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@sdcm/cluster_gce.py` around lines 659 - 668, The instance count validation in
the cluster provisioning flow is incomplete: the REUSE_CLUSTER branch in the
instance lookup/provisioning path allows too few nodes, and the non-reuse branch
only checks for fewer-than-requested instances. Update the logic around the
instance retrieval in the relevant cluster GCE method so both branches enforce
an exact match between the pre-provisioned instances and the requested count,
and fail when there are either fewer or more instances than expected before any
node creation or _node_index advancement happens.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@sdcm/cluster_aws.py`:
- Around line 515-522: The pre-provisioned instance selection in _get_instances
currently rejects only shortages and returns the full list on overages, which
lets extra instances leak into AWSCluster.add_nodes. Update the instance count
check in cluster_aws.py so it enforces an exact match against count, raising
ProvisionError when len(instances) is not equal to count, and keep the return
path only for the exact-count case.

---

Outside diff comments:
In `@sdcm/cluster_gce.py`:
- Around line 659-668: The instance count validation in the cluster provisioning
flow is incomplete: the REUSE_CLUSTER branch in the instance lookup/provisioning
path allows too few nodes, and the non-reuse branch only checks for
fewer-than-requested instances. Update the logic around the instance retrieval
in the relevant cluster GCE method so both branches enforce an exact match
between the pre-provisioned instances and the requested count, and fail when
there are either fewer or more instances than expected before any node creation
or _node_index advancement happens.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: d81799c8-6a3c-4470-8fdf-79413b881b6e

📥 Commits

Reviewing files that changed from the base of the PR and between 20a3a25 and 4b16bf4.

📒 Files selected for processing (4)
  • sdcm/cluster_aws.py
  • sdcm/cluster_gce.py
  • sdcm/tester.py
  • unit_tests/unit/test_partial_provision_guard.py

Comment thread sdcm/cluster_aws.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants