fix(gce provision): detect partial provisioning and escalate to CRITICAL severity #15248
Pull request overview
This PR prevents SCT from silently continuing a test run with fewer nodes than requested when cloud pre-provisioning partially succeeds, and ensures provisioning-related failures are emitted as CRITICAL events so the EventsAnalyzer can interrupt the run.
Changes:
- Add guards in GCE `add_nodes()` and AWS `_create_or_find_instances()` to validate discovered pre-provisioned instance count is >= requested count, otherwise raise `ProvisionError`.
- Escalate `ProvisionError`/`ProvisionUnrecoverableError` to `Severity.CRITICAL` when publishing `TestFrameworkEvent` from `teardown_on_exception`.
- Add unit tests covering the new count validation and severity escalation behavior.
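The count guard in the first bullet can be illustrated with a minimal sketch. This is not the PR's code: `ProvisionError` here is a local stand-in for the SCT exception of the same name, and `check_preprovisioned_count` is a hypothetical helper introduced only for illustration.

```python
class ProvisionError(Exception):
    """Local stand-in for the SCT exception; assumed, not imported from sdcm."""


def check_preprovisioned_count(instances, count):
    """Hypothetical helper: reject a shortfall of discovered instances.

    Returns the instances unchanged when at least `count` were found,
    otherwise raises ProvisionError so setup cannot continue silently.
    """
    if len(instances) < count:
        raise ProvisionError(
            f"found {len(instances)} pre-provisioned instance(s), expected {count}"
        )
    return instances


# Three instances requested, three discovered: passes through unchanged.
found = check_preprovisioned_count(["vm-1", "vm-2", "vm-3"], 3)

# Three requested, one discovered (partial provisioning): raises.
try:
    check_preprovisioned_count(["vm-1"], 3)
except ProvisionError as exc:
    shortfall_message = str(exc)
```

Note that a `>=` check like this one accepts surplus instances.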
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| sdcm/cluster_gce.py | Raises ProvisionError if fewer pre-provisioned instances are found than requested. |
| sdcm/cluster_aws.py | Adds the same partial pre-provisioning detection/guard for AWS. |
| sdcm/tester.py | Publishes provisioning failures with CRITICAL severity from teardown_on_exception. |
| unit_tests/unit/test_partial_provision_guard.py | New unit tests for GCE/AWS count validation and teardown severity escalation. |
2559dd6 to afed04d
✅ Test Summary: PASSED
✅ Precommit: PASSED
✅ Tests: PASSED
…CT-501)

When pre-provisioning partially fails (e.g. GCE zone exhaustion after creating some instances), the test setUp would silently continue with fewer nodes than requested. This happened because add_nodes found the partially-created instances via _get_instances but never validated their count against the requested count.

Root cause: the GCE instance provider uses parallel API inserts, so some instances can be created before a ZoneResourcesExhaustedError terminates the batch. These orphaned instances persist in GCE with matching TestId tags. When setUp later calls add_nodes, it discovers them via _get_instances and proceeds without checking the count.

Additionally, provisioning failures raised during setUp were published as TestFrameworkEvent with default ERROR severity, which does not trigger the EventsAnalyzer interrupt mechanism (only CRITICAL does).

Changes:
- cluster_gce.py: validate len(instances) >= count when pre-provisioned instances are found; raise ProvisionError if fewer than expected
- cluster_aws.py: add the same count validation on the non-REUSE pre-provisioned path (same bug pattern)
- tester.py: escalate ProvisionError and ProvisionUnrecoverableError to CRITICAL severity in teardown_on_exception so the EventsAnalyzer can interrupt the test run
- Add unit tests covering all three fixes
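The severity escalation described for tester.py can be sketched roughly as follows. Everything here is an assumption made for illustration: the `Severity` enum and its numeric values, the two exception classes, and the `severity_for` helper are local stand-ins, not the SCT definitions or the actual body of teardown_on_exception.

```python
from enum import Enum


class Severity(Enum):
    """Stand-in enum; member values are arbitrary for this sketch."""
    ERROR = 1
    CRITICAL = 2


class ProvisionError(Exception):
    """Stand-in for the SCT exception of the same name."""


class ProvisionUnrecoverableError(Exception):
    """Stand-in for the SCT exception of the same name."""


def severity_for(exc):
    """Hypothetical helper: pick the severity for a framework event.

    Provisioning failures are escalated to CRITICAL; anything else
    keeps the default ERROR severity.
    """
    if isinstance(exc, (ProvisionError, ProvisionUnrecoverableError)):
        return Severity.CRITICAL
    return Severity.ERROR


provision_severity = severity_for(ProvisionError("zone exhausted"))
generic_severity = severity_for(ValueError("unrelated failure"))
```

Keeping the decision in one `isinstance` check means only the listed provisioning exceptions change behavior; all other setup failures are published as before.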
cezarmoise left a comment
I see that oci/aws also have something like `instances = self._create_instances(count...`. Why not add the check there too?
afed04d to 4b16bf4
@cezarmoise |
📝 Walkthrough
[Flowchart: decision node B, "isinstance ProvisionError or ProvisionUnrecoverableError?"]
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
sdcm/cluster_gce.py (1)
659-668: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Enforce exact instance counts on both branches.
The new guard only runs for the non-`REUSE_CLUSTER` path and only for `len(instances) < count`. A reused cluster with too few instances still proceeds, and `len(instances) > count` also proceeds; in the latter case Lines 683-693 create too many nodes while `_node_index` is only incremented by `count`.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@sdcm/cluster_gce.py` around lines 659 - 668, The instance count validation in the cluster provisioning flow is incomplete: the REUSE_CLUSTER branch in the instance lookup/provisioning path allows too few nodes, and the non-reuse branch only checks for fewer-than-requested instances. Update the logic around the instance retrieval in the relevant cluster GCE method so both branches enforce an exact match between the pre-provisioned instances and the requested count, and fail when there are either fewer or more instances than expected before any node creation or _node_index advancement happens.
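A simplified model of the overage problem raised in this comment, followed by the exact-match check it asks for. This is a sketch under stated assumptions: `add_nodes_unchecked` is a hypothetical toy model of the flow the review describes (one node per discovered instance, index advanced by `count`), not the code in sdcm/cluster_gce.py, and `ProvisionError` is a local stand-in.

```python
class ProvisionError(Exception):
    """Local stand-in for the SCT exception; assumed, not imported from sdcm."""


def add_nodes_unchecked(instances, count, node_index=0):
    """Toy model: create one node per instance, advance the index by `count`."""
    nodes = [f"node-{node_index + offset}" for offset in range(len(instances))]
    return nodes, node_index + count


def require_exact_count(instances, count):
    """Hypothetical guard: reject both shortages and overages."""
    if len(instances) != count:
        raise ProvisionError(
            f"found {len(instances)} pre-provisioned instance(s), expected exactly {count}"
        )
    return instances


# Four instances discovered for a request of three: four nodes are created,
# but the index only advances to 3, so it no longer matches the node count.
nodes, next_index = add_nodes_unchecked(["i1", "i2", "i3", "i4"], 3)

# With the exact-match guard, the same input fails before any node is created.
try:
    require_exact_count(["i1", "i2", "i3", "i4"], 3)
    overage_rejected = False
except ProvisionError:
    overage_rejected = True
```

In this toy model a later call would start at index 3 and produce `node-3` again, which is one way the mismatch between created nodes and the advanced index could surface.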
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@sdcm/cluster_aws.py`:
- Around line 515-522: The pre-provisioned instance selection in _get_instances
currently rejects only shortages and returns the full list on overages, which
lets extra instances leak into AWSCluster.add_nodes. Update the instance count
check in cluster_aws.py so it enforces an exact match against count, raising
ProvisionError when len(instances) is not equal to count, and keep the return
path only for the exact-count case.
---
Outside diff comments:
In `@sdcm/cluster_gce.py`:
- Around line 659-668: The instance count validation in the cluster provisioning
flow is incomplete: the REUSE_CLUSTER branch in the instance lookup/provisioning
path allows too few nodes, and the non-reuse branch only checks for
fewer-than-requested instances. Update the logic around the instance retrieval
in the relevant cluster GCE method so both branches enforce an exact match
between the pre-provisioned instances and the requested count, and fail when
there are either fewer or more instances than expected before any node creation
or _node_index advancement happens.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: d81799c8-6a3c-4470-8fdf-79413b881b6e
📒 Files selected for processing (4)
- sdcm/cluster_aws.py
- sdcm/cluster_gce.py
- sdcm/tester.py
- unit_tests/unit/test_partial_provision_guard.py
Fixes: https://scylladb.atlassian.net/browse/SCT-501