[AV-134634] Fix - Fail fast and auto-recover from private endpoint service failed states#647
[AV-134634] Fix - Fail fast and auto-recover from private endpoint service failed states#647cloudy-vishnu wants to merge 6 commits into
Conversation
|
🚨 PR title "AV-134634: [FEAT] - Fail fast and auto-recover from private endpoint service failed states" does not match the required format. |
|
🚨 PR title "AV-134634: FIX - Fail fast and auto-recover from private endpoint service failed states" does not match the required format. |
IsaacLambat
left a comment
There was a problem hiding this comment.
mostly nit, though the other reviewers comments make sense.
Code comments are quite excessive at times, so I think lots can be trimmed/removed. Aim for clean code that describes what is happening so we don't need to have comments explaining whats happening.
…ouchbase-capella into AV-134634-fix-terraform-polling-failed-state
There was a problem hiding this comment.
Pull request overview
This PR makes the private endpoint service resource/data source “status-aware” by consuming a new optional lifecycle status from the Capella API, enabling fail-fast behavior on terminal failed states and automated cleanup/remediation paths to avoid permanently wedged Terraform state.
Changes:
- Extend the private endpoint service status API model to include an optional
statuslifecycle field and surface it as a computedstatusattribute in the resource/data source. - Rewrite the enable/disable poll loop to use lifecycle
statuswhen present (fail fast onenableFailed/disableFailed, keep polling on transient states, fallback to boolean on older control planes), and add recovery logic for terminal enable failure (trigger DELETE cleanup + remove from state). - Improve behavior for failed private endpoint associations by removing
failed/rejectedassociations from state to force clean re-association; update docs accordingly.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| internal/schema/private_endpoint_service.go | Adds status to the Terraform model for private endpoint service. |
| internal/resources/private_endpoints.go | Removes terminal failed/rejected associations from state during Read. |
| internal/resources/private_endpoint_service.go | Adds lifecycle status constants, status-aware polling, terminal failure handling (cleanup/remove state), and exposes status in state. |
| internal/resources/private_endpoint_service_test.go | Adds unit coverage for status-aware polling and terminal failure cleanup/removal logic. |
| internal/resources/private_endpoint_service_schema.go | Adds computed status attribute and documents terminal enable failure behavior in schema description. |
| internal/errors/errors.go | Adds typed terminal errors for enable/disable failed states. |
| internal/datasources/private_endpoint_service.go | Populates computed status from the API when available. |
| internal/datasources/private_endpoint_service_schema.go | Adds computed status attribute to the data source schema. |
| internal/api/private_endpoint_service.go | Adds optional Status *string to the API response model. |
| docs/resources/private_endpoint_service.md | Documents status and enableFailed recovery behavior for the resource. |
| docs/resources/cluster.md | Improves deletion_protection documentation text. |
| docs/data-sources/private_endpoint_service.md | Documents status for the data source. |
| docs/data-sources/free_tier_clusters.md | Improves deletion_protection documentation text. |
| docs/data-sources/clusters.md | Improves deletion_protection documentation text. |
Comments suppressed due to low confidence (1)
internal/resources/private_endpoint_service.go:277
- Missing space in this error detail string produces a garbled message ("enablingprivate...").
resp.Diagnostics.AddError(
"Error "+status+" private endpoint service",
"Error "+status+"private endpoint service, unexpected error: "+err.Error(),
)
| } | ||
|
|
||
| diags = resp.State.Set(ctx, &refreshedState) | ||
| if refreshedState.Status.ValueString() == statusEnableFailed { |
There was a problem hiding this comment.
Read removes the resource on enableFailed without running cleanup - unlike Create/Update, the next apply will re-POST enable onto orphaned infra. Should this route through cleanupFailedEnable too?
There was a problem hiding this comment.
Read is called during plan/refresh and must be side-effect free; issuing a teardown (DELETE on cloud infra) from Read is a Terraform anti-pattern and will surprise users by destroying things during a plan.
Jira
Description
When private endpoint service enablement fails, the backend reaches a terminal
enableFailedstate, but the provider only ever saw{"enabled": false}. It polled that boolean for 60 minutes, timed out with a generic error, and re-POSTed enable on the next apply, looping over orphaned infra with no automated recovery. Failed endpoint associations had a similar problem: they stayed in state forever.This PR makes the provider status-aware so it fails fast on terminal states, automatically cleans up a failed enable, and lets the next apply perform a clean re-create with no manual intervention. It pairs with the control-plane change that exposes
statusand allows disable fromenableFailed; it is fully backward compatible with control planes that do not yet report a status.Changes
API model
Status *stringtoGetPrivateEndpointServiceStatusResponse.Status-aware polling
waitUntilStatusChangesto use the lifecyclestatuswhen present:enableFailed/disableFailedreturn immediately with a typed terminal error;enabling/disabling/unknownkeep polling (transient);enabled/disabled/idleresolve against theenabledboolean; status absent (older control plane) falls back to the boolean with unchanged behavior.Terminal-failure handling
enableFailed, issue a disable (DELETE) to trigger backend cleanup, wait up to 15 minutes fordisabled/idle, remove the resource from state, and return an actionable error. The resource is removed even if cleanup itself fails, with a message directing escalation, because leaving a permanently-failed resource in state recreates the stuck-pipeline problem.disableFailed, fail fast and keep the resource in state so a re-run ofterraform destroyretries naturally.enableFailed, remove the resource from state so drift detection recreates it instead of leaving it wedged.ErrPrivateEndpointServiceEnableFailedandErrPrivateEndpointServiceDisableFailed.Failed associations
private_endpointsRead now removes bothrejectedandfailedassociations from state to force a clean re-association.Status exposure, schema, and docs
statusattribute to the resource and data source, populated from the API.Compatibility
status. Against an older control plane the provider falls back to boolean polling, so behavior is unchanged. Recommended rollout is control plane first, then this provider release.statusattribute follows the same pattern as the existingcurrent_stateattribute on the cluster resource (plain computed, planned as known-after-apply), so enable/disable toggles do not cause inconsistent-result errors.Type of Change
Manual Testing Approach
How was this change tested and do you have evidence? (REQUIRED: Select at least 1)
Testing
Testing
Unit tests
go vet ./...is clean andmake testpasses (run viamake testsoOPENAPI_SPEC_URLis exported; a barego testleaves schema-description tests without the spec and is expected to fail). Coverage includes the new status-aware poll loop, the terminal-failure handling, and the failed/rejected association removal.Acceptance test
TestAccPrivateEndpointServiceEnableDisablepasses against real Capella (enable, read, disable):--- PASS: TestAccPrivateEndpointServiceEnableDisable (273.89s)
This exercises status exposure, the computed
statusattribute, the 30s status-aware polling, and the enable/disable lifecycle against a real backend with no errors.Terminal-failure path
A healthy cluster enables successfully, so the
enableFailedrecovery path is not reachable from the happy-path apply above. It is covered by unit tests (fail-fast onenableFailed, DELETE-triggered cleanup,RemoveResource, and the cleanup-failure escalation), and can be exercised manually with a mock control plane returning{"enabled": false, "status": "enableFailed"}on GET and202on DELETE, or by inducing an enable-job failure in the backend. Verified behavior in that case: apply fails fast (within one poll interval, not the 60-minute timeout), the resource is removed from state, and a subsequentterraform applyperforms a clean re-create with no manual intervention.Further comments