
[AV-134634] Fix - Fail fast and auto-recover from private endpoint service failed states#647

Open
cloudy-vishnu wants to merge 6 commits into main from AV-134634-fix-terraform-polling-failed-state

Conversation

@cloudy-vishnu

@cloudy-vishnu cloudy-vishnu commented Jun 16, 2026


Jira

  • AV-134634

Description

When private endpoint service enablement fails, the backend reaches a terminal enableFailed state, but the provider only ever saw {"enabled": false}. It polled that boolean for 60 minutes, timed out with a generic error, and re-POSTed enable on the next apply, looping over orphaned infra with no automated recovery. Failed endpoint associations had a similar problem: they stayed in state forever.

This PR makes the provider status-aware so it fails fast on terminal states, automatically cleans up a failed enable, and lets the next apply perform a clean re-create with no manual intervention. It pairs with the control-plane change that exposes status and allows disable from enableFailed; it is fully backward compatible with control planes that do not yet report a status.

Changes

API model

  • Added an optional Status *string to GetPrivateEndpointServiceStatusResponse.
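A minimal sketch of why a pointer-typed field keeps older control planes working: when the key is absent, `Status` stays `nil` and callers can fall back to the boolean. The JSON tags and the `decodeStatus` helper are assumptions made for illustration, inferred from the payloads quoted in this description; they are not taken from the provider source.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// GetPrivateEndpointServiceStatusResponse mirrors the model named in the PR.
// The JSON tags are assumptions based on the payloads quoted in the description.
type GetPrivateEndpointServiceStatusResponse struct {
	Enabled bool    `json:"enabled"`
	Status  *string `json:"status,omitempty"`
}

// decodeStatus is a hypothetical helper used only for this illustration.
func decodeStatus(body string) (GetPrivateEndpointServiceStatusResponse, error) {
	var r GetPrivateEndpointServiceStatusResponse
	err := json.Unmarshal([]byte(body), &r)
	return r, err
}

func main() {
	// Older control plane: no status key, so Status remains nil.
	older, _ := decodeStatus(`{"enabled": false}`)
	fmt.Println("status reported:", older.Status != nil)

	// Newer control plane: status key present.
	newer, _ := decodeStatus(`{"enabled": false, "status": "enableFailed"}`)
	fmt.Println("status reported:", newer.Status != nil, "value:", *newer.Status)
}
```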

Status-aware polling

  • Rewrote waitUntilStatusChanges to use the lifecycle status when present: enableFailed/disableFailed return immediately with a typed terminal error; enabling/disabling/unknown keep polling (transient); enabled/disabled/idle resolve against the enabled boolean; status absent (older control plane) falls back to the boolean with unchanged behavior.
  • Reduced the poll interval to 30s and kept the 60-minute timeout as a backstop.
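The classification described above can be sketched as a pure function that the poll loop would call on each response. This is an illustration under stated assumptions, not the body of `waitUntilStatusChanges`: the `classify` and `strPtr` names, the `pollOutcome` type, and the choice to treat unrecognized status values as transient are all hypothetical.

```go
package main

import "fmt"

// pollOutcome is a hypothetical type describing what the poll loop does next.
type pollOutcome int

const (
	keepPolling pollOutcome = iota
	reached
	terminalFailure
)

// strPtr is a small helper for building optional status values.
func strPtr(s string) *string { return &s }

// classify maps one API response to a poll decision.
// wantEnabled is the target of the operation (true for enable, false for disable).
func classify(status *string, enabled, wantEnabled bool) pollOutcome {
	if status == nil {
		// Older control plane: only the boolean is available.
		if enabled == wantEnabled {
			return reached
		}
		return keepPolling
	}
	switch *status {
	case "enableFailed", "disableFailed":
		return terminalFailure
	case "enabling", "disabling", "unknown":
		return keepPolling
	case "enabled", "disabled", "idle":
		// Settled states resolve against the boolean.
		if enabled == wantEnabled {
			return reached
		}
		return keepPolling
	default:
		// Assumption: values not listed in the PR are treated as transient.
		return keepPolling
	}
}

func main() {
	fmt.Println(classify(strPtr("enableFailed"), false, true) == terminalFailure)
	fmt.Println(classify(strPtr("enabling"), false, true) == keepPolling)
	fmt.Println(classify(nil, true, true) == reached)
}
```

Keeping the decision separate from the timing (30s interval, 60-minute backstop) makes each branch straightforward to unit test without sleeping.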

Terminal-failure handling

  • Create and enable-flavored Update: on enableFailed, issue a disable (DELETE) to trigger backend cleanup, wait up to 15 minutes for disabled/idle, remove the resource from state, and return an actionable error. The resource is removed even if cleanup itself fails, with a message directing escalation, because leaving a permanently-failed resource in state recreates the stuck-pipeline problem.
  • Delete: on disableFailed, fail fast and keep the resource in state so a re-run of terraform destroy retries naturally.
  • Read: on enableFailed, remove the resource from state so drift detection recreates it instead of leaving it wedged.
  • Added ErrPrivateEndpointServiceEnableFailed and ErrPrivateEndpointServiceDisableFailed.
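The per-operation handling above can be summarized as a small decision table keyed on the typed errors. The sketch below is an assumption-laden illustration: the two error names come from this PR, but their messages, the `operation` and `decision` types, and the `decide` function are hypothetical and only model the described behavior.

```go
package main

import (
	"errors"
	"fmt"
)

// Error names are from the PR; the messages are placeholders.
var (
	ErrPrivateEndpointServiceEnableFailed  = errors.New("private endpoint service enable failed")
	ErrPrivateEndpointServiceDisableFailed = errors.New("private endpoint service disable failed")
)

// operation is a hypothetical label for the Terraform CRUD entry point.
type operation string

const (
	opCreate       operation = "create"
	opEnableUpdate operation = "update(enable)"
	opRead         operation = "read"
	opDelete       operation = "delete"
)

// decision models what the provider does after a terminal failure.
type decision struct {
	RunCleanup      bool // issue disable (DELETE) and wait for disabled/idle
	RemoveFromState bool // drop the resource so the next apply re-creates it
}

func decide(op operation, err error) decision {
	switch {
	case errors.Is(err, ErrPrivateEndpointServiceEnableFailed):
		switch op {
		case opCreate, opEnableUpdate:
			return decision{RunCleanup: true, RemoveFromState: true}
		case opRead:
			// Read stays free of side effects: state removal only.
			return decision{RemoveFromState: true}
		}
	case errors.Is(err, ErrPrivateEndpointServiceDisableFailed):
		// Delete keeps the resource in state so terraform destroy can retry.
		return decision{}
	}
	return decision{}
}

func main() {
	wrapped := fmt.Errorf("polling: %w", ErrPrivateEndpointServiceEnableFailed)
	fmt.Printf("%+v\n", decide(opCreate, wrapped))
	fmt.Printf("%+v\n", decide(opDelete, ErrPrivateEndpointServiceDisableFailed))
}
```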

Failed associations

  • private_endpoints Read now removes both rejected and failed associations from state to force a clean re-association.
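As a rough illustration of the Read-time filtering, the sketch below drops associations whose status is terminal. The `association` type, the `filterAssociations` function, and the literal status strings (`"rejected"`, `"failed"`, `"linked"`) are assumptions; the PR text names the rejected and failed states but does not show their wire values.

```go
package main

import "fmt"

// association is a hypothetical, trimmed-down view of an endpoint association.
type association struct {
	ID     string
	Status string
}

// isTerminal reports whether an association should be dropped from state.
// The status strings are assumed spellings of the states named in the PR.
func isTerminal(status string) bool {
	switch status {
	case "rejected", "failed":
		return true
	}
	return false
}

// filterAssociations returns only the associations that stay in state.
func filterAssociations(in []association) []association {
	out := make([]association, 0, len(in))
	for _, a := range in {
		if !isTerminal(a.Status) {
			out = append(out, a)
		}
	}
	return out
}

func main() {
	kept := filterAssociations([]association{
		{ID: "ep-1", Status: "linked"},
		{ID: "ep-2", Status: "failed"},
		{ID: "ep-3", Status: "rejected"},
	})
	fmt.Println(len(kept), kept[0].ID)
}
```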

Status exposure, schema, and docs

  • Added a computed status attribute to the resource and data source, populated from the API.
  • Documented terminal vs transient states in the resource and data source docs.

Compatibility

  • The change only activates when the API reports a status. Against an older control plane the provider falls back to boolean polling, so behavior is unchanged. Recommended rollout is control plane first, then this provider release.
  • The computed status attribute follows the same pattern as the existing current_state attribute on the cluster resource (plain computed, planned as known-after-apply), so enable/disable toggles do not cause inconsistent-result errors.

Type of Change

  • Bug fix (non-breaking change which fixes an issue).
  • New feature (non-breaking change which adds functionality).
  • Breaking change (fix or feature that would cause existing functionality to not work as expected).
  • This change updates the ci/cd workflow.
  • Documentation fix/enhancement.

Manual Testing Approach

How was this change tested and do you have evidence? (REQUIRED: Select at least 1)

  • Manually tested
  • Unit tested
  • Acceptance tested
  • Unable to test / will not test (Please provide comments in section below)

Testing


Unit tests

go vet ./... is clean and make test passes (run via make test so OPENAPI_SPEC_URL is exported; a bare go test leaves schema-description tests without the spec and is expected to fail). Coverage includes the new status-aware poll loop, the terminal-failure handling, and the failed/rejected association removal.

Acceptance test

TestAccPrivateEndpointServiceEnableDisable passes against real Capella (enable, read, disable):

--- PASS: TestAccPrivateEndpointServiceEnableDisable (273.89s)

The other failures in the suite run are unrelated environment/backend issues (bucket-quota exhaustion on the shared cluster and transient 500s during bucket creation), not changes in this PR.

### Manual end-to-end against a local control plane

Validated the full lifecycle against a local `cbclocal` control plane (which exposes the new `status` field), using a dev-override build of this provider.

**Enable (`terraform apply`)**
- Plan renders `status = (known after apply)`, confirming the computed attribute is planned as unknown, so it cannot trigger a "provider produced inconsistent result after apply" error when the value changes.
- The status-aware poll loop queries the service status until the backend job completes and the state converges (creation completed after ~5m31s).
- `terraform show` reports the resolved state:

      resource "couchbase-capella_private_endpoint_service" "svc" {
          cluster_id      = "f7c7e93e-554e-46bc-bcd0-1d7cd7eff9cb"
          enabled         = true
          organization_id = "adb4fb4c-1d98-4287-ac33-230742d2cc76"
          project_id      = "c09035e8-3971-4b79-a6ec-bccd9056041f"
          status          = "enabled"
      }

**Disable (`terraform destroy`)**
- Refresh succeeds and the service is disabled cleanly (destroy completed after ~1m):

      Destroy complete! Resources: 1 destroyed.

This exercises status exposure, the computed status attribute, the 30s status-aware polling, and the enable/disable lifecycle against a real backend with no errors.

Terminal-failure path

A healthy cluster enables successfully, so the enableFailed recovery path is not reachable from the happy-path apply above. It is covered by unit tests (fail-fast on enableFailed, DELETE-triggered cleanup, RemoveResource, and the cleanup-failure escalation), and can be exercised manually with a mock control plane returning {"enabled": false, "status": "enableFailed"} on GET and 202 on DELETE, or by inducing an enable-job failure in the backend. Verified behavior in that case: apply fails fast (within one poll interval, not the 60-minute timeout), the resource is removed from state, and a subsequent terraform apply performs a clean re-create with no manual intervention.
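A mock control plane along the lines described above could be sketched with `net/http/httptest`. This is only an illustration: the handler ignores the request path (the real endpoint path is not given here), accepting POST with 202 is an assumption so that an enable call would not be rejected, and `newMockControlPlane` and `probe` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newMockControlPlane returns a server that always reports enableFailed on GET
// and accepts DELETE with 202. POST is also accepted (assumption).
func newMockControlPlane() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		switch r.Method {
		case http.MethodGet:
			w.Header().Set("Content-Type", "application/json")
			_ = json.NewEncoder(w).Encode(map[string]any{
				"enabled": false,
				"status":  "enableFailed",
			})
		case http.MethodPost, http.MethodDelete:
			w.WriteHeader(http.StatusAccepted)
		default:
			w.WriteHeader(http.StatusMethodNotAllowed)
		}
	}))
}

// probe sends one request and returns the HTTP code and any decoded status.
func probe(method, url string) (int, string, error) {
	req, err := http.NewRequest(method, url, nil)
	if err != nil {
		return 0, "", err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	var body struct {
		Status string `json:"status"`
	}
	// An empty body (as on DELETE) leaves Status empty; the decode error is ignored.
	_ = json.NewDecoder(resp.Body).Decode(&body)
	return resp.StatusCode, body.Status, nil
}

func main() {
	srv := newMockControlPlane()
	defer srv.Close()

	code, status, _ := probe(http.MethodGet, srv.URL)
	fmt.Println(code, status)
	code, _, _ = probe(http.MethodDelete, srv.URL)
	fmt.Println(code)
}
```

Pointing a dev-override build of the provider at such a server would require whatever host configuration the provider supports; that wiring is outside this sketch.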

Further comments

@github-actions github-actions Bot added the bug and enhancement labels Jun 16, 2026
@github-actions

github-actions Bot commented Jun 16, 2026


🚨 PR title "AV-134634: [FEAT] - Fail fast and auto-recover from private endpoint service failed states" does not match the required format.

Requirements:
- Must start with [AV-XXXXX] where X is any number of digits
- After the bracket, must start with a Verb (Add, Update, Fix, etc.)
- The Verb must start with an uppercase letter

Expected format: [AV-XXXXX] Verb ...
Example: [AV-98659] Implement Cluster On/Off feature
Valid verbs: Add, Update, Fix, Implement, Remove, Refactor, etc.
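The stated title rules can be approximated with a regular expression. The sketch below is not the bot's actual check, which is not shown in this thread: it assumes a single space after the closing bracket (as in the expected format) and only verifies that the next word starts with an uppercase letter, since it cannot tell whether that word is a verb. `titlePattern` and `validTitle` are hypothetical names.

```go
package main

import (
	"fmt"
	"regexp"
)

// titlePattern approximates: "[AV-<digits>] " followed by a capitalized word.
var titlePattern = regexp.MustCompile(`^\[AV-[0-9]+\] [A-Z][A-Za-z]*\b`)

// validTitle reports whether a PR title matches the approximated format.
func validTitle(title string) bool {
	return titlePattern.MatchString(title)
}

func main() {
	titles := []string{
		"[AV-98659] Implement Cluster On/Off feature",
		"[AV-134634]: FIX - Fail fast and auto-recover from private endpoint service failed states",
		"[AV-134634] Fix - Fail fast and auto-recover from private endpoint service failed states",
	}
	for _, t := range titles {
		fmt.Printf("%v  %s\n", validTitle(t), t)
	}
}
```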

@cloudy-vishnu cloudy-vishnu changed the title [AV-134634]: [FEAT] - Fail fast and auto-recover from private endpoint service failed states [AV-134634]: FIX - Fail fast and auto-recover from private endpoint service failed states Jun 16, 2026
@github-actions

github-actions Bot commented Jun 16, 2026


🚨 PR title "AV-134634: FIX - Fail fast and auto-recover from private endpoint service failed states" does not match the required format.

Requirements:
- Must start with [AV-XXXXX] where X is any number of digits
- After the bracket, must start with a Verb (Add, Update, Fix, etc.)
- The Verb must start with an uppercase letter

Expected format: [AV-XXXXX] Verb ...
Example: [AV-98659] Implement Cluster On/Off feature
Valid verbs: Add, Update, Fix, Implement, Remove, Refactor, etc.

@cloudy-vishnu cloudy-vishnu changed the title [AV-134634]: FIX - Fail fast and auto-recover from private endpoint service failed states [AV-134634] Fix - Fail fast and auto-recover from private endpoint service failed states Jun 16, 2026
@cloudy-vishnu cloudy-vishnu marked this pull request as ready for review June 16, 2026 21:09
@cloudy-vishnu cloudy-vishnu requested a review from a team as a code owner June 16, 2026 21:09
@couchbasecloud couchbasecloud deleted a comment from factory-droid Bot Jun 16, 2026
@couchbasecloud couchbasecloud deleted a comment from factory-droid Bot Jun 16, 2026
Comment thread internal/resources/private_endpoint_service.go
Comment thread docs/resources/private_endpoint_service.md

@IsaacLambat IsaacLambat left a comment

Contributor


Mostly nits, though the other reviewers' comments make sense.

Code comments are quite excessive at times, so I think lots can be trimmed/removed. Aim for clean code that describes what is happening so we don't need comments explaining what's happening.

Comment thread docs/data-sources/private_endpoint_service.md
Comment thread internal/resources/private_endpoint_service.go Outdated
Comment thread internal/resources/private_endpoint_service.go Outdated
Comment thread internal/resources/private_endpoint_service.go Outdated

Copilot AI left a comment

Contributor


Pull request overview

This PR makes the private endpoint service resource/data source “status-aware” by consuming a new optional lifecycle status from the Capella API, enabling fail-fast behavior on terminal failed states and automated cleanup/remediation paths to avoid permanently wedged Terraform state.

Changes:

  • Extend the private endpoint service status API model to include an optional status lifecycle field and surface it as a computed status attribute in the resource/data source.
  • Rewrite the enable/disable poll loop to use lifecycle status when present (fail fast on enableFailed/disableFailed, keep polling on transient states, fallback to boolean on older control planes), and add recovery logic for terminal enable failure (trigger DELETE cleanup + remove from state).
  • Improve behavior for failed private endpoint associations by removing failed/rejected associations from state to force clean re-association; update docs accordingly.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| internal/schema/private_endpoint_service.go | Adds status to the Terraform model for private endpoint service. |
| internal/resources/private_endpoints.go | Removes terminal failed/rejected associations from state during Read. |
| internal/resources/private_endpoint_service.go | Adds lifecycle status constants, status-aware polling, terminal failure handling (cleanup/remove state), and exposes status in state. |
| internal/resources/private_endpoint_service_test.go | Adds unit coverage for status-aware polling and terminal failure cleanup/removal logic. |
| internal/resources/private_endpoint_service_schema.go | Adds computed status attribute and documents terminal enable failure behavior in schema description. |
| internal/errors/errors.go | Adds typed terminal errors for enable/disable failed states. |
| internal/datasources/private_endpoint_service.go | Populates computed status from the API when available. |
| internal/datasources/private_endpoint_service_schema.go | Adds computed status attribute to the data source schema. |
| internal/api/private_endpoint_service.go | Adds optional Status *string to the API response model. |
| docs/resources/private_endpoint_service.md | Documents status and enableFailed recovery behavior for the resource. |
| docs/resources/cluster.md | Improves deletion_protection documentation text. |
| docs/data-sources/private_endpoint_service.md | Documents status for the data source. |
| docs/data-sources/free_tier_clusters.md | Improves deletion_protection documentation text. |
| docs/data-sources/clusters.md | Improves deletion_protection documentation text. |
Comments suppressed due to low confidence (1)

internal/resources/private_endpoint_service.go:277

  • Missing space in this error detail string produces a garbled message ("enablingprivate...").
		resp.Diagnostics.AddError(
			"Error "+status+" private endpoint service",
			"Error "+status+"private endpoint service, unexpected error: "+err.Error(),
		)
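One way to avoid this class of bug is to build the detail string with a format string, where the spacing is visible in one place, instead of chained concatenation. The sketch below is a suggestion, not the fix that was applied in the PR; `errorDetail` and `stubErr` are hypothetical names.

```go
package main

import "fmt"

// stubErr is a tiny error type used only to keep this example self-contained.
type stubErr string

func (e stubErr) Error() string { return string(e) }

// errorDetail formats the diagnostic detail with the spacing fixed in one place.
func errorDetail(status string, err error) string {
	return fmt.Sprintf("Error %s private endpoint service, unexpected error: %s", status, err)
}

func main() {
	fmt.Println(errorDetail("enabling", stubErr("boom")))
	// prints "Error enabling private endpoint service, unexpected error: boom"
}
```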

Comment thread internal/resources/private_endpoint_service.go Outdated
Comment thread internal/resources/private_endpoints.go
Comment thread internal/resources/private_endpoint_service.go Outdated
Comment thread docs/resources/private_endpoint_service.md
Comment thread docs/data-sources/private_endpoint_service.md
}

diags = resp.State.Set(ctx, &refreshedState)
if refreshedState.Status.ValueString() == statusEnableFailed {

Contributor


Read removes the resource on enableFailed without running cleanup - unlike Create/Update, the next apply will re-POST enable onto orphaned infra. Should this route through cleanupFailedEnable too?

Author


Read is called during plan/refresh and must be side-effect free; issuing a teardown (DELETE on cloud infra) from Read is a Terraform anti-pattern and will surprise users by destroying things during a plan.

Comment thread internal/resources/private_endpoint_service.go Outdated
Comment thread internal/resources/private_endpoint_service.go
Comment thread internal/resources/private_endpoint_service.go Outdated
@cloudy-vishnu cloudy-vishnu requested a review from PaulomeeCb June 19, 2026 02:23

@saiakhil2012 saiakhil2012 left a comment


LGTM


Labels

bug, enhancement


5 participants