[AV-134634] Fix - Fail fast and auto-recover from private endpoint service failed states by cloudy-vishnu · Pull Request #647 · couchbasecloud/terraform-provider-couchbase-capella

cloudy-vishnu · 2026-06-16T06:26:39Z

Jira

AV-134634

Description

When private endpoint service enablement fails, the backend reaches a terminal enableFailed state, but the provider only ever saw {"enabled": false}. It polled that boolean for 60 minutes, timed out with a generic error, and re-POSTed enable on the next apply, looping over orphaned infra with no automated recovery. Failed endpoint associations had a similar problem: they stayed in state forever.

This PR makes the provider status-aware so it fails fast on terminal states, automatically cleans up a failed enable, and lets the next apply perform a clean re-create with no manual intervention. It pairs with the control-plane change that exposes status and allows disable from enableFailed; it is fully backward compatible with control planes that do not yet report a status.

Changes

API model

Added an optional Status *string to GetPrivateEndpointServiceStatusResponse.
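A minimal sketch of why a pointer is a natural fit for the optional field: a nil `Status` distinguishes "control plane did not report a status" from any real value. The struct name and JSON tags below are assumptions for illustration, not the provider's actual model.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// privateEndpointServiceStatus is an illustrative stand-in for
// GetPrivateEndpointServiceStatusResponse; the JSON tags are assumed.
type privateEndpointServiceStatus struct {
	Enabled bool    `json:"enabled"`
	Status  *string `json:"status,omitempty"`
}

// decodeStatus parses a response body. Status stays nil when the
// field is absent, which is how an older control plane would look.
func decodeStatus(body string) (privateEndpointServiceStatus, error) {
	var out privateEndpointServiceStatus
	err := json.Unmarshal([]byte(body), &out)
	return out, err
}

func main() {
	bodies := []string{
		`{"enabled": false}`,
		`{"enabled": false, "status": "enableFailed"}`,
	}
	for _, body := range bodies {
		resp, err := decodeStatus(body)
		if err != nil {
			fmt.Println("decode error:", err)
			continue
		}
		if resp.Status == nil {
			fmt.Println("no status reported; fall back to enabled =", resp.Enabled)
		} else {
			fmt.Println("status =", *resp.Status)
		}
	}
}
```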

Status-aware polling

Rewrote waitUntilStatusChanges to use the lifecycle status when present:
- enableFailed/disableFailed return immediately with a typed terminal error.
- enabling/disabling/unknown keep polling (transient).
- enabled/disabled/idle resolve against the enabled boolean.
- Status absent (older control plane) falls back to the boolean, with unchanged behavior.
Reduced the poll interval to 30s and kept the 60-minute timeout as a backstop.
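The per-poll decision described above can be sketched as a single classification function. This is an illustration only, not the body of waitUntilStatusChanges: the function name, error variables, and the choice to treat unrecognized status strings as transient are all assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the typed terminal errors.
var (
	errEnableFailed  = errors.New("private endpoint service enable failed")
	errDisableFailed = errors.New("private endpoint service disable failed")
)

// pollDecision evaluates one poll result. done reports whether polling
// can stop; err is non-nil only for terminal failed states.
func pollDecision(status *string, enabled, wantEnabled bool) (done bool, err error) {
	if status == nil {
		// Older control plane: only the boolean is available.
		return enabled == wantEnabled, nil
	}
	switch *status {
	case "enableFailed":
		return true, errEnableFailed
	case "disableFailed":
		return true, errDisableFailed
	case "enabled", "disabled", "idle":
		// Settled states resolve against the enabled boolean.
		return enabled == wantEnabled, nil
	default:
		// enabling, disabling, unknown (and, by assumption here,
		// any unrecognized value) are transient: keep polling.
		return false, nil
	}
}

func strPtr(s string) *string { return &s }

func main() {
	cases := []struct {
		status  *string
		enabled bool
	}{
		{nil, true},
		{strPtr("enabling"), false},
		{strPtr("enabled"), true},
		{strPtr("enableFailed"), false},
	}
	for _, c := range cases {
		done, err := pollDecision(c.status, c.enabled, true)
		fmt.Println(done, err)
	}
}
```

A caller would invoke this once per 30s tick and stop on `done`, with the 60-minute timeout bounding the loop from outside.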

Terminal-failure handling

Create and enable-flavored Update: on enableFailed, issue a disable (DELETE) to trigger backend cleanup, wait up to 15 minutes for disabled/idle, remove the resource from state, and return an actionable error. The resource is removed even if cleanup itself fails, with a message directing escalation, because leaving a permanently-failed resource in state recreates the stuck-pipeline problem.
Delete: on disableFailed, fail fast and keep the resource in state so a re-run of terraform destroy retries naturally.
Read: on enableFailed, remove the resource from state so drift detection recreates it instead of leaving it wedged.
Added ErrPrivateEndpointServiceEnableFailed and ErrPrivateEndpointServiceDisableFailed.
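The Create/Update recovery sequence above (DELETE, bounded wait, unconditional removal from state, actionable error) could look roughly like the sketch below. The interface, function names, and error wording are hypothetical; only the ordering and the 15-minute cleanup bound come from the description.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// cleanupTimeout mirrors the 15-minute cleanup wait from the description.
const cleanupTimeout = 15 * time.Minute

var (
	errEnableFailed     = errors.New("private endpoint service enable failed")
	errSimulatedCleanup = errors.New("simulated DELETE failure")
)

// serviceClient is a hypothetical abstraction over the Capella API calls.
type serviceClient interface {
	Disable() error                              // issues the DELETE
	WaitForDisabled(timeout time.Duration) error // waits for disabled/idle
}

// recoverFromEnableFailed triggers backend cleanup, then always removes
// the resource from state, even when cleanup itself fails.
func recoverFromEnableFailed(c serviceClient, removeFromState func()) error {
	cleanupErr := c.Disable()
	if cleanupErr == nil {
		cleanupErr = c.WaitForDisabled(cleanupTimeout)
	}
	removeFromState()
	if cleanupErr != nil {
		return fmt.Errorf("%w: automatic cleanup also failed (%v); escalate before re-applying",
			errEnableFailed, cleanupErr)
	}
	return fmt.Errorf("%w: backend cleanup completed; re-run terraform apply to re-create",
		errEnableFailed)
}

// fakeClient records calls so the sequence can be observed.
type fakeClient struct {
	disableErr error
	calls      []string
}

func (f *fakeClient) Disable() error {
	f.calls = append(f.calls, "DELETE")
	return f.disableErr
}

func (f *fakeClient) WaitForDisabled(time.Duration) error {
	f.calls = append(f.calls, "wait")
	return nil
}

func main() {
	for _, c := range []*fakeClient{{}, {disableErr: errSimulatedCleanup}} {
		removed := false
		err := recoverFromEnableFailed(c, func() { removed = true })
		fmt.Println(c.calls, removed, err)
	}
}
```

Both paths return an error on purpose: the apply that hit enableFailed still fails, and only the follow-up apply re-creates the resource.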

Failed associations

private_endpoints Read now removes both rejected and failed associations from state to force a clean re-association.
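As a rough illustration of the filtering rule, the sketch below drops associations whose status is `rejected` or `failed` and keeps everything else. The `endpoint` type, the function name, and the non-terminal status values in the example are assumptions, not the provider's real model.

```go
package main

import "fmt"

// endpoint is a hypothetical, simplified association record.
type endpoint struct {
	ID     string
	Status string
}

// keepInState returns the associations that should remain in state.
// rejected and failed associations are dropped so the next apply
// re-associates them cleanly.
func keepInState(eps []endpoint) []endpoint {
	kept := make([]endpoint, 0, len(eps))
	for _, ep := range eps {
		switch ep.Status {
		case "rejected", "failed":
			continue // skip terminal associations
		}
		kept = append(kept, ep)
	}
	return kept
}

func main() {
	eps := []endpoint{
		{ID: "ep-1", Status: "linked"}, // example non-terminal value (assumed)
		{ID: "ep-2", Status: "failed"},
		{ID: "ep-3", Status: "rejected"},
	}
	fmt.Println(keepInState(eps))
}
```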

Status exposure, schema, and docs

Added a computed status attribute to the resource and data source, populated from the API.
Documented terminal vs transient states in the resource and data source docs.

Compatibility

The change only activates when the API reports a status. Against an older control plane the provider falls back to boolean polling, so behavior is unchanged. Recommended rollout is control plane first, then this provider release.
The computed status attribute follows the same pattern as the existing current_state attribute on the cluster resource (plain computed, planned as known-after-apply), so enable/disable toggles do not cause inconsistent-result errors.

Type of Change

Bug fix (non-breaking change which fixes an issue).
New feature (non-breaking change which adds functionality).
Breaking change (fix or feature that would cause existing functionality to not work as expected).
This change updates the ci/cd workflow.
Documentation fix/enhancement.

Manual Testing Approach

How was this change tested and do you have evidence? (REQUIRED: Select at least 1)

Manually tested
Unit tested
Acceptance tested
Unable to test / will not test (Please provide comments in section below)

Testing

Unit tests

go vet ./... is clean and make test passes (run via make test so OPENAPI_SPEC_URL is exported; a bare go test leaves schema-description tests without the spec and is expected to fail). Coverage includes the new status-aware poll loop, the terminal-failure handling, and the failed/rejected association removal.

Acceptance test

TestAccPrivateEndpointServiceEnableDisable passes against real Capella (enable, read, disable):

--- PASS: TestAccPrivateEndpointServiceEnableDisable (273.89s)

The other failures in the suite run are unrelated environment/backend issues (bucket-quota exhaustion on the shared cluster and transient 500s during bucket creation), not changes in this PR.

### Manual end-to-end against a local control plane

Validated the full lifecycle against a local `cbclocal` control plane (which exposes the new `status` field), using a dev-override build of this provider.

**Enable (`terraform apply`)**
- Plan renders `status = (known after apply)`, confirming the computed attribute is planned as unknown, so it cannot trigger a "provider produced inconsistent result after apply" error when the value changes.
- The status-aware poll loop polls the service status until the backend job completes and converges (creation completed after ~5m31s).
- `terraform show` reports the resolved state:

      resource "couchbase-capella_private_endpoint_service" "svc" {
          cluster_id      = "f7c7e93e-554e-46bc-bcd0-1d7cd7eff9cb"
          enabled         = true
          organization_id = "adb4fb4c-1d98-4287-ac33-230742d2cc76"
          project_id      = "c09035e8-3971-4b79-a6ec-bccd9056041f"
          status          = "enabled"
      }

**Disable (`terraform destroy`)**
- Refresh succeeds and the service is disabled cleanly (destroy completed after ~1m), ending with `Destroy complete! Resources: 1 destroyed.`

This exercises status exposure, the computed status attribute, the 30s status-aware polling, and the enable/disable lifecycle against a real backend with no errors.

Terminal-failure path

A healthy cluster enables successfully, so the enableFailed recovery path is not reachable from the happy-path apply above. It is covered by unit tests (fail-fast on enableFailed, DELETE-triggered cleanup, RemoveResource, and the cleanup-failure escalation), and can be exercised manually with a mock control plane returning {"enabled": false, "status": "enableFailed"} on GET and 202 on DELETE, or by inducing an enable-job failure in the backend. Verified behavior in that case: apply fails fast (within one poll interval, not the 60-minute timeout), the resource is removed from state, and a subsequent terraform apply performs a clean re-create with no manual intervention.
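One way to stand up the mock control plane mentioned above is Go's `net/http/httptest`. The sketch below is an assumption-laden illustration: it answers on any path (the real API routes are not reproduced here), returns the enableFailed payload on GET and 202 on DELETE, and rejects every other method, so it would need extending (for example to accept the enable POST) before a provider could be pointed at it.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// newMockControlPlane starts a test server that always reports a
// terminal enableFailed status and accepts DELETE with 202.
func newMockControlPlane() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		switch r.Method {
		case http.MethodGet:
			w.Header().Set("Content-Type", "application/json")
			fmt.Fprint(w, `{"enabled": false, "status": "enableFailed"}`)
		case http.MethodDelete:
			w.WriteHeader(http.StatusAccepted)
		default:
			w.WriteHeader(http.StatusMethodNotAllowed)
		}
	}))
}

// probe sends one request and returns the status code and body.
func probe(method, url string) (int, string, error) {
	req, err := http.NewRequest(method, url, nil)
	if err != nil {
		return 0, "", err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return resp.StatusCode, string(body), err
}

func main() {
	srv := newMockControlPlane()
	defer srv.Close()
	for _, method := range []string{http.MethodGet, http.MethodDelete} {
		code, body, err := probe(method, srv.URL)
		fmt.Println(method, code, body, err)
	}
}
```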

Further comments

github-actions · 2026-06-16T06:26:55Z

🚨 PR title "AV-134634: [FEAT] - Fail fast and auto-recover from private endpoint service failed states" does not match the required format.

Requirements:
- Must start with [AV-XXXXX] where X is any number of digits
- After the bracket, must start with a Verb (Add, Update, Fix, etc.)
- The Verb must start with an uppercase letter

Expected format: [AV-XXXXX] Verb ...
Example: [AV-98659] Implement Cluster On/Off feature
Valid verbs: Add, Update, Fix, Implement, Remove, Refactor, etc.
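For illustration, the three stated requirements can be approximated with a single regular expression. This is a guess at the shape of the check, not the bot's actual rule: the pattern and function name are assumptions, and it does not validate the verb against the list of valid verbs.

```go
package main

import (
	"fmt"
	"regexp"
)

// titlePattern approximates the stated rules: a leading [AV-<digits>],
// one space, then a word that starts with an uppercase letter followed
// by lowercase letters. The real check may differ.
var titlePattern = regexp.MustCompile(`^\[AV-\d+\] [A-Z][a-z]+\b`)

// validTitle reports whether a PR title matches the approximate format.
func validTitle(title string) bool {
	return titlePattern.MatchString(title)
}

func main() {
	titles := []string{
		"AV-134634: [FEAT] - Fail fast and auto-recover from private endpoint service failed states",
		"[AV-134634]: FIX - Fail fast and auto-recover from private endpoint service failed states",
		"[AV-134634] Fix - Fail fast and auto-recover from private endpoint service failed states",
		"[AV-98659] Implement Cluster On/Off feature",
	}
	for _, t := range titles {
		fmt.Println(validTitle(t), t)
	}
}
```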

github-actions · 2026-06-16T06:55:40Z

🚨 PR title "AV-134634: FIX - Fail fast and auto-recover from private endpoint service failed states" does not match the required format.

Requirements:
- Must start with [AV-XXXXX] where X is any number of digits
- After the bracket, must start with a Verb (Add, Update, Fix, etc.)
- The Verb must start with an uppercase letter

Expected format: [AV-XXXXX] Verb ...
Example: [AV-98659] Implement Cluster On/Off feature
Valid verbs: Add, Update, Fix, Implement, Remove, Refactor, etc.

IsaacLambat

Mostly nits, though the other reviewers' comments make sense.

Code comments are quite excessive at times, so I think lots can be trimmed or removed. Aim for clean code that describes what is happening, so we don't need comments explaining what's happening.

Copilot

Pull request overview

This PR makes the private endpoint service resource/data source “status-aware” by consuming a new optional lifecycle status from the Capella API, enabling fail-fast behavior on terminal failed states and automated cleanup/remediation paths to avoid permanently wedged Terraform state.

Changes:

Extend the private endpoint service status API model to include an optional status lifecycle field and surface it as a computed status attribute in the resource/data source.
Rewrite the enable/disable poll loop to use lifecycle status when present (fail fast on enableFailed/disableFailed, keep polling on transient states, fallback to boolean on older control planes), and add recovery logic for terminal enable failure (trigger DELETE cleanup + remove from state).
Improve behavior for failed private endpoint associations by removing failed/rejected associations from state to force clean re-association; update docs accordingly.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| internal/schema/private_endpoint_service.go | Adds `status` to the Terraform model for private endpoint service. |
| internal/resources/private_endpoints.go | Removes terminal `failed`/`rejected` associations from state during Read. |
| internal/resources/private_endpoint_service.go | Adds lifecycle status constants, status-aware polling, terminal failure handling (cleanup/remove state), and exposes status in state. |
| internal/resources/private_endpoint_service_test.go | Adds unit coverage for status-aware polling and terminal failure cleanup/removal logic. |
| internal/resources/private_endpoint_service_schema.go | Adds computed `status` attribute and documents terminal enable failure behavior in schema description. |
| internal/errors/errors.go | Adds typed terminal errors for enable/disable failed states. |
| internal/datasources/private_endpoint_service.go | Populates computed `status` from the API when available. |
| internal/datasources/private_endpoint_service_schema.go | Adds computed `status` attribute to the data source schema. |
| internal/api/private_endpoint_service.go | Adds optional `Status *string` to the API response model. |
| docs/resources/private_endpoint_service.md | Documents `status` and enableFailed recovery behavior for the resource. |
| docs/resources/cluster.md | Improves `deletion_protection` documentation text. |
| docs/data-sources/private_endpoint_service.md | Documents `status` for the data source. |
| docs/data-sources/free_tier_clusters.md | Improves `deletion_protection` documentation text. |
| docs/data-sources/clusters.md | Improves `deletion_protection` documentation text. |

Comments suppressed due to low confidence (1)

internal/resources/private_endpoint_service.go:277

Missing space in this error detail string produces a garbled message ("enablingprivate...").

		resp.Diagnostics.AddError(
			"Error "+status+" private endpoint service",
			"Error "+status+"private endpoint service, unexpected error: "+err.Error(),
		)

PaulomeeCb · 2026-06-18T22:35:20Z

 	}
-	diags = resp.State.Set(ctx, &refreshedState)
+	if refreshedState.Status.ValueString() == statusEnableFailed {

Read removes the resource on enableFailed without running cleanup - unlike Create/Update, the next apply will re-POST enable onto orphaned infra. Should this route through cleanupFailedEnable too?

cloudy-vishnu · Jun 19, 2026

Read is called during plan/refresh and must be side-effect free. Issuing a teardown (DELETE on cloud infra) from Read is a Terraform anti-pattern and would surprise users by destroying things during a plan.

saiakhil2012

LGTM

[FEAT]: added failed cleanup and status fields

2104300

github-actions Bot added the bug and enhancement labels Jun 16, 2026

cloudy-vishnu changed the title ~~[AV-134634]: [FEAT] - Fail fast and auto-recover from private endpoint service failed states~~ [AV-134634]: FIX - Fail fast and auto-recover from private endpoint service failed states Jun 16, 2026

cloudy-vishnu changed the title ~~[AV-134634]: FIX - Fail fast and auto-recover from private endpoint service failed states~~ [AV-134634] Fix - Fail fast and auto-recover from private endpoint service failed states Jun 16, 2026

cloudy-vishnu marked this pull request as ready for review June 16, 2026 21:09

cloudy-vishnu requested a review from a team as a code owner June 16, 2026 21:09

couchbasecloud deleted a comment from factory-droid Bot Jun 16, 2026

saiakhil2012 reviewed Jun 17, 2026

Comment thread internal/resources/private_endpoint_service.go

Comment thread docs/resources/private_endpoint_service.md

IsaacLambat reviewed Jun 17, 2026

Comment thread docs/data-sources/private_endpoint_service.md

Comment thread internal/resources/private_endpoint_service.go Outdated

Comment thread internal/resources/private_endpoint_service.go Outdated

Comment thread internal/resources/private_endpoint_service.go Outdated

cloudy-vishnu added 3 commits June 18, 2026 18:53

Merge branch 'main' of github.com:couchbasecloud/terraform-provider-c…

5c65346

…ouchbase-capella into AV-134634-fix-terraform-polling-failed-state

[FEAT]: addressed comments

95710b2

[FIX]: fixed tests

239ec1e

stanleefdz requested a review from Copilot June 18, 2026 14:04

Copilot started reviewing on behalf of stanleefdz June 18, 2026 14:04

Copilot AI reviewed Jun 18, 2026

[FEAT]: addressed comments

2b692bc

cloudy-vishnu requested review from IsaacLambat and saiakhil2012 June 18, 2026 14:36

cloudy-vishnu self-assigned this Jun 18, 2026

PaulomeeCb reviewed Jun 18, 2026

Comment thread internal/resources/private_endpoint_service.go Outdated

PaulomeeCb reviewed Jun 18, 2026

Comment thread internal/resources/private_endpoint_service.go

PaulomeeCb reviewed Jun 18, 2026

Comment thread internal/resources/private_endpoint_service.go Outdated

[FEAT]: addressed comments

ec6ed8e

cloudy-vishnu requested a review from PaulomeeCb June 19, 2026 02:23

saiakhil2012 approved these changes Jun 20, 2026

5 participants