Skip to content

fix(crud-service): self-recover from transient postgres pool init errors (#911)#1087

Merged
Cataldir merged 1 commit into
mainfrom
bug/911-postgres-pool-recovery
May 9, 2026
Merged

fix(crud-service): self-recover from transient postgres pool init errors (#911)#1087
Cataldir merged 1 commit into
mainfrom
bug/911-postgres-pool-recovery

Conversation

@Cataldir

@Cataldir Cataldir commented May 9, 2026

Copy link
Copy Markdown
Contributor

Closes #911

Live diagnosis

Verified on holidaypeakhub405-dev-aks (running cluster):

  • Flux GitRepository IS reconciled to current main@6b75f40d — the original commit-rendered-manifests failure described in [P1] deploy-azd: commit-rendered-manifests step fails, blocking Flux reconciliation and AKS image updates #911 is no longer the active blocker.

  • holiday-peak-gitops-holiday-peak-crud Kustomization is permanently stuck in Reconciling/Progressing because the crud-service Deployment never goes Ready.

  • holiday-peak-gitops-holiday-peak-agents Kustomization is False with dependency 'flux-system/holiday-peak-gitops-holiday-peak-crud' is not ready, so the entire downstream pipeline is gated.

  • The CRUD pod runs and /health returns 200, but /ready returns 503 with:

    {"status":"degraded","checks":{"postgres":{"status":"unhealthy","detail":"TimeoutError: ; latest: TimeoutError: "}}}
  • Inside the same pod, a fresh DefaultAzureCredential returns a 1980-byte Entra token in <1s and asyncpg.connect(...) followed by SELECT 1 succeeds. Postgres is reachable; only the in-process pool state is broken.

Root cause

BaseRepository.check_pool_health() short-circuited on a cached _pool_init_error:

if cls._pool_init_error and cls._pool is None:
    return "unhealthy", cls._pool_init_error

A single transient IMDS hiccup at startup (asyncio.TimeoutError) populated _pool_init_error and permanently bricked the readiness check, even though every subsequent retry would have succeeded.

Fix

apps/crud-service/src/crud_service/repositories/base.pycheck_pool_health now always re-attempts initialize_pool() when the pool is absent. The cached error is reported only when the retry itself fails, and is cleared on success.

Tests

  • 4 new unit tests in apps/crud-service/tests/unit/test_base_repository_pool_health.py:
  • All 6 existing test_health.py recovery tests continue to pass (10/10).
  • Full crud-service suite: 265/265.
  • Pre-push gate: 705 passed across lib + apps; isort/black/pylint/mypy/governance-link/event-schema all green.

Operational notes

  • After merge + redeploy, the cluster operator can restart the deployment to pick up the fix:

    kubectl rollout restart deployment/crud-service-crud-service -n holiday-peak-crud
    
  • Once the CRUD pod is Ready, the Flux Kustomization will move to Ready=True and the holiday-peak-agents Kustomization will unblock automatically.

  • The original commit-rendered-manifests failure described in [P1] deploy-azd: commit-rendered-manifests step fails, blocking Flux reconciliation and AKS image updates #911 is no longer reproducible (Flux is on current main); if it returns, that's a separate workflow regression to track in a new issue.

ADR / governance

  • ADR-033 (Flux GitOps deployment model) — fix preserves the readiness contract used by the Kustomization health check (Wait: true, Timeout: 10m0s).
  • No public surface change to BaseRepository.

…ors (#911)

BaseRepository.check_pool_health short-circuited on a stale _pool_init_error: a transient startup IMDS hiccup permanently bricked /ready into 503, even though a fresh attempt could connect successfully. Verified live on holidaypeakhub405-dev-aks: TCP to Postgres works, DefaultAzureCredential gets a 1980-byte token, asyncpg.connect + SELECT 1 succeeds from the same pod where /ready returns 'TimeoutError: ; latest: TimeoutError: ' indefinitely.

Failure mode masked the real issue tracked as #911 (commit-rendered-manifests). Flux IS reconciling main@HEAD, but the crud Kustomization stays Progressing forever because the deployment's readiness probe is stuck. The 'holiday-peak-agents' Kustomization waits on the crud dependency, so the entire GitOps pipeline appears bricked.

Fix: always re-attempt initialize_pool() when the pool is absent. Cached _pool_init_error is now reported only when the retry itself fails. On success, clear the cached error so subsequent callers see a clean state.

Tests: 4 new unit tests covering happy path, stale-error recovery, retry failure path, and explicit no-short-circuit assertion. Existing test_health.py recovery tests continue to pass (10/10 health tests, 265/265 crud-service tests). Pre-push gate: 705 passed across lib + apps.
@Cataldir Cataldir enabled auto-merge (squash) May 9, 2026 15:25
@Cataldir Cataldir merged commit 811fdbe into main May 9, 2026
12 of 13 checks passed
@Cataldir Cataldir deleted the bug/911-postgres-pool-recovery branch May 9, 2026 15:28
Cataldir added a commit that referenced this pull request May 10, 2026
…) (#1091)

The chart's postgres-auth-preflight init container at .kubernetes/chart/templates/deployment.yaml lines 70-156 installs psql/jq via 'apk add', assuming the base image (mcr.microsoft.com/azure-cli:latest) is Alpine. The image is now Mariner-based, so 'apk' is missing -> Exit 127 -> CrashLoopBackOff -> CRUD pod never starts.

Triggered an outage during PR #1090 (Pattern A Helm takeover): the legacy live pod was using a cached Alpine layer (5h+ uptime), but the freshly pulled image broke when Helm's rolling update created a new ReplicaSet. Recovery required manually stripping initContainers from the live Deployment and suspending the HelmRelease.

Fix: flip preflight.postgresAuth.enabled from true to false in the HelmRelease values for crud-service. Safe because BaseRepository.check_pool_health self-recovers from transient pool init errors per commit 811fdbe (#911 / PR #1087) - the preflight gate is no longer load-bearing.

Follow-up issues to file: (1) chart fix to support multi-distro package install (apk/tdnf/apt-get) or pin an Alpine-tagged image; (2) ADR-017 addendum documenting the prune-vs-Helm-adopt race that hit during this incident.

Verified: cluster currently serving CRUD /health 200 OK and all 26 agents 200 OK after manual recovery. This PR brings GitOps state in sync with live so the HelmRelease can be unsuspended without regression.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[P1] deploy-azd: commit-rendered-manifests step fails, blocking Flux reconciliation and AKS image updates

1 participant