fix(crud-service): self-recover from transient postgres pool init errors (#911) by Cataldir · Pull Request #1087 · Azure-Samples/holiday-peak-hub

Cataldir · 2026-05-09T15:25:30Z

Closes #911

Live diagnosis

Verified on holidaypeakhub405-dev-aks (running cluster):

- Flux GitRepository IS reconciled to current main@6b75f40d, so the original commit-rendered-manifests failure described in #911 is no longer the active blocker.
- The holiday-peak-gitops-holiday-peak-crud Kustomization is permanently stuck in Reconciling/Progressing because the crud-service Deployment never goes Ready.
- The holiday-peak-gitops-holiday-peak-agents Kustomization reports Ready=False with the message dependency 'flux-system/holiday-peak-gitops-holiday-peak-crud' is not ready, so the entire downstream pipeline is gated.

The CRUD pod runs and /health returns 200, but /ready returns 503 with:

{"status":"degraded","checks":{"postgres":{"status":"unhealthy","detail":"TimeoutError: ; latest: TimeoutError: "}}}

Inside the same pod, a fresh DefaultAzureCredential returns a 1980-byte Entra token in <1s and asyncpg.connect(...) followed by SELECT 1 succeeds. Postgres is reachable; only the in-process pool state is broken.
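For triage, the degraded /ready body quoted above can be decoded to list only the failing checks. This is a minimal sketch using the payload exactly as captured; the helper name `unhealthy_checks` is hypothetical, and it assumes that passing checks report a status of "healthy", which the captured output does not confirm.

```python
import json

# The /ready response body captured from the CRUD pod.
READY_BODY = (
    '{"status":"degraded","checks":{"postgres":{"status":"unhealthy",'
    '"detail":"TimeoutError: ; latest: TimeoutError: "}}}'
)


def unhealthy_checks(body: str) -> dict:
    """Map each failing check name to its detail string.

    Assumption: a passing check carries status "healthy".
    """
    payload = json.loads(body)
    return {
        name: check.get("detail", "")
        for name, check in payload.get("checks", {}).items()
        if check.get("status") != "healthy"
    }


failing = unhealthy_checks(READY_BODY)
print(failing)
```

The empty text after each `TimeoutError:` is consistent with `asyncio.TimeoutError` being raised without a message, which is why the detail string carries no further context.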

Root cause

BaseRepository.check_pool_health() short-circuited on a cached _pool_init_error:

if cls._pool_init_error and cls._pool is None:
    return "unhealthy", cls._pool_init_error

A single transient IMDS hiccup at startup (asyncio.TimeoutError) populated _pool_init_error and permanently bricked the readiness check, even though every subsequent retry would have succeeded.

Fix

apps/crud-service/src/crud_service/repositories/base.py — check_pool_health now always re-attempts initialize_pool() when the pool is absent. The cached error is reported only when the retry itself fails, and is cleared on success.
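The behaviour described above can be sketched roughly as follows. This is an illustrative stand-in, not the code from base.py: the names `BaseRepository`, `_pool`, `_pool_init_error`, `initialize_pool` and `check_pool_health` come from this PR, while the method bodies, the return shape, and the error string format are assumptions. The real method presumably also probes the live pool and builds the "latest:" detail string, both of which are omitted here.

```python
import asyncio
from typing import Optional, Tuple


class BaseRepository:
    """Illustrative stand-in; not the real crud_service class."""

    _pool: Optional[object] = None
    _pool_init_error: Optional[str] = None

    @classmethod
    async def initialize_pool(cls) -> None:
        # Placeholder: the real method would create the Postgres pool.
        cls._pool = object()

    @classmethod
    async def check_pool_health(cls) -> Tuple[str, Optional[str]]:
        if cls._pool is None:
            # Always retry when the pool is absent, even if an earlier
            # attempt left an error cached.
            try:
                await cls.initialize_pool()
            except Exception as exc:
                # Report the error only because this retry itself failed.
                cls._pool_init_error = f"{type(exc).__name__}: {exc}"
                return "unhealthy", cls._pool_init_error
        # The pool exists, so drop any stale error from a previous attempt.
        cls._pool_init_error = None
        return "healthy", None


# Simulate the incident: a stale startup error is cached, the pool is absent.
BaseRepository._pool_init_error = "TimeoutError: "
status, detail = asyncio.run(BaseRepository.check_pool_health())
print(status, detail)
```

With this shape, a cached error never decides the outcome on its own; only the result of the most recent initialization attempt does.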

Tests

4 new unit tests in apps/crud-service/tests/unit/test_base_repository_pool_health.py:
- happy path with initialized pool
- stale _pool_init_error is cleared on successful retry (regression for #911)
- retry failure reports the latest exception
- explicit no-short-circuit assertion (initialize_pool IS called even when an error is cached)
All 6 existing test_health.py recovery tests continue to pass (10/10 health tests).
Full crud-service suite: 265/265.
Pre-push gate: 705 passed across lib + apps; isort/black/pylint/mypy/governance-link/event-schema all green.
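The retry-failure and no-short-circuit cases listed above could be exercised along these lines. This is a sketch under stated assumptions, not the contents of test_base_repository_pool_health.py: `FakeRepository` and `run_no_short_circuit_check` are hypothetical names, and the stand-in class only mimics the retry-on-absent-pool behaviour described in this PR.

```python
import asyncio
from unittest.mock import AsyncMock, patch


class FakeRepository:
    """Hypothetical stand-in that retries whenever the pool is absent."""

    _pool = None
    _pool_init_error = None

    @classmethod
    async def initialize_pool(cls):
        cls._pool = object()

    @classmethod
    async def check_pool_health(cls):
        if cls._pool is None:
            try:
                await cls.initialize_pool()
            except Exception as exc:
                cls._pool_init_error = f"{type(exc).__name__}: {exc}"
                return "unhealthy", cls._pool_init_error
        cls._pool_init_error = None
        return "healthy", None


def run_no_short_circuit_check():
    """Run the health check with a stale cached error and a failing retry."""
    FakeRepository._pool = None
    FakeRepository._pool_init_error = "TimeoutError: "
    failing_init = AsyncMock(side_effect=ConnectionRefusedError("db down"))
    with patch.object(FakeRepository, "initialize_pool", new=failing_init):
        result = asyncio.run(FakeRepository.check_pool_health())
    # await_count shows whether the retry was actually attempted.
    return result, failing_init.await_count


result, calls = run_no_short_circuit_check()
print(result, calls)
```

Asserting on the await count is what separates the fixed behaviour from the old one: a short-circuiting implementation would also return "unhealthy", but without ever awaiting the initializer.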

Operational notes

After merge + redeploy, the cluster operator can restart the deployment to pick up the fix:
```
kubectl rollout restart deployment/crud-service-crud-service -n holiday-peak-crud
```
Once the CRUD pod is Ready, the Flux Kustomization will move to Ready=True and the holiday-peak-agents Kustomization will unblock automatically.
The original commit-rendered-manifests failure described in #911 is no longer reproducible (Flux is on current main); if it returns, that is a separate workflow regression to track in a new issue.

ADR / governance

ADR-033 (Flux GitOps deployment model) — fix preserves the readiness contract used by the Kustomization health check (Wait: true, Timeout: 10m0s).
No public surface change to BaseRepository.

…ors (#911)

BaseRepository.check_pool_health short-circuited on a stale _pool_init_error: a transient startup IMDS hiccup permanently bricked /ready into 503, even though a fresh attempt could connect successfully.

Verified live on holidaypeakhub405-dev-aks: TCP to Postgres works, DefaultAzureCredential gets a 1980-byte token, asyncpg.connect + SELECT 1 succeeds from the same pod where /ready returns 'TimeoutError: ; latest: TimeoutError: ' indefinitely.

Failure mode masked the real issue tracked as #911 (commit-rendered-manifests). Flux IS reconciling main@HEAD, but the crud Kustomization stays Progressing forever because the deployment's readiness probe is stuck. The 'holiday-peak-agents' Kustomization waits on the crud dependency, so the entire GitOps pipeline appears bricked.

Fix: always re-attempt initialize_pool() when the pool is absent. Cached _pool_init_error is now reported only when the retry itself fails. On success, clear the cached error so subsequent callers see a clean state.

Tests: 4 new unit tests covering happy path, stale-error recovery, retry failure path, and explicit no-short-circuit assertion. Existing test_health.py recovery tests continue to pass (10/10 health tests, 265/265 crud-service tests). Pre-push gate: 705 passed across lib + apps.

…) (#1091)

The chart's postgres-auth-preflight init container at .kubernetes/chart/templates/deployment.yaml lines 70-156 installs psql/jq via 'apk add', assuming the base image (mcr.microsoft.com/azure-cli:latest) is Alpine. The image is now Mariner-based, so 'apk' is missing -> Exit 127 -> CrashLoopBackOff -> CRUD pod never starts.

Triggered an outage during PR #1090 (Pattern A Helm takeover): the legacy live pod was using a cached Alpine layer (5h+ uptime), but the freshly pulled image broke when Helm's rolling update created a new ReplicaSet. Recovery required manually stripping initContainers from the live Deployment and suspending the HelmRelease.

Fix: flip preflight.postgresAuth.enabled from true to false in the HelmRelease values for crud-service. Safe because BaseRepository.check_pool_health self-recovers from transient pool init errors per commit 811fdbe (#911 / PR #1087) - the preflight gate is no longer load-bearing.

Follow-up issues to file: (1) chart fix to support multi-distro package install (apk/tdnf/apt-get) or pin an Alpine-tagged image; (2) ADR-017 addendum documenting the prune-vs-Helm-adopt race that hit during this incident.

Verified: cluster currently serving CRUD /health 200 OK and all 26 agents 200 OK after manual recovery. This PR brings GitOps state in sync with live so the HelmRelease can be unsuspended without regression.

Cataldir enabled auto-merge (squash) May 9, 2026 15:25

Cataldir merged commit 811fdbe into main May 9, 2026
12 of 13 checks passed

Cataldir deleted the bug/911-postgres-pool-recovery branch May 9, 2026 15:28

This was referenced May 9, 2026

[P1] deploy-azd: commit-rendered-manifests step fails, blocking Flux reconciliation and AKS image updates #911

Closed

fix(crud): disable broken postgres-auth preflight init container (#1088) #1091

Merged

Cataldir merged 1 commit into main from bug/911-postgres-pool-recovery
