fix(crud-service): self-recover from transient postgres pool init errors (#911) #1087
Merged
…ors (#911)

BaseRepository.check_pool_health short-circuited on a stale _pool_init_error: a transient startup IMDS hiccup permanently bricked /ready into 503, even though a fresh attempt could connect successfully.

Verified live on holidaypeakhub405-dev-aks: TCP to Postgres works, DefaultAzureCredential gets a 1980-byte token, and asyncpg.connect + SELECT 1 succeeds from the same pod where /ready returns 'TimeoutError: ; latest: TimeoutError: ' indefinitely.

The failure mode masked the real issue tracked as #911 (commit-rendered-manifests). Flux IS reconciling main@HEAD, but the crud Kustomization stays Progressing forever because the deployment's readiness probe is stuck. The 'holiday-peak-agents' Kustomization waits on the crud dependency, so the entire GitOps pipeline appears bricked.

Fix: always re-attempt initialize_pool() when the pool is absent. The cached _pool_init_error is now reported only when the retry itself fails. On success, the cached error is cleared so subsequent callers see a clean state.

Tests: 4 new unit tests covering the happy path, stale-error recovery, the retry failure path, and an explicit no-short-circuit assertion. Existing test_health.py recovery tests continue to pass (10/10 health tests, 265/265 crud-service tests). Pre-push gate: 705 passed across lib + apps.
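The retry behavior described in the commit message above can be illustrated with a minimal sketch. This is not the code in base.py: the class name PoolHealthSketch, the injected connect factory, and the shape of the returned dict are assumptions made for illustration. Only the names initialize_pool, check_pool_health and _pool_init_error come from the commit message.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class PoolHealthSketch:
    """Hypothetical stand-in for BaseRepository; covers only the health-check path."""

    def __init__(self, connect: Callable[[], Awaitable[object]]) -> None:
        self._connect = connect  # assumed factory that opens the pool
        self._pool: Optional[object] = None
        self._pool_init_error: Optional[str] = None

    async def initialize_pool(self) -> None:
        try:
            self._pool = await self._connect()
        except Exception as exc:
            # Cache the failure so it can be reported if this attempt was the retry.
            self._pool_init_error = f"{type(exc).__name__}: {exc}"
            raise
        # A successful attempt clears any stale error from an earlier failure.
        self._pool_init_error = None

    async def check_pool_health(self) -> dict:
        if self._pool is None:
            # Always retry when the pool is absent; never short-circuit
            # on a cached error from a previous attempt.
            try:
                await self.initialize_pool()
            except Exception:
                return {"status": "unhealthy", "detail": self._pool_init_error}
        return {"status": "healthy"}


async def _flaky_connect_demo() -> None:
    attempts = {"n": 0}

    async def connect() -> object:
        attempts["n"] += 1
        if attempts["n"] == 1:
            raise asyncio.TimeoutError()  # simulated transient startup failure
        return object()

    repo = PoolHealthSketch(connect)
    print(await repo.check_pool_health())  # first call: retry fails, error reported
    print(await repo.check_pool_health())  # second call: retry succeeds


if __name__ == "__main__":
    asyncio.run(_flaky_connect_demo())
```

In this sketch a bare asyncio.TimeoutError has an empty message, so the formatted detail ends in a colon followed by nothing. That is one plausible reason the live /ready detail reads 'TimeoutError: ' with no further text. A real health check would likely also run a query such as SELECT 1 against the pool, which is omitted here.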
This was referenced May 9, 2026
Cataldir added a commit that referenced this pull request May 10, 2026
…) (#1091)

The chart's postgres-auth-preflight init container at .kubernetes/chart/templates/deployment.yaml lines 70-156 installs psql/jq via 'apk add', assuming the base image (mcr.microsoft.com/azure-cli:latest) is Alpine. The image is now Mariner-based, so 'apk' is missing -> Exit 127 -> CrashLoopBackOff -> CRUD pod never starts.

This triggered an outage during PR #1090 (Pattern A Helm takeover): the legacy live pod was using a cached Alpine layer (5h+ uptime), but the freshly pulled image broke when Helm's rolling update created a new ReplicaSet. Recovery required manually stripping initContainers from the live Deployment and suspending the HelmRelease.

Fix: flip preflight.postgresAuth.enabled from true to false in the HelmRelease values for crud-service. Safe because BaseRepository.check_pool_health self-recovers from transient pool init errors per commit 811fdbe (#911 / PR #1087), so the preflight gate is no longer load-bearing.

Follow-up issues to file:
(1) chart fix to support multi-distro package install (apk/tdnf/apt-get) or pin an Alpine-tagged image;
(2) ADR-017 addendum documenting the prune-vs-Helm-adopt race that hit during this incident.

Verified: cluster currently serving CRUD /health 200 OK and all 26 agents 200 OK after manual recovery. This PR brings GitOps state in sync with live so the HelmRelease can be unsuspended without regression.
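Follow-up (1) amounts to probing for whichever package manager the image ships instead of assuming apk. The selection logic might look like the sketch below. It is written in Python purely for illustration, since the actual init container is a shell script in the chart. The function name install_command and the INSTALLERS table are hypothetical, and package names are left to the caller because they differ between distros.

```python
import shutil
from typing import Callable, Dict, List, Optional, Sequence

# Install-command prefixes for the three package managers named in the follow-up.
# Hypothetical table: a real chart fix may prefer different flags.
INSTALLERS: Dict[str, List[str]] = {
    "apk": ["apk", "add", "--no-cache"],
    "tdnf": ["tdnf", "install", "-y"],
    "apt-get": ["apt-get", "install", "-y"],
}


def install_command(
    packages: Sequence[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the install command for the first package manager found on PATH."""
    for name, prefix in INSTALLERS.items():
        if which(name) is not None:
            return [*prefix, *packages]
    # Fail with a clear message rather than a bare "command not found" (exit 127).
    raise RuntimeError("no supported package manager found (tried apk, tdnf, apt-get)")


if __name__ == "__main__":
    try:
        print(install_command(["jq"]))
    except RuntimeError as err:
        print(err)
```

The which parameter is injectable so the lookup can be exercised without depending on the host image. Pinning an Alpine-tagged image, the other option the commit message mentions, would avoid the probing entirely.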
Closes #911

Live diagnosis

Verified on holidaypeakhub405-dev-aks (running cluster):

- Flux GitRepository IS reconciled to current main@6b75f40d. The original commit-rendered-manifests failure described in #911 ([P1] deploy-azd: commit-rendered-manifests step fails, blocking Flux reconciliation and AKS image updates) is no longer the active blocker.
- The holiday-peak-gitops-holiday-peak-crud Kustomization is permanently stuck in Reconciling/Progressing because the crud-service Deployment never goes Ready.
- The holiday-peak-gitops-holiday-peak-agents Kustomization is False with dependency 'flux-system/holiday-peak-gitops-holiday-peak-crud' is not ready, so the entire downstream pipeline is gated.
- The CRUD pod runs and /health returns 200, but /ready returns 503 with: {"status":"degraded","checks":{"postgres":{"status":"unhealthy","detail":"TimeoutError: ; latest: TimeoutError: "}}}
- Inside the same pod, a fresh DefaultAzureCredential returns a 1980-byte Entra token in <1s and asyncpg.connect(...) followed by SELECT 1 succeeds. Postgres is reachable; only the in-process pool state is broken.

Root cause

BaseRepository.check_pool_health() short-circuited on a cached _pool_init_error. A single transient IMDS hiccup at startup (asyncio.TimeoutError) populated _pool_init_error and permanently bricked the readiness check, even though every subsequent retry would have succeeded.

Fix

apps/crud-service/src/crud_service/repositories/base.py: check_pool_health now always re-attempts initialize_pool() when the pool is absent. The cached error is reported only when the retry itself fails, and is cleared on success.

Tests

apps/crud-service/tests/unit/test_base_repository_pool_health.py:

- _pool_init_error is cleared on successful retry (regression for #911)
- initialize_pool IS called even when an error is cached

Existing test_health.py recovery tests continue to pass (10/10).
crud-service tests: 265/265.
Pre-push gate: 705 passed across lib + apps; isort/black/pylint/mypy/governance-link/event-schema all green.

Operational notes

After merge + redeploy, the cluster operator can restart the deployment to pick up the fix.

Once the CRUD pod is Ready, the Flux Kustomization will move to Ready=True and the holiday-peak-agents Kustomization will unblock automatically.

The original commit-rendered-manifests failure described in #911 is no longer reproducible (Flux is on current main); if it returns, that's a separate workflow regression to track in a new issue.

ADR / governance

… Wait: true, Timeout: 10m0s).
… BaseRepository.