Commit ffcb500

fix(crud): disable broken postgres-auth preflight init container (#1088) (#1091)
The chart's postgres-auth-preflight init container at .kubernetes/chart/templates/deployment.yaml lines 70-156 installs psql/jq via 'apk add', assuming the base image (mcr.microsoft.com/azure-cli:latest) is Alpine. The image is now Mariner-based, so 'apk' is missing -> Exit 127 -> CrashLoopBackOff -> CRUD pod never starts.

Triggered an outage during PR #1090 (Pattern A Helm takeover): the legacy live pod was using a cached Alpine layer (5h+ uptime), but the freshly pulled image broke when Helm's rolling update created a new ReplicaSet. Recovery required manually stripping initContainers from the live Deployment and suspending the HelmRelease.

Fix: flip preflight.postgresAuth.enabled from true to false in the HelmRelease values for crud-service. Safe because BaseRepository.check_pool_health self-recovers from transient pool init errors per commit 811fdbe (#911 / PR #1087) - the preflight gate is no longer load-bearing.

Follow-up issues to file:
(1) chart fix to support multi-distro package install (apk/tdnf/apt-get) or pin an Alpine-tagged image;
(2) ADR-017 addendum documenting the prune-vs-Helm-adopt race that hit during this incident.

Verified: cluster currently serving CRUD /health 200 OK and all 26 agents 200 OK after manual recovery. This PR brings GitOps state in sync with live so the HelmRelease can be unsuspended without regression.
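One possible shape for follow-up (1), as a minimal sketch only and not the chart's actual script: probe for whichever package manager the image ships and choose the install command from that. The function names detect_pkg_manager and install_cmd are hypothetical, and the package names for tdnf and apt-get are assumptions; only the apk package names (postgresql-client, jq) come from this change.

```shell
#!/bin/sh
# Sketch under stated assumptions; not taken from the chart.

# Print the first package manager found on PATH (apk, tdnf, or apt-get).
# Returns non-zero if none of them is available.
detect_pkg_manager() {
    for pm in apk tdnf apt-get; do
        if command -v "$pm" >/dev/null 2>&1; then
            echo "$pm"
            return 0
        fi
    done
    return 1
}

# Print (do not run) the install command for the given package manager.
# The tdnf and apt-get package names are assumptions and need checking
# against the actual image before use.
install_cmd() {
    case "$1" in
        apk)     echo "apk add --no-cache postgresql-client jq" ;;
        tdnf)    echo "tdnf install -y postgresql jq" ;;
        apt-get) echo "apt-get update && apt-get install -y postgresql-client jq" ;;
        *)       echo "unsupported package manager: $1" >&2; return 1 ;;
    esac
}
```

An init container could then run something like: pm=$(detect_pkg_manager) || exit 1; sh -c "$(install_cmd "$pm")". Keeping detection separate from the command table lets an unknown distro fail with a clear message instead of Exit 127. Pinning an Alpine-tagged image, the other option named above, would avoid the branching entirely.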
1 parent 058fe32 commit ffcb500

1 file changed

Lines changed: 6 additions & 1 deletion

.kubernetes/releases/crud/crud-service.yaml

@@ -58,7 +58,12 @@ spec:
       enabled: false
     preflight:
       postgresAuth:
-        enabled: true
+        # Disabled: the chart's init container assumes Alpine (apk add postgresql-client jq),
+        # but mcr.microsoft.com/azure-cli:latest is now Mariner -> apk missing -> Exit 127 ->
+        # CrashLoopBackOff. BaseRepository.check_pool_health self-recovers from transient pool
+        # init errors per fix 811fdbe6 (#911 / PR #1087), so this preflight gate is no longer
+        # load-bearing. Tracked for permanent multi-distro fix in chart in a follow-up.
+        enabled: false
     agc:
       enabled: true
       gatewayClassName: azure-alb-external
