fix(crud): disable broken postgres-auth preflight init container (#1088) #1091
Merged
The chart's postgres-auth-preflight init container at .kubernetes/chart/templates/deployment.yaml lines 70-156 installs psql/jq via 'apk add', assuming the base image (mcr.microsoft.com/azure-cli:latest) is Alpine. The image is now Mariner-based, so 'apk' is missing -> Exit 127 -> CrashLoopBackOff -> CRUD pod never starts.

This triggered an outage during PR #1090 (Pattern A Helm takeover): the legacy live pod was using a cached Alpine layer (5h+ uptime), but the freshly pulled image broke when Helm's rolling update created a new ReplicaSet. Recovery required manually stripping initContainers from the live Deployment and suspending the HelmRelease.

Fix: flip preflight.postgresAuth.enabled from true to false in the HelmRelease values for crud-service. This is safe because BaseRepository.check_pool_health self-recovers from transient pool init errors per commit 811fdbe (#911 / PR #1087), so the preflight gate is no longer load-bearing.

Follow-up issues to file:
(1) chart fix to support multi-distro package install (apk/tdnf/apt-get) or pin an Alpine-tagged image;
(2) ADR-017 addendum documenting the prune-vs-Helm-adopt race that hit during this incident.

Verified: cluster currently serving CRUD /health 200 OK and all 26 agents 200 OK after manual recovery. This PR brings GitOps state in sync with live so the HelmRelease can be unsuspended without regression.
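Follow-up (1) above could look roughly like the sketch below: a POSIX shell helper that probes for `apk`, `tdnf`, or `apt-get` and installs with whichever one is present. The function names (`detect_pkg_manager`, `install_cmd_for`, `install_packages`) are hypothetical and not part of the chart, and the probe order is an assumption.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers, not the chart's actual preflight script.

# Print the first supported package manager found on PATH; fail if none exists.
detect_pkg_manager() {
  for mgr in apk tdnf apt-get; do
    if command -v "$mgr" >/dev/null 2>&1; then
      echo "$mgr"
      return 0
    fi
  done
  return 1
}

# Map a package manager name to its non-interactive install command.
install_cmd_for() {
  case "$1" in
    apk)     echo "apk add --no-cache" ;;
    tdnf)    echo "tdnf install -y" ;;
    apt-get) echo "apt-get install -y --no-install-recommends" ;;
    *)       echo "unsupported package manager: $1" >&2; return 1 ;;
  esac
}

# Install the given packages with whichever manager is available.
# Assumption: the caller passes package names that are valid for that distro.
install_packages() {
  mgr=$(detect_pkg_manager) || {
    echo "no supported package manager found" >&2
    return 127
  }
  if [ "$mgr" = "apt-get" ]; then
    apt-get update || return 1   # apt needs a package index before installing
  fi
  # The command string is left unquoted on purpose so it splits into words.
  $(install_cmd_for "$mgr") "$@"
}
```

A call such as `install_packages jq` would then work on any of the three distro families. The package that provides `psql` is named differently across distributions, so that name would need to be looked up per distro rather than hard-coded as it is today.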
Summary
Disable the broken `postgres-auth-preflight` init container in the CRUD HelmRelease values to restore Flux GitOps management of `crud-service`.

Context (incident)
PR #1090 finalized the Pattern A Helm takeover for `crud-service` (per ADR-017). Flux installed the HelmRelease, and the rendered Deployment caused a rolling update because pod template hashes differed from the live `azd deploy`-applied spec.

The new pod failed to start because:

- The `postgres-auth-preflight` init container at .kubernetes/chart/templates/deployment.yaml#L70-L156 installs `psql` and `jq` via `apk add`, assuming the base image (`mcr.microsoft.com/azure-cli:latest`) is Alpine.
- The image is now Mariner-based, so `apk` is missing → `Exit 127` → `CrashLoopBackOff`.

A second race (legacy ALB pruned by `kustomize-controller` after Helm adoption) compounded the impact, but that part has been recovered by re-applying the Helm-rendered manifest. The cluster currently shows CRUD `/health` 200 OK and 26/26 agents 200 OK.

Change
Flip `preflight.postgresAuth.enabled` from `true` to `false` in .kubernetes/releases/crud/crud-service.yaml.

Why this is safe
- `BaseRepository.check_pool_health` self-recovers from transient pool init errors per commit 811fdbe (PR #1087 / fixes #911). The preflight gate is no longer load-bearing.
- The `crud-service` test suite (265 tests, including `test_base_repository_pool_health.py`) covers the self-healing path.

Post-merge plan
kubectl patch helmrelease crud-service -n flux-system --type=merge -p '{"spec":{"suspend":false}}'

Follow-ups (separate PRs)
- Replace `apk add` with multi-distro detection (`apk`/`tdnf`/`apt-get`) or pin an Alpine-tagged `azure-cli` image, then re-enable preflight where appropriate.

Verification
http://esbcc8bcfyazbbdg.fz03.alb.azure.com/health → HTTP 200; 26/26 agent `/<service>/health` endpoints → HTTP 200.

Closes #1088 (incident remediation tracking).
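The health verification above could be repeated after the HelmRelease is unsuspended with a small loop like this one. It is a minimal sketch: `http_status` and `check_health` are hypothetical helper names, and the 10-second timeout is an assumption rather than anything this PR specifies.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers for re-running the health verification.

# Print the HTTP status code for a URL (curl prints 000 if the request fails).
http_status() {
  curl -s -o /dev/null -m 10 -w '%{http_code}' "$1"
}

# Check every URL passed as an argument; return non-zero if any is not 200.
check_health() {
  failed=0
  for url in "$@"; do
    # curl exits non-zero on connection errors; keep whatever code it printed.
    code=$(http_status "$url") || true
    echo "$url -> HTTP $code"
    [ "$code" = "200" ] || failed=1
  done
  return "$failed"
}
```

For example, `check_health http://esbcc8bcfyazbbdg.fz03.alb.azure.com/health` covers the CRUD endpoint. The 26 agent URLs follow the `/<service>/health` pattern, but the service names are not listed in this PR, so they would have to be supplied by the caller.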