fix(crud): disable broken postgres-auth preflight init container (#1088) #1091

Merged
Cataldir merged 1 commit into main from bug/1088-disable-crud-preflight
May 10, 2026

Conversation

@Cataldir

Contributor

Summary

Disable the broken postgres-auth-preflight init container in the CRUD HelmRelease values to restore Flux GitOps management of crud-service.

Context (incident)

PR #1090 finalized the Pattern A Helm takeover for crud-service (per ADR-017). Flux installed the HelmRelease, and the rendered Deployment caused a rolling update because pod template hashes differed from the live azd deploy-applied spec.

The new pod failed to start because:

  1. The chart's postgres-auth-preflight init container at .kubernetes/chart/templates/deployment.yaml#L70-L156 installs psql and jq via apk add, assuming the base image (mcr.microsoft.com/azure-cli:latest) is Alpine.
  2. The image is now Mariner-based, so apk is missing → Exit 127 → CrashLoopBackOff.
  3. The legacy live pod (5h+ uptime) was using a cached Alpine layer, so it kept running. The pod created by Helm's rolling update pulled a fresh image and failed.
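
The exit code in step 2 follows from standard shell behavior: when a command cannot be found on PATH, the shell returns status 127. A minimal sketch that reproduces this locally, without a container. The helper name `exit_status_of` and the placeholder command name are illustrative assumptions, not part of the chart.

```shell
#!/bin/sh
# Hypothetical helper: run a command line in a child shell and print its exit status.
exit_status_of() {
  sh -c "$1" >/dev/null 2>&1
  printf '%s\n' "$?"
}

# A command that does not exist on PATH yields 127, the same status the init
# container reported when 'apk' was missing from the Mariner-based image.
exit_status_of 'no-such-package-manager add jq'   # prints 127
```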

A second race (the legacy ALB was pruned by kustomize-controller after Helm adoption) compounded the impact, but that part has been recovered by re-applying the Helm-rendered manifest. The cluster currently shows CRUD /health 200 OK and 26/26 agents 200 OK.

Change

Flip preflight.postgresAuth.enabled from true to false in .kubernetes/releases/crud/crud-service.yaml.

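Why this is safe

BaseRepository.check_pool_health self-recovers from transient pool init errors (commit 811fdbe, #911 / PR #1087), so the preflight gate is no longer load-bearing.
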
Post-merge plan

  1. Once Flux GitRepository advances to the merge SHA, unsuspend the HelmRelease:
    kubectl patch helmrelease crud-service -n flux-system --type=merge -p '{"spec":{"suspend":false}}'
  2. Helm will reconcile. Rendered Deployment (no init container) matches the current live drift → quiet rollout.
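
Step 1 depends on the GitRepository having advanced to the merge commit. The following is a hedged sketch of that check, run before unsuspending. The GitRepository name `flux-system` is an assumption (this PR does not name it), and the helper names are hypothetical. Because the revision string format differs between Flux versions, the helper only tests whether the SHA appears somewhere in the reported revision.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the reported revision contains the expected SHA.
revision_has_sha() {
  [ -n "$2" ] || return 1          # refuse an empty SHA, which would match anything
  case "$1" in
    *"$2"*) return 0 ;;
    *)      return 1 ;;
  esac
}

# Assumption: the GitRepository object is named "flux-system"; adjust as needed.
unsuspend_if_synced() {
  rev=$(kubectl get gitrepository flux-system -n flux-system \
        -o jsonpath='{.status.artifact.revision}') || return 1
  if revision_has_sha "$rev" "$1"; then
    kubectl patch helmrelease crud-service -n flux-system \
      --type=merge -p '{"spec":{"suspend":false}}'
  else
    echo "GitRepository is at '$rev'; not yet at $1" >&2
    return 1
  fi
}

# Usage, with the short SHA of the merge commit:
#   unsuspend_if_synced ffcb500
```

Guarding the patch this way avoids unsuspending while Flux still holds the old values, in which case Helm would render the init container again.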

Follow-ups (separate PRs)

  • Chart fix: replace apk add with multi-distro detection (apk/tdnf/apt-get) or pin an Alpine-tagged azure-cli image, then re-enable preflight where appropriate.
  • ADR-017 addendum: document the prune-vs-Helm-adopt race and add a pre-flight verification step (chart vs live spec diff) for Pattern A takeovers.
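
The multi-distro detection proposed in the first follow-up could look roughly like the sketch below. It is not the chart's actual script: the function names are hypothetical, and package names for psql and jq differ per distribution, so they are left as arguments for the caller to supply.

```shell
#!/bin/sh
# Hypothetical helper: print the first supported package manager found on PATH.
detect_pkg_manager() {
  for pm in apk tdnf apt-get; do
    if command -v "$pm" >/dev/null 2>&1; then
      printf '%s\n' "$pm"
      return 0
    fi
  done
  return 1
}

# Hypothetical helper: print the non-interactive install command prefix for a manager.
# Note: apt-get normally needs 'apt-get update' to run first.
install_verb() {
  case "$1" in
    apk)     printf '%s\n' "apk add --no-cache" ;;
    tdnf)    printf '%s\n' "tdnf install -y" ;;
    apt-get) printf '%s\n' "apt-get install -y" ;;
    *)       return 1 ;;
  esac
}

# Usage inside an init container script (package names are an assumption and
# must be chosen per distribution):
#   pm=$(detect_pkg_manager) || { echo "no supported package manager" >&2; exit 1; }
#   $(install_verb "$pm") "$@"
```

Failing with an explicit message when no manager is found makes the cause visible in the pod logs, rather than surfacing only as Exit 127.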

Verification

  • Pre-push gate: 1326 lib tests + 705 app tests pass; pylint 9.91/10; mypy clean.
  • External AGC probe: http://esbcc8bcfyazbbdg.fz03.alb.azure.com/health → HTTP 200; 26/26 agent /<service>/health endpoints → HTTP 200.

Closes #1088 (incident remediation tracking).

The chart's postgres-auth-preflight init container at .kubernetes/chart/templates/deployment.yaml lines 70-156 installs psql/jq via 'apk add', assuming the base image (mcr.microsoft.com/azure-cli:latest) is Alpine. The image is now Mariner-based, so 'apk' is missing -> Exit 127 -> CrashLoopBackOff -> CRUD pod never starts.

Triggered an outage during PR #1090 (Pattern A Helm takeover): the legacy live pod was using a cached Alpine layer (5h+ uptime), but the freshly pulled image broke when Helm's rolling update created a new ReplicaSet. Recovery required manually stripping initContainers from the live Deployment and suspending the HelmRelease.

Fix: flip preflight.postgresAuth.enabled from true to false in the HelmRelease values for crud-service. Safe because BaseRepository.check_pool_health self-recovers from transient pool init errors per commit 811fdbe (#911 / PR #1087) - the preflight gate is no longer load-bearing.

Follow-up issues to file: (1) chart fix to support multi-distro package install (apk/tdnf/apt-get) or pin an Alpine-tagged image; (2) ADR-017 addendum documenting the prune-vs-Helm-adopt race that hit during this incident.

Verified: cluster currently serving CRUD /health 200 OK and all 26 agents 200 OK after manual recovery. This PR brings GitOps state in sync with live so the HelmRelease can be unsuspended without regression.
@Cataldir Cataldir merged commit ffcb500 into main May 10, 2026
13 checks passed
@Cataldir Cataldir deleted the bug/1088-disable-crud-preflight branch May 10, 2026 01:01