fix(backup): relax MinIO liveness probe to stop daily SIGKILL restarts (#133)

uittenbroekrobbert · web-flow · commit 9df66e8e591a · 2026-06-12T14:58:46.000+02:00
The rig-prd-backup MinIO liveness probe sets no timeoutSeconds or
failureThreshold, so Kubernetes applies the defaults (1s / 3). MinIO can't
reliably answer /minio/health/live within 1s during Ceph-RBD I/O spikes, so the
kubelet SIGKILLs it (exit 137, reason Error) ~daily -- 60 restarts in 31 days.
A backup target killed mid-write risks failing a nightly backup run.

Set timeoutSeconds: 5, failureThreshold: 5 (period stays 30s -> ~150s of
sustained unresponsiveness before a kill). Memory left at 256Mi/512Mi: the pod
is never OOMKilled and sits at ~371Mi, so a limit bump isn't justified.
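
The ~150s figure is roughly failureThreshold x periodSeconds. As a rough sanity
check, the window can be modelled as below. This is a simplified sketch, not
kubelet code: the function name is hypothetical, and it assumes probes fire
exactly one period apart and that each failed probe uses its full timeout.

```python
def seconds_until_restart(period_s: int, timeout_s: int, failure_threshold: int) -> tuple[int, int]:
    """Rough (lower, upper) bounds on unresponsiveness before a liveness kill.

    Simplified model (an assumption, not kubelet internals): probes start
    exactly period_s apart and a failing probe takes the full timeout_s.
    """
    # Best case for the kubelet: the outage starts right as a probe fires,
    # so the Nth failure lands (N - 1) periods later, plus its timeout.
    lower = (failure_threshold - 1) * period_s + timeout_s
    # Worst case: the outage starts just after a successful probe, so
    # almost a full period passes before the first failing probe begins.
    upper = failure_threshold * period_s + timeout_s
    return lower, upper


if __name__ == "__main__":
    print("new values:", seconds_until_restart(30, 5, 5))
    print("k8s defaults:", seconds_until_restart(30, 1, 3))
```

Under this model the new values tolerate somewhere between about two and two
and a half minutes of unresponsiveness, which matches the ~150s estimate above.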
diff --git a/infrastructure/bootstrap/infrastructure/backup-destination/controller/base/deployment.yaml b/infrastructure/bootstrap/infrastructure/backup-destination/controller/base/deployment.yaml
@@ -52,6 +52,13 @@ spec:
               port: 9000
             initialDelaySeconds: 10
             periodSeconds: 30
+            # Explicit, tolerant values. The k8s defaults (timeoutSeconds 1,
+            # failureThreshold 3) SIGKILL this pod whenever MinIO can't answer
+            # the health check within 1s during Ceph-RBD I/O spikes -- the cause
+            # of the ~daily exit-137 restarts. Memory is fine (never OOMKilled,
+            # ~371Mi of 512Mi), so this is the actual fix.
+            timeoutSeconds: 5
+            failureThreshold: 5
           resources:
             requests:
               memory: "256Mi"