Commit 9df66e8

fix(backup): relax MinIO liveness probe to stop daily SIGKILL restarts (#133)

The rig-prd-backup MinIO sets no timeoutSeconds/failureThreshold, so Kubernetes applies the defaults (1s / 3). MinIO can't reliably answer /minio/health/live within 1s during Ceph-RBD I/O spikes, so the kubelet SIGKILLs it (exit 137, reason Error) ~daily -- 60 restarts in 31 days. A backup target killed mid-write risks failing a nightly backup run.

Set timeoutSeconds: 5, failureThreshold: 5 (period stays 30s -> ~150s of sustained unresponsiveness before a kill).

Memory left at 256Mi/512Mi: the pod is never OOMKilled and sits at ~371Mi, so a limit bump isn't justified.

1 parent aa63c5a commit 9df66e8

1 file changed

Lines changed: 7 additions & 0 deletions

infrastructure/bootstrap/infrastructure/backup-destination/controller/base/deployment.yaml

@@ -52,6 +52,13 @@ spec:
               port: 9000
             initialDelaySeconds: 10
             periodSeconds: 30
+            # Explicit, tolerant values. The k8s defaults (timeoutSeconds 1,
+            # failureThreshold 3) SIGKILL this pod whenever MinIO can't answer
+            # the health check within 1s during Ceph-RBD I/O spikes -- the cause
+            # of the ~daily exit-137 restarts. Memory is fine (never OOMKilled,
+            # ~371Mi of 512Mi), so this is the actual fix.
+            timeoutSeconds: 5
+            failureThreshold: 5
           resources:
             requests:
               memory: "256Mi"
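
For reference, the liveness probe after this change would look roughly like the sketch below. The `livenessProbe` and `httpGet` keys and the indentation are assumptions, since they fall outside the diff context; the path is the one named in the commit message, and the remaining values come from the hunk itself.

```yaml
# Sketch only: the livenessProbe/httpGet keys and nesting are assumed,
# because the diff context starts at "port: 9000".
livenessProbe:
  httpGet:
    path: /minio/health/live   # endpoint named in the commit message
    port: 9000
  initialDelaySeconds: 10
  periodSeconds: 30            # unchanged: one probe every 30s
  timeoutSeconds: 5            # previously unset, so the 1s default applied
  failureThreshold: 5          # previously unset, so the default of 3 applied
  # Kill budget by the commit's estimate: 5 consecutive failures x 30s
  # period = ~150s of sustained unresponsiveness before a restart.
```

The memory requests and limits are deliberately left alone, so the only behavioural change is how long MinIO may stay unresponsive before the kubelet restarts it.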
