Description
Hello,
We are testing Dragonfly & the operator (which seems very nice overall 👍) and ran into an issue: the operator gets stuck in a reconciliation loop around replication in a specific situation, namely after a failover that didn't go smoothly because the container was OOMKilled.
And it never finishes, because (if we understood correctly) the "replicaof" command is not applied again, leaving the cluster in a dead-end situation.
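For context, a manual workaround that should get out of this dead end (a hedged sketch, assuming the pod names from the repro below and the port 9999 that appears in the dragonfly logs) is to re-issue the command by hand:

# Confirm which pod is currently the master:
kubectl exec dragonfly-sample-1 -- redis-cli info replication
# Re-point the stuck pod at it (address/port taken from the logs below):
kubectl exec dragonfly-sample-0 -- redis-cli replicaof 10.64.4.38 9999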
How to reproduce
We start with an instance having some data stored inside.
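(For reproduction, the data could be seeded with something like the snippet below; the key count matches the "num keys read" figure in our logs, and we assume DEBUG POPULATE is available.)

# Seed ~228k keys into the master, roughly matching our dataset:
kubectl exec dragonfly-sample-0 -- redis-cli debug populate 228776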
- Initial state: a healthy, simple master+replica setup:
=> pod 0 is indeed the master, the replica is connected, and the replication works ✅
k get pods -lapp=dragonfly-sample -L role
NAME                 READY   STATUS    RESTARTS   AGE    ROLE
dragonfly-sample-0   2/2     Running   0          2m8s   master
dragonfly-sample-1   2/2     Running   0          104s   replica

k exec dragonfly-sample-0 -- redis-cli info replication
role:master
connected_slaves:1
slave0:ip=127.0.0.6,port=6379,state=online,lag=0
master_replid:3d546d4d0c13ca50673fe7c36b46364a9668c9b0
- Force a failover (by killing the master):
k delete pod/dragonfly-sample-0
- Observe that the role & endpoint change properly ✅:
k get pods -lapp=dragonfly-sample -L role
NAME                 READY   STATUS    RESTARTS      AGE     ROLE
dragonfly-sample-0   2/2     Running   1 (36s ago)   54s     replica
dragonfly-sample-1   2/2     Running   0             5m31s   master

k get endpoints -w
dragonfly-sample   10.64.2.38:6379   104s
dragonfly-sample   <none>            4m49s
dragonfly-sample   10.64.4.38:6379   4m49s
- But the recreated pod gets OOMKilled while trying to rejoin the cluster:
k describe pod/dragonfly-sample-0
[...]
    State:          Running
      Started:      Tue, 19 Aug 2025 12:21:29 +0200
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Tue, 19 Aug 2025 12:21:15 +0200
      Finished:     Tue, 19 Aug 2025 12:21:29 +0200
    Ready:          True
    Restart Count:  1
    Limits:
      memory:  350Mi
    Requests:
      cpu:     500m
      memory:  350Mi
[...]
- The operator did try to enable the replication at first ✅ (before the OOM):
but the container got OOMKilled about one second after the "Replicating" log line
k logs -p dragonfly-sample-0
[...]
I20250819 10:21:18.528891 11 server_family.cc:1369] Load finished, num keys read: 228776
I20250819 10:21:28.490420 11 server_family.cc:3367] Replicating 10.64.4.38:9999
- But it doesn't seem to enable the replication during the second start of the container (the one after the OOM) ❌:
k logs dragonfly-sample-0
[...]
I20250819 10:21:32.601600 11 server_family.cc:1369] Load finished, num keys read: 228776
I20250819 10:22:12.827792 11 save_stages_controller.cc:346] Saving "/data/dump-summary.dfs" finished after 12 s
- And the replication is indeed not working: no replica is connected to the master ❌:
### On the (new) master:
k exec dragonfly-sample-1 -- redis-cli info replication
role:master
connected_slaves:0
master_replid:27113030f37dcce25694d468c1cbdd2d7879b996
- One small piece of good news: the operator has noticed that the replication is not properly set up ✅:

Not all new replicas are in stable status yet

(cf. the full logs below), but it can't reconcile it and keeps cycling through the same log lines ❌:

k logs pod/dragonfly-operator-5684d8889b-lcc6s
2025-08-19T10:26:08Z INFO reconciling dragonfly object {"controller": "Dragonfly", "controllerGroup": "dragonflydb.io", "controllerKind": "Dragonfly", "Dragonfly": {"name":"dragonfly-sample","namespace":"xxx"}, "namespace": "xxx", "name": "dragonfly-sample", "reconcileID": "ad900fe2-4a01-49f5-8a94-c291e0a089ec"}
2025-08-19T10:26:08Z INFO reconciling dragonfly resource {"controller": "Dragonfly", "controllerGroup": "dragonflydb.io", "controllerKind": "Dragonfly", "Dragonfly": {"name":"dragonfly-sample","namespace":"xxx"}, "namespace": "xxx", "name": "dragonfly-sample", "reconcileID": "ad900fe2-4a01-49f5-8a94-c291e0a089ec", "Kind": "StatefulSet", "Namespace": "xxx", "Name": "dragonfly-sample"}
2025-08-19T10:26:08Z INFO reconciling dragonfly resource {"controller": "Dragonfly", "controllerGroup": "dragonflydb.io", "controllerKind": "Dragonfly", "Dragonfly": {"name":"dragonfly-sample","namespace":"xxx"}, "namespace": "xxx", "name": "dragonfly-sample", "reconcileID": "ad900fe2-4a01-49f5-8a94-c291e0a089ec", "Kind": "Service", "Namespace": "xxx", "Name": "dragonfly-sample"}
2025-08-19T10:26:09Z INFO Rolling out new version {"controller": "Dragonfly", "controllerGroup": "dragonflydb.io", "controllerKind": "Dragonfly", "Dragonfly": {"name":"dragonfly-sample","namespace":"xxx"}, "namespace": "xxx", "name": "dragonfly-sample", "reconcileID": "ad900fe2-4a01-49f5-8a94-c291e0a089ec"}
2025-08-19T10:26:09Z INFO New Replica found. Checking if replica had a full sync {"controller": "Dragonfly", "controllerGroup": "dragonflydb.io", "controllerKind": "Dragonfly", "Dragonfly": {"name":"dragonfly-sample","namespace":"xxx"}, "namespace": "xxx", "name": "dragonfly-sample", "reconcileID": "ad900fe2-4a01-49f5-8a94-c291e0a089ec", "pod": "dragonfly-sample-0"}
2025-08-19T10:26:09Z INFO Not all new replicas are in stable status yet {"controller": "Dragonfly", "controllerGroup": "dragonflydb.io", "controllerKind": "Dragonfly", "Dragonfly": {"name":"dragonfly-sample","namespace":"xxx"}, "namespace": "xxx", "name": "dragonfly-sample", "reconcileID": "ad900fe2-4a01-49f5-8a94-c291e0a089ec", "pod": "dragonfly-sample-0", "reason": null}
Expected behavior
We think that the dragonfly operator should be able to reconcile the replication in this (OOM) use case:
=> it probably needs a patch to retry when it can't rebuild the replication; a rough sketch of the retry behavior we have in mind follows.
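For illustration only, here is the retry semantics expressed as a shell loop (a sketch, not actual operator code; it assumes Redis-compatible INFO fields and reuses the master address 10.64.4.38:9999 from the logs above):

# Keep re-issuing REPLICAOF until the replica reports a healthy link to the master:
until kubectl exec dragonfly-sample-0 -- redis-cli info replication | grep -q 'master_link_status:up'; do
  kubectl exec dragonfly-sample-0 -- redis-cli replicaof 10.64.4.38 9999
  sleep 5
done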
NB: the reason for the OOM is a separate topic and not the purpose of this issue; we think it's important to focus on being able to fix the cluster after any failure (=> general resiliency), and the OOM is just one of the many possible reasons a pod/container could fail.
Details
The DB:
apiVersion: dragonflydb.io/v1alpha1
kind: Dragonfly
metadata:
  labels:
    app.kubernetes.io/created-by: dragonfly-operator
    app.kubernetes.io/instance: dragonfly-sample
    app.kubernetes.io/managed-by: kustomize
    app.kubernetes.io/name: dragonfly
    app.kubernetes.io/part-of: dragonfly-operator
  name: dragonfly-sample
spec:
  args:
    - --dbfilename=dump
    - --dir=/data
    - --tiered_prefix=/data/tiered
    - --tiered_max_file_size=3G
    - --proactor_threads=1
  image: docker.dragonflydb.io/dragonflydb/dragonfly:v1.32.0
  imagePullPolicy: IfNotPresent
  replicas: 2
  resources:
    limits:
      memory: 350Mi
    requests:
      cpu: 500m
      memory: 350Mi
  snapshot:
    cron: '*/1 * * * *'
    dir: /data
    persistentVolumeClaimSpec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi
      storageClassName: pd-ssd-retain
NB: the 2nd container visible in the k get pods output above is istio-proxy, hence the 127.0.0.6 address.