
Can't reconcile (replication) after OOMKill #348

Description

@frivoire

Hello,

We are testing Dragonfly & the operator (which seems very nice overall 👍) and hit an issue: the operator seems stuck in a reconciliation loop on replication in a specific situation: after a failover that didn't go smoothly because the container was OOMKilled.
It never finishes because (if we understood properly) the "replicaof" command is never applied again, leaving the cluster in a dead-end situation.

How to reproduce

We start with an instance having some data stored inside.
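
For reference, any way of loading a few hundred thousand keys works to reproduce; something along these lines should do, assuming debug populate is available in this Dragonfly version (the key count is only indicative, matching the "num keys read" in our logs below):

    ### Seed ~230k keys into the current master
    k exec dragonfly-sample-0 -- redis-cli debug populate 230000
    ### Optionally check memory usage against the 350Mi limit
    k exec dragonfly-sample-0 -- redis-cli info memory | grep used_memory_human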

  1. Initial state: a healthy, simple master+replica setup:
    k get pods -lapp=dragonfly-sample -L role
    NAME                 READY   STATUS    RESTARTS   AGE    ROLE
    dragonfly-sample-0   2/2     Running   0          2m8s   master
    dragonfly-sample-1   2/2     Running   0          104s   replica
    
    k exec dragonfly-sample-0 -- redis-cli info replication
    role:master
    connected_slaves:1
    slave0:ip=127.0.0.6,port=6379,state=online,lag=0
    master_replid:3d546d4d0c13ca50673fe7c36b46364a9668c9b0
    
    => pod 0 is indeed the master, the replica is connected, and replication works ✅
  2. Force a failover (by killing the master):
    k delete pod/dragonfly-sample-0
    
  3. Observe that the role & endpoint change properly ✅:
    k get pods -lapp=dragonfly-sample -L role 
    NAME                 READY   STATUS    RESTARTS      AGE     ROLE
    dragonfly-sample-0   2/2     Running   1 (36s ago)   54s     replica
    dragonfly-sample-1   2/2     Running   0             5m31s   master
    
    k get endpoints -w
    dragonfly-sample    10.64.2.38:6379      104s
    dragonfly-sample    <none>               4m49s
    dragonfly-sample    10.64.4.38:6379      4m49s
    
  4. But the recreated pod gets OOMKilled when trying to rejoin the cluster:
    k describe pod/dragonfly-sample-0
    [...]
        State:          Running
          Started:      Tue, 19 Aug 2025 12:21:29 +0200
        Last State:     Terminated
          Reason:       OOMKilled
          Exit Code:    137
          Started:      Tue, 19 Aug 2025 12:21:15 +0200
          Finished:     Tue, 19 Aug 2025 12:21:29 +0200
        Ready:          True
        Restart Count:  1
        Limits:
          memory:  350Mi
        Requests:
          cpu:      500m
          memory:   350Mi
    [...]
    
  5. The operator did try to enable replication on the container's first start ✅ (before the OOM):
    k logs -p dragonfly-sample-0
    [...]
    I20250819 10:21:18.528891    11 server_family.cc:1369] Load finished, num keys read: 228776
    I20250819 10:21:28.490420    11 server_family.cc:3367] Replicating 10.64.4.38:9999
    
    but the container was OOMKilled ~1 second after the "Replicating" log line
  6. But it doesn't seem to enable replication during the container's 2nd start (the one after the OOM) ❌:
    k logs dragonfly-sample-0
    [...]
    I20250819 10:21:32.601600    11 server_family.cc:1369] Load finished, num keys read: 228776
    I20250819 10:22:12.827792    11 save_stages_controller.cc:346] Saving "/data/dump-summary.dfs" finished after 12 s
    
  7. And replication is indeed not working: no replica is connected to the master ❌:
    ### On the (new) master:
    k exec dragonfly-sample-1 -- redis-cli info replication
    role:master
    connected_slaves:0
    master_replid:27113030f37dcce25694d468c1cbdd2d7879b996
    
  8. One small piece of good news: the operator does detect that replication is not properly set up (Not all new replicas are in stable status yet, cf. logs below) ✅, but it can't reconcile it and keeps cycling through the same log lines in a loop ❌ (see the manual workaround sketch after this list):
    k logs pod/dragonfly-operator-5684d8889b-lcc6s
    2025-08-19T10:26:08Z	INFO	reconciling dragonfly object	{"controller": "Dragonfly", "controllerGroup": "dragonflydb.io", "controllerKind": "Dragonfly", "Dragonfly": {"name":"dragonfly-sample","namespace":"xxx"}, "namespace": "xxx", "name": "dragonfly-sample", "reconcileID": "ad900fe2-4a01-49f5-8a94-c291e0a089ec"}
    2025-08-19T10:26:08Z	INFO	reconciling dragonfly resource	{"controller": "Dragonfly", "controllerGroup": "dragonflydb.io", "controllerKind": "Dragonfly", "Dragonfly": {"name":"dragonfly-sample","namespace":"xxx"}, "namespace": "xxx", "name": "dragonfly-sample", "reconcileID": "ad900fe2-4a01-49f5-8a94-c291e0a089ec", "Kind": "StatefulSet", "Namespace": "xxx", "Name": "dragonfly-sample"}
    2025-08-19T10:26:08Z	INFO	reconciling dragonfly resource	{"controller": "Dragonfly", "controllerGroup": "dragonflydb.io", "controllerKind": "Dragonfly", "Dragonfly": {"name":"dragonfly-sample","namespace":"xxx"}, "namespace": "xxx", "name": "dragonfly-sample", "reconcileID": "ad900fe2-4a01-49f5-8a94-c291e0a089ec", "Kind": "Service", "Namespace": "xxx", "Name": "dragonfly-sample"}
    2025-08-19T10:26:09Z	INFO	Rolling out new version	{"controller": "Dragonfly", "controllerGroup": "dragonflydb.io", "controllerKind": "Dragonfly", "Dragonfly": {"name":"dragonfly-sample","namespace":"xxx"}, "namespace": "xxx", "name": "dragonfly-sample", "reconcileID": "ad900fe2-4a01-49f5-8a94-c291e0a089ec"}
    2025-08-19T10:26:09Z	INFO	New Replica found. Checking if replica had a full sync	{"controller": "Dragonfly", "controllerGroup": "dragonflydb.io", "controllerKind": "Dragonfly", "Dragonfly": {"name":"dragonfly-sample","namespace":"xxx"}, "namespace": "xxx", "name": "dragonfly-sample", "reconcileID": "ad900fe2-4a01-49f5-8a94-c291e0a089ec", "pod": "dragonfly-sample-0"}
    2025-08-19T10:26:09Z	INFO	Not all new replicas are in stable status yet	{"controller": "Dragonfly", "controllerGroup": "dragonflydb.io", "controllerKind": "Dragonfly", "Dragonfly": {"name":"dragonfly-sample","namespace":"xxx"}, "namespace": "xxx", "name": "dragonfly-sample", "reconcileID": "ad900fe2-4a01-49f5-8a94-c291e0a089ec", "pod": "dragonfly-sample-0", "reason": null}
    
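For what it's worth, manually re-issuing the replication command is what we'd expect to unblock this dead-end; here is a minimal sketch of such a workaround (untested, and obviously not a fix), assuming the master address/admin port (10.64.4.38:9999) taken from the "Replicating" log line above:

    ### Check the role of pod 0 after the restart
    ### (we expect role:master here, since replicaof was never re-applied)
    k exec dragonfly-sample-0 -- redis-cli info replication | grep role

    ### Re-issue the command the operator had sent before the OOMKill
    k exec dragonfly-sample-0 -- redis-cli replicaof 10.64.4.38 9999

    ### Verify on the current master that the replica is connected again
    k exec dragonfly-sample-1 -- redis-cli info replication | grep connected_slaves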

Expected behavior

We think the dragonfly operator should be able to reconcile replication in this (OOM) use case:
=> the operator probably needs a patch to retry when it can't rebuild the replication.

NB: the root cause of the OOM is a separate topic and not the purpose of this issue. We think it's important to focus on being able to fix the cluster after any kind of failure (=> general resiliency); the OOM is just one of many possible reasons a pod/container could fail.
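
Until the operator handles this, a rough stopgap we could imagine running out-of-band (for example from a CronJob) would be to re-apply replicaof ourselves whenever a pod labelled role=replica is not actually replicating. The script below is only a sketch under the assumptions of our setup (label names, namespace, the admin port 9999 from the operator logs, and master_link_status being reported as in Redis), not a proper fix:

    #!/usr/bin/env bash
    set -euo pipefail

    NS=xxx
    APP=dragonfly-sample

    # Current master IP, based on the role label maintained by the operator
    MASTER_IP=$(kubectl -n "$NS" get pods -l "app=$APP,role=master" \
      -o jsonpath='{.items[0].status.podIP}')

    # For every pod labelled as replica, re-issue REPLICAOF if it is not replicating
    for pod in $(kubectl -n "$NS" get pods -l "app=$APP,role=replica" -o name); do
      name="${pod#pod/}"
      # Assumption: a healthy replica reports master_link_status:up (as in Redis)
      link=$(kubectl -n "$NS" exec "$name" -- redis-cli info replication \
        | grep -c 'master_link_status:up' || true)
      if [ "$link" -eq 0 ]; then
        echo "re-enabling replication on $name -> $MASTER_IP:9999"
        kubectl -n "$NS" exec "$name" -- redis-cli replicaof "$MASTER_IP" 9999
      fi
    done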

Details

The DB:

apiVersion: dragonflydb.io/v1alpha1
kind: Dragonfly
metadata:
  labels:
    app.kubernetes.io/created-by: dragonfly-operator
    app.kubernetes.io/instance: dragonfly-sample
    app.kubernetes.io/managed-by: kustomize
    app.kubernetes.io/name: dragonfly
    app.kubernetes.io/part-of: dragonfly-operator
  name: dragonfly-sample
spec:
  args:
  - --dbfilename=dump
  - --dir=/data
  - --tiered_prefix=/data/tiered
  - --tiered_max_file_size=3G
  - --proactor_threads=1
  image: docker.dragonflydb.io/dragonflydb/dragonfly:v1.32.0
  imagePullPolicy: IfNotPresent
  replicas: 2
  resources:
    limits:
      memory: 350Mi
    requests:
      cpu: 500m
      memory: 350Mi
  snapshot:
    cron: '*/1 * * * *'
    dir: /data
    persistentVolumeClaimSpec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi
      storageClassName: pd-ssd-retain

NB: the 2nd container visible in the k get pods output above is istio-proxy, hence the 127.0.0.6 address in the replication info.
