
Can't reconcile (replication) after OOMKill #348

Description

@frivoire

Hello,

We are testing Dragonfly & the operator (which seems very nice overall 👍) and hit an issue: the operator seems stuck in a reconciliation loop on replication in a specific situation: after a failover that didn't go smoothly because the container was OOMKilled.
It never finishes because (if we understood properly) the "replicaof" command is never applied again, leaving the cluster in a dead-end situation.

How to reproduce

We start with an instance having some data stored inside.
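
For reference, any way of loading a few hundred thousand keys works to reproduce; something along these lines should do, assuming debug populate is available in this Dragonfly version (the key count is only indicative, matching the "num keys read" in our logs below):

    ### Seed ~230k keys into the current master
    k exec dragonfly-sample-0 -- redis-cli debug populate 230000
    ### Optionally check memory usage against the 350Mi limit
    k exec dragonfly-sample-0 -- redis-cli info memory | grep used_memory_human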

  1. Initial state: a healthy, simple master+replica setup:
    k get pods -lapp=dragonfly-sample -L role
    NAME                 READY   STATUS    RESTARTS   AGE    ROLE
    dragonfly-sample-0   2/2     Running   0          2m8s   master
    dragonfly-sample-1   2/2     Running   0          104s   replica
    
    k exec dragonfly-sample-0 -- redis-cli info replication
    role:master
    connected_slaves:1
    slave0:ip=127.0.0.6,port=6379,state=online,lag=0
    master_replid:3d546d4d0c13ca50673fe7c36b46364a9668c9b0
    
    => pod 0 is indeed the master, the replica is connected, and replication works ✅
  2. Force a failover (by killing the master):
    k delete pod/dragonfly-sample-0
    
  3. Observe that the role & endpoint change properly ✅:
    k get pods -lapp=dragonfly-sample -L role 
    NAME                 READY   STATUS    RESTARTS      AGE     ROLE
    dragonfly-sample-0   2/2     Running   1 (36s ago)   54s     replica
    dragonfly-sample-1   2/2     Running   0             5m31s   master
    
    k get endpoints -w
    dragonfly-sample    10.64.2.38:6379      104s
    dragonfly-sample    <none>               4m49s
    dragonfly-sample    10.64.4.38:6379      4m49s
    
  4. But the recreated pod gets OOMKilled when trying to rejoin the cluster:
    k describe pod/dragonfly-sample-0
    [...]
        State:          Running
          Started:      Tue, 19 Aug 2025 12:21:29 +0200
        Last State:     Terminated
          Reason:       OOMKilled
          Exit Code:    137
          Started:      Tue, 19 Aug 2025 12:21:15 +0200
          Finished:     Tue, 19 Aug 2025 12:21:29 +0200
        Ready:          True
        Restart Count:  1
        Limits:
          memory:  350Mi
        Requests:
          cpu:      500m
          memory:   350Mi
    [...]
    
  5. The operator did try to enable replication on the container's first start ✅ (before the OOM):
    k logs -p dragonfly-sample-0
    [...]
    I20250819 10:21:18.528891    11 server_family.cc:1369] Load finished, num keys read: 228776
    I20250819 10:21:28.490420    11 server_family.cc:3367] Replicating 10.64.4.38:9999
    
    but the container was OOMKilled ~1 second after the "Replicating" log line
  6. But it doesn't seem to enable replication during the container's 2nd start (the one after the OOM) ❌:
    k logs dragonfly-sample-0
    [...]
    I20250819 10:21:32.601600    11 server_family.cc:1369] Load finished, num keys read: 228776
    I20250819 10:22:12.827792    11 save_stages_controller.cc:346] Saving "/data/dump-summary.dfs" finished after 12 s
    
  7. And replication is indeed not working: no replica is connected to the master ❌:
    ### On the (new) master:
    k exec dragonfly-sample-1 -- redis-cli info replication
    role:master
    connected_slaves:0
    master_replid:27113030f37dcce25694d468c1cbdd2d7879b996
    
  8. One small piece of good news: the operator does detect that replication is not properly set up (Not all new replicas are in stable status yet, cf. logs below) ✅, but it can't reconcile it and keeps cycling through the same log lines in a loop ❌ (see the manual workaround sketch after this list):
    k logs pod/dragonfly-operator-5684d8889b-lcc6s
    2025-08-19T10:26:08Z	INFO	reconciling dragonfly object	{"controller": "Dragonfly", "controllerGroup": "dragonflydb.io", "controllerKind": "Dragonfly", "Dragonfly": {"name":"dragonfly-sample","namespace":"xxx"}, "namespace": "xxx", "name": "dragonfly-sample", "reconcileID": "ad900fe2-4a01-49f5-8a94-c291e0a089ec"}
    2025-08-19T10:26:08Z	INFO	reconciling dragonfly resource	{"controller": "Dragonfly", "controllerGroup": "dragonflydb.io", "controllerKind": "Dragonfly", "Dragonfly": {"name":"dragonfly-sample","namespace":"xxx"}, "namespace": "xxx", "name": "dragonfly-sample", "reconcileID": "ad900fe2-4a01-49f5-8a94-c291e0a089ec", "Kind": "StatefulSet", "Namespace": "xxx", "Name": "dragonfly-sample"}
    2025-08-19T10:26:08Z	INFO	reconciling dragonfly resource	{"controller": "Dragonfly", "controllerGroup": "dragonflydb.io", "controllerKind": "Dragonfly", "Dragonfly": {"name":"dragonfly-sample","namespace":"xxx"}, "namespace": "xxx", "name": "dragonfly-sample", "reconcileID": "ad900fe2-4a01-49f5-8a94-c291e0a089ec", "Kind": "Service", "Namespace": "xxx", "Name": "dragonfly-sample"}
    2025-08-19T10:26:09Z	INFO	Rolling out new version	{"controller": "Dragonfly", "controllerGroup": "dragonflydb.io", "controllerKind": "Dragonfly", "Dragonfly": {"name":"dragonfly-sample","namespace":"xxx"}, "namespace": "xxx", "name": "dragonfly-sample", "reconcileID": "ad900fe2-4a01-49f5-8a94-c291e0a089ec"}
    2025-08-19T10:26:09Z	INFO	New Replica found. Checking if replica had a full sync	{"controller": "Dragonfly", "controllerGroup": "dragonflydb.io", "controllerKind": "Dragonfly", "Dragonfly": {"name":"dragonfly-sample","namespace":"xxx"}, "namespace": "xxx", "name": "dragonfly-sample", "reconcileID": "ad900fe2-4a01-49f5-8a94-c291e0a089ec", "pod": "dragonfly-sample-0"}
    2025-08-19T10:26:09Z	INFO	Not all new replicas are in stable status yet	{"controller": "Dragonfly", "controllerGroup": "dragonflydb.io", "controllerKind": "Dragonfly", "Dragonfly": {"name":"dragonfly-sample","namespace":"xxx"}, "namespace": "xxx", "name": "dragonfly-sample", "reconcileID": "ad900fe2-4a01-49f5-8a94-c291e0a089ec", "pod": "dragonfly-sample-0", "reason": null}
    
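For what it's worth, manually re-issuing the replication command is what we'd expect to unblock this dead-end; here is a minimal sketch of such a workaround (untested, and obviously not a fix), assuming the master address/admin port (10.64.4.38:9999) taken from the "Replicating" log line above:

    ### Check the role of pod 0 after the restart
    ### (we expect role:master here, since replicaof was never re-applied)
    k exec dragonfly-sample-0 -- redis-cli info replication | grep role

    ### Re-issue the command the operator had sent before the OOMKill
    k exec dragonfly-sample-0 -- redis-cli replicaof 10.64.4.38 9999

    ### Verify on the current master that the replica is connected again
    k exec dragonfly-sample-1 -- redis-cli info replication | grep connected_slaves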

Expected behavior

We think the dragonfly operator should be able to reconcile replication in this (OOM) use case:
=> the operator probably needs a patch to retry when it can't rebuild the replication.

NB: the root cause of the OOM is a separate topic and not the purpose of this issue. We think it's important to focus on being able to fix the cluster after any kind of failure (=> general resiliency); the OOM is just one of many possible reasons a pod/container could fail.
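
Until the operator handles this, a rough stopgap we could imagine running out-of-band (for example from a CronJob) would be to re-apply replicaof ourselves whenever a pod labelled role=replica is not actually replicating. The script below is only a sketch under the assumptions of our setup (label names, namespace, the admin port 9999 from the operator logs, and master_link_status being reported as in Redis), not a proper fix:

    #!/usr/bin/env bash
    set -euo pipefail

    NS=xxx
    APP=dragonfly-sample

    # Current master IP, based on the role label maintained by the operator
    MASTER_IP=$(kubectl -n "$NS" get pods -l "app=$APP,role=master" \
      -o jsonpath='{.items[0].status.podIP}')

    # For every pod labelled as replica, re-issue REPLICAOF if it is not replicating
    for pod in $(kubectl -n "$NS" get pods -l "app=$APP,role=replica" -o name); do
      name="${pod#pod/}"
      # Assumption: a healthy replica reports master_link_status:up (as in Redis)
      link=$(kubectl -n "$NS" exec "$name" -- redis-cli info replication \
        | grep -c 'master_link_status:up' || true)
      if [ "$link" -eq 0 ]; then
        echo "re-enabling replication on $name -> $MASTER_IP:9999"
        kubectl -n "$NS" exec "$name" -- redis-cli replicaof "$MASTER_IP" 9999
      fi
    done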

Details

The DB:

apiVersion: dragonflydb.io/v1alpha1
kind: Dragonfly
metadata:
  labels:
    app.kubernetes.io/created-by: dragonfly-operator
    app.kubernetes.io/instance: dragonfly-sample
    app.kubernetes.io/managed-by: kustomize
    app.kubernetes.io/name: dragonfly
    app.kubernetes.io/part-of: dragonfly-operator
  name: dragonfly-sample
spec:
  args:
  - --dbfilename=dump
  - --dir=/data
  - --tiered_prefix=/data/tiered
  - --tiered_max_file_size=3G
  - --proactor_threads=1
  image: docker.dragonflydb.io/dragonflydb/dragonfly:v1.32.0
  imagePullPolicy: IfNotPresent
  replicas: 2
  resources:
    limits:
      memory: 350Mi
    requests:
      cpu: 500m
      memory: 350Mi
  snapshot:
    cron: '*/1 * * * *'
    dir: /data
    persistentVolumeClaimSpec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi
      storageClassName: pd-ssd-retain

NB: the 2nd container visible in the k get pods output above is istio-proxy, hence the 127.0.0.6 address in the replication info.
