Description
dragonfly version: 1.33.1
When running a master/replica setup on k8s with the dragonfly operator, if the master fails there is no guarantee that the operator can kick in before the master pod restarts. This itself is not a bug; it's just the nature of a distributed system. What it means in practice, though, is that (assuming the operator is stuck/sleeping) the replica will resume replicating from the freshly restarted master, which is potentially empty, leading to data loss.
It seems that upstream Redis exposes a `run_id` INFO field, which is basically a unique ID generated on each startup. I suppose we could avoid the data loss described above by either:
- Adding the same INFO field to Dragonfly (or maybe there is already another field we can use to uniquely identify a server execution?), plus a way to stop replication on a replica if the connection breaks and the master ID has changed on reconnect
- Preventing replication if it would rewind the replication offset(s?)
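To make the two proposals concrete, here is a minimal sketch of a replica-side guard combining them. It assumes the replica can learn the master's `run_id` and replication offset on (re)connect; both the `ReplicationGuard` class and the `may_resync` method are hypothetical names, not existing Dragonfly APIs:

```python
class ReplicationGuard:
    """Replica-side safety check (sketch of the two proposals above)."""

    def __init__(self):
        self.master_run_id = None  # run_id seen on the last successful sync
        self.acked_offset = 0      # highest replication offset applied locally

    def may_resync(self, master_run_id: str, master_offset: int) -> bool:
        if self.master_run_id is None:
            # First connection: remember the master's identity.
            self.master_run_id = master_run_id
            return True
        if master_run_id != self.master_run_id:
            # Master restarted (new run_id): stop replicating instead of
            # wiping local data; wait for the operator to sort it out.
            return False
        if master_offset < self.acked_offset:
            # The master's offset rewound: it lost data we already have.
            return False
        return True

guard = ReplicationGuard()
print(guard.may_resync("run-A", 1000))  # first sync -> True
guard.acked_offset = 1000
print(guard.may_resync("run-B", 0))     # restarted, empty master -> False
```

With this behavior, the replica with the latest data simply stalls after a master restart rather than re-syncing from an empty dataset.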
The assumption here is that the operator will eventually figure out which replica has the highest offset and configure that one as the master. The cluster will still be unavailable/inconsistent until the operator wakes up, but it will eventually converge to a state where all data is present.
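The operator-side election this assumes is straightforward; a sketch, where the mapping of pod name to last acknowledged offset is a hypothetical shape (in reality the operator would collect offsets from each pod's INFO output):

```python
def pick_new_master(replica_offsets: dict[str, int]) -> str:
    """Pick the replica with the highest replication offset as the new master."""
    return max(replica_offsets, key=replica_offsets.get)

offsets = {"replica-0": 5000, "replica-1": 7200, "replica-2": 7100}
print(pick_new_master(offsets))  # -> replica-1
```

Combined with the replica-side guard, stalled replicas keep their data intact until this election runs, so the cluster converges without losing writes that any replica had acknowledged.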