Detect master identity change/replication offset rewind and stop replication #385

@abustany

Description

dragonfly version: 1.33.1

When running a master/replica setup on k8s with the Dragonfly operator, if the master fails there's no guarantee that the operator can kick in before the master pod restarts. This is not a bug in itself, it's just the nature of a distributed system. In practice though, it means that (assuming the operator is stuck/sleeping) the replica will resume replicating from the freshly restarted master, which is potentially empty. Syncing from an empty master replaces the replica's own copy of the data, so this leads to data loss.

It seems that upstream Redis exposes a run_id INFO field, which is basically a unique ID generated on each startup. I suppose we could avoid the data loss described above by either:

  1. Adding the same INFO field to Dragonfly (or maybe there's another field we can already use to uniquely identify a server execution?), and adding a way to stop replication on a replica if the connection breaks and the master ID has changed on reconnect
  2. Preventing replication if it would rewind the replication offset(s?)
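
To make the two options concrete, here is a minimal sketch of the check a replica could run when it reconnects. It is not Dragonfly code: `MasterState` and `should_continue_replication` are hypothetical names, and the sketch assumes the master reports a per-startup `run_id` (as Redis does) plus a single replication offset, which may not match how Dragonfly tracks offsets internally.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class MasterState:
    """What a replica last observed about its master (hypothetical type)."""
    run_id: str  # assumed unique per process start, like Redis's run_id
    offset: int  # assumed single replication offset reported by the master


def should_continue_replication(
    last_seen: Optional[MasterState], current: MasterState
) -> Tuple[bool, str]:
    """Decide whether a reconnecting replica may keep replicating.

    Returns (allowed, reason). Option 1 is the run_id comparison,
    option 2 is the offset comparison.
    """
    if last_seen is None:
        # Nothing recorded yet, so there is nothing to protect.
        return True, "first connection"
    if current.run_id != last_seen.run_id:
        # The master process restarted (or is a different server).
        return False, "master identity changed"
    if current.offset < last_seen.offset:
        # Same identity, but the master is behind what we already applied.
        return False, "replication offset rewound"
    return True, "ok"


before = MasterState(run_id="a1b2", offset=1500)
restarted = MasterState(run_id="c3d4", offset=0)
print(should_continue_replication(before, restarted))
```

In this sketch a refused replica simply stops and keeps its data, leaving the decision about who becomes master to the operator.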

The assumption here is that the operator will eventually figure out which replica has the highest offset, and configure that one as the master. The cluster will still be unavailable/inconsistent until the operator wakes up, but will eventually converge to a state where all data is there.
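
The operator-side half of that assumption could look roughly like the following. This is only an illustration of "pick the replica with the highest offset", not the operator's actual logic: `pick_new_master` and the pod names are made up, and it assumes each replica reports one comparable offset.

```python
from typing import Dict


def pick_new_master(replica_offsets: Dict[str, int]) -> str:
    """Return the replica name with the highest replication offset.

    Ties are broken by name (lexicographically smallest wins) so the
    choice is deterministic. Hypothetical helper, not operator code.
    """
    if not replica_offsets:
        raise ValueError("no replicas to choose from")
    # max() returns the first maximal item, so iterating over sorted
    # names makes the smallest name win when offsets are equal.
    return max(sorted(replica_offsets), key=lambda name: replica_offsets[name])


print(pick_new_master({"dragonfly-1": 1500, "dragonfly-2": 1420}))
```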
