Description
dragonfly version: 1.33.1
When running a master/replica setup on k8s with the dragonfly operator, if the master fails there is no guarantee that the operator can kick in before the master pod restarts. This itself is not a bug; it's just the nature of a distributed system. What it means in practice, though, is that (assuming the operator is stuck/sleeping) the replica will resume replicating from the freshly restarted master, which is potentially empty, leading to data loss.
It seems that upstream Redis exposes a `run_id` INFO field, which is basically a unique ID generated on each startup. I suppose we could avoid the data loss described above by either:
- Adding the same INFO field to Dragonfly (or maybe there is already another field we can use to uniquely identify a server execution?), plus a way to stop replication on a replica if the connection breaks and the master ID has changed on reconnect
- Preventing replication if it would rewind the replication offset(s?)
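To make the two proposals concrete, here is a minimal sketch of a replica-side guard combining them. It assumes the replica can learn the master's `run_id` and replication offset on (re)connect; both the `ReplicationGuard` class and the `may_resync` method are hypothetical names, not existing Dragonfly APIs:

```python
class ReplicationGuard:
    """Replica-side safety check (sketch of the two proposals above)."""

    def __init__(self):
        self.master_run_id = None  # run_id seen on the last successful sync
        self.acked_offset = 0      # highest replication offset applied locally

    def may_resync(self, master_run_id: str, master_offset: int) -> bool:
        if self.master_run_id is None:
            # First connection: remember the master's identity.
            self.master_run_id = master_run_id
            return True
        if master_run_id != self.master_run_id:
            # Master restarted (new run_id): stop replicating instead of
            # wiping local data; wait for the operator to sort it out.
            return False
        if master_offset < self.acked_offset:
            # The master's offset rewound: it lost data we already have.
            return False
        return True

guard = ReplicationGuard()
print(guard.may_resync("run-A", 1000))  # first sync -> True
guard.acked_offset = 1000
print(guard.may_resync("run-B", 0))     # restarted, empty master -> False
```

With this behavior, the replica with the latest data simply stalls after a master restart rather than re-syncing from an empty dataset.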
The assumption here is that the operator will eventually figure out which replica has the highest offset and configure that one as the master. The cluster will still be unavailable/inconsistent until the operator wakes up, but it will eventually converge to a state where all data is present.
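The operator-side election this assumes is straightforward; a sketch, where the mapping of pod name to last acknowledged offset is a hypothetical shape (in reality the operator would collect offsets from each pod's INFO output):

```python
def pick_new_master(replica_offsets: dict[str, int]) -> str:
    """Pick the replica with the highest replication offset as the new master."""
    return max(replica_offsets, key=replica_offsets.get)

offsets = {"replica-0": 5000, "replica-1": 7200, "replica-2": 7100}
print(pick_new_master(offsets))  # -> replica-1
```

Combined with the replica-side guard, stalled replicas keep their data intact until this election runs, so the cluster converges without losing writes that any replica had acknowledged.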