Description
Our current implementation of CFT consensus still suffers from some limitations and liveness issues around omission faults (i.e. dropped messages). This is especially the case if one or several nodes are partitioned out of the service, as demonstrated by the end-to-end tests introduced in #2553: a single partitioned backup will automatically become candidate if it has been partitioned for >= `election_timeout`.
Note that this is only true when no new write transactions are processed by the current leader. Otherwise, the partitioned node wouldn't be able to win an election, as its last known `seqno` would be behind.
The following two extensions should help mitigate this family of issues:
1. PreVote
Each potential candidate should first check that a quorum of nodes would accept it as the primary should it become candidate. Only then should the node transition to the candidate state and request votes from other nodes.
Other nodes should respond to `PreVote` messages as if it were a real election, but don't need to keep track of which nodes they have granted their `PreVote` to. It is only when a quorum of nodes have responded positively to the `PreVote` round that the node can become candidate.
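The quorum condition above can be sketched as follows. This is a minimal illustration, not CCF's implementation; the `PreVoteTally` class and its field names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class PreVoteTally:
    """Tracks positive PreVote responses for one PreVote round (hypothetical sketch)."""
    node_count: int                      # total number of nodes in the service
    granted: set = field(default_factory=set)

    def record_grant(self, node_id: str) -> None:
        self.granted.add(node_id)

    def can_become_candidate(self) -> bool:
        # The pre-candidate counts itself; it may transition to candidate
        # only once a strict majority of all nodes is reached.
        return len(self.granted) + 1 > self.node_count // 2
```

For example, in a 5-node service the pre-candidate needs positive responses from 2 other nodes before it may start a real election.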
2. Leader stickiness/CheckQuorum
The goal here is to make sure that a primary stays primary for as long as possible, i.e. doesn't step down just because a single node started an election.
Nodes should only grant their `PreVote`/`Vote` if they haven't heard from a primary within their election timeout. As I understand it, this implies that a node should only grant a `PreVote`/`Vote` when it is already in the new "campaign" (is this a good name for it?) or candidate state.
Moreover, a primary should actively step down (i.e. become a follower in the same term in which it was primary) if it hasn't heard `AppendEntries` responses from a majority of backups within the election timeout.
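The step-down check could look like the following sketch, where `last_ack_times` maps each backup to the time of its most recent `AppendEntries` response. The function name and signature are assumptions for illustration only:

```python
def primary_should_step_down(now: float,
                             last_ack_times: dict,
                             node_count: int,
                             election_timeout: float) -> bool:
    """CheckQuorum sketch (hypothetical): the primary steps down if,
    counting itself, it can no longer reach a majority of nodes that
    have responded within the election timeout."""
    live_backups = sum(1 for t in last_ack_times.values()
                       if now - t < election_timeout)
    return live_backups + 1 <= node_count // 2
```

In a 5-node service, a primary that has heard recent responses from only one backup (2 live nodes including itself) would step down, while hearing from two backups keeps it in place.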
Note that this also impacts the "sunny day" election scenario: the first half of the nodes whose election timeout expires wouldn't manage to get a quorum of votes, because the other nodes still know about the current leader and haven't yet timed out. This is also a positive change, as it would effectively average out the election timeout of the service over a quorum of nodes rather than have it set by the single node with the smallest election timeout.
Sources: