Description
Our current implementation of CFT consensus still suffers from some limitations and liveness issues around omission faults (i.e. dropped messages). This is especially the case if one or several nodes are partitioned out of the service, as demonstrated by the end-to-end tests introduced in #2553: a single partitioned backup will automatically become candidate if it has been partitioned for >= `election_timeout`.
Note that this is only true when no new write transactions are processed by the current leader. Otherwise, the partitioned node wouldn't be able to win an election, as its last known `seqno` would be behind.
The following two extensions should help mitigate this family of issues:
1. PreVote
Each potential candidate should first check that a quorum of nodes would accept it as the primary should it become candidate. Only then should the node transition to the candidate state and request votes from other nodes.
Other nodes should respond to `PreVote` messages as if it were a real election, but don't need to keep track of which nodes they have granted their `PreVote` to. It is only when a quorum of nodes have responded positively to the `PreVote` round that the node can become candidate.
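The quorum condition above can be sketched as follows. This is a minimal illustration, not CCF's implementation; the `PreVoteTally` class and its field names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class PreVoteTally:
    """Tracks positive PreVote responses for one PreVote round (hypothetical sketch)."""
    node_count: int                      # total number of nodes in the service
    granted: set = field(default_factory=set)

    def record_grant(self, node_id: str) -> None:
        self.granted.add(node_id)

    def can_become_candidate(self) -> bool:
        # The pre-candidate counts itself; it may transition to candidate
        # only once a strict majority of all nodes is reached.
        return len(self.granted) + 1 > self.node_count // 2
```

For example, in a 5-node service the pre-candidate needs positive responses from 2 other nodes before it may start a real election.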
2. Leader stickiness/CheckQuorum
The goal here is to make sure that a primary stays primary for as long as possible, i.e. doesn't step down just because a single node started an election.
Nodes should only grant their `PreVote`/`Vote` if they haven't heard from a primary within their election timeout. As I understand it, this implies that a node should only grant a `PreVote`/`Vote` when it is already in the new "campaign" (is this a good name for it?) or candidate state.
Moreover, a primary should actively step down (i.e. become a follower in the same term in which it was primary) if it hasn't heard `AppendEntries` responses from a majority of backups within the election timeout.
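The step-down check could look like the following sketch, where `last_ack_times` maps each backup to the time of its most recent `AppendEntries` response. The function name and signature are assumptions for illustration only:

```python
def primary_should_step_down(now: float,
                             last_ack_times: dict,
                             node_count: int,
                             election_timeout: float) -> bool:
    """CheckQuorum sketch (hypothetical): the primary steps down if,
    counting itself, it can no longer reach a majority of nodes that
    have responded within the election timeout."""
    live_backups = sum(1 for t in last_ack_times.values()
                       if now - t < election_timeout)
    return live_backups + 1 <= node_count // 2
```

In a 5-node service, a primary that has heard recent responses from only one backup (2 live nodes including itself) would step down, while hearing from two backups keeps it in place.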
Note that this also impacts the "sunny day" election scenario: the first half of the nodes whose election timeout expires wouldn't manage to get a quorum of votes, because the other nodes still know about the current leader and haven't yet timed out. This is also a positive change, as it would effectively average out the election timeout of the service over a quorum of nodes rather than have it set by the single node with the smallest election timeout.
Sources: