Description
I discovered two issues with persistence in PySyncObj: the election vote is not written to persistent storage, and the Raft log is not synced to disk and therefore not guaranteed to be preserved.
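SyncObj.__votedFor, the variable storing which node the current node has voted for in a running election, is not written to persistent storage before the vote is granted. This means that if a node crashes after granting its vote and is restarted while the same election is still running, it no longer remembers that vote and can grant a second one to a different candidate. Votes from fewer than a majority of distinct nodes can then be enough to win, so it is possible for a leader to be elected by a minority of the cluster.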
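A minimal sketch of what persisting the vote before granting it could look like. This is not PySyncObj code: VoteStore, try_grant and the JSON file layout are hypothetical names chosen for illustration, and the usual Raft check that the candidate's log is up to date is left out. The only point shown is that the term and the vote reach disk (temp file, fsync, atomic rename) before the node answers the vote request.

```python
import json
import os
import tempfile


class VoteStore:
    """Hypothetical helper: durably stores (term, voted_for) in a small JSON file."""

    def __init__(self, path):
        self.path = path
        self.term, self.voted_for = 0, None
        # Reload the last durable vote after a restart.
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as f:
                state = json.load(f)
            self.term, self.voted_for = state["term"], state["voted_for"]

    def _persist(self, term, voted_for):
        # Write to a temp file, fsync it, then atomically rename over the old state.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump({"term": term, "voted_for": voted_for}, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)
        # fsync the directory so the rename itself survives a crash (where supported).
        if hasattr(os, "O_DIRECTORY"):
            dir_fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
            try:
                os.fsync(dir_fd)
            finally:
                os.close(dir_fd)

    def try_grant(self, term, candidate):
        """Return True only after the vote has been made durable."""
        if term < self.term:
            return False  # stale term
        if term == self.term and self.voted_for not in (None, candidate):
            return False  # already voted for someone else in this term
        self._persist(term, candidate)
        self.term, self.voted_for = term, candidate
        return True
```

With something like this in place, a node that restarts in the middle of an election reloads its vote and refuses a different candidate in the same term.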
The second issue is that the Raft log is not written to disk before replying to AppendEntries. The FileJournal.add method used in SyncObj calls ResizableFile.write, which writes the data to an mmap, but this data is not synced to disk (using mmap.flush, the Python equivalent of msync(2)). This may cause a loss of committed log entries in certain cases (a slow cluster, or many nodes failing around the same time, for example).
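To illustrate the missing step, here is a small standalone sketch of writing into a file-backed mmap and flushing it before returning. It does not use PySyncObj's FileJournal or ResizableFile; write_and_sync and its parameters are hypothetical names. The relevant part is the m.flush() call, which is what would need to happen before an AppendEntries reply is sent.

```python
import mmap
import os


def write_and_sync(path, offset, data, size=mmap.ALLOCATIONGRANULARITY):
    """Hypothetical helper: write data into a file-backed mmap and sync it to disk."""
    if offset < 0 or offset + len(data) > size:
        raise ValueError("data does not fit into the mapped region")
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # The file must be at least as large as the region being mapped.
        if os.fstat(fd).st_size < size:
            os.ftruncate(fd, size)
        with mmap.mmap(fd, size) as m:
            # At this point the bytes only live in the page cache.
            m[offset:offset + len(data)] = data
            # Flush the mapping to the backing file before acknowledging the write.
            m.flush()
    finally:
        os.close(fd)
```

Flushing on every entry costs latency, so a real fix would probably batch entries and flush once per AppendEntries request, but that is a design choice for the maintainers.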
References regarding persistence in Raft: Figure 2 in the Raft paper and section 3.8 in Diego Ongaro's thesis.