Processing a large incoming AppendEntriesReply in I/O thread can trigger an election timeout on the Receiver #40
Description
I'm not sure of the best way to put trace logging into a GitHub issue, so I'll leave that messiness until the end.
The situation I have encountered is this: with a small enough election timeout (I am using 300ms), a node that tries to re-enter the cluster can receive so many entries in its first AppendEntries message that the I/O thread blocks long enough for the election timeout to fire. The node then appears to reply to all of the AppendEntries messages that backed up in the meantime (because of heartbeats), but with the new term, since it has started an election. I was able to "fix" this by adding a call to scheduleElectionTimeout() at the end of each iteration of the for(LogEntry entry : entries) loop in onAppendEntries. That is not a particularly elegant solution. Changing the config params also fixes it, but I thought it was worth reporting.
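To make the timing argument concrete, here is a minimal sketch of the effect, using a simulated clock instead of real threads. Everything in it is hypothetical: ElectionTimer, timesOutWhileAppending, and the 40ms per-entry cost are illustrative assumptions, not libraft code or measured numbers. Only the 300ms timeout and the 9-entry message come from the report above.

```java
final class ElectionTimeoutSketch {

    /** Hypothetical stand-in for the election timer; not libraft's implementation. */
    static final class ElectionTimer {
        private final long timeoutMillis;
        private long deadlineMillis;

        ElectionTimer(long timeoutMillis, long nowMillis) {
            this.timeoutMillis = timeoutMillis;
            reschedule(nowMillis);
        }

        /** Pushes the deadline out by one full timeout from the given time. */
        void reschedule(long nowMillis) {
            deadlineMillis = nowMillis + timeoutMillis;
        }

        boolean hasExpired(long nowMillis) {
            return nowMillis >= deadlineMillis;
        }
    }

    /**
     * Simulates handling one AppendEntries message with entryCount entries,
     * where adding each entry is assumed to cost costPerEntryMillis.
     * Returns true if the election timeout would fire before the loop finishes.
     */
    static boolean timesOutWhileAppending(int entryCount,
                                          long costPerEntryMillis,
                                          long timeoutMillis,
                                          boolean rescheduleAfterEachEntry) {
        long now = 0; // simulated clock, in milliseconds
        ElectionTimer timer = new ElectionTimer(timeoutMillis, now);
        for (int i = 0; i < entryCount; i++) {
            now += costPerEntryMillis; // simulated cost of adding one entry
            if (timer.hasExpired(now)) {
                return true; // the timer thread would start an election here
            }
            if (rescheduleAfterEachEntry) {
                timer.reschedule(now); // the workaround described above
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // 9 entries, assumed 40ms each, 300ms election timeout.
        System.out.println(timesOutWhileAppending(9, 40, 300, false));
        System.out.println(timesOutWhileAppending(9, 40, 300, true));
    }
}
```

With a single reset at the start of the message, the total processing time is what counts against the timeout. With a reset after each entry, only the cost of one entry does, which is why the workaround hides the problem. It also means a follower keeps deferring its election for as long as it is still working through an old message, so this is a sketch of the trade-off rather than a recommendation.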
I think a follower could also just ignore AppendEntriesReply RPCs instead of failing the precondition that it is a leader. However, I'm sure you have spent more time with the algorithm than I have and may be able to think of a reason why that would be a bad idea.
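As a rough illustration of that suggestion, the sketch below drops a reply when the receiving node is no longer the leader instead of throwing. The Node class, its fields, and the method names are hypothetical and only mirror the shape of the problem; they are not libraft's RaftAlgorithm API.

```java
final class StaleReplySketch {

    enum Role { FOLLOWER, CANDIDATE, LEADER }

    /** Hypothetical node state; starts as leader in term 1, like agent2 in the logs. */
    static final class Node {
        Role role = Role.LEADER;
        long currentTerm = 1;

        /** Steps down if a message carries a newer term (e.g. a RequestVote). */
        void onHigherTerm(long term) {
            if (term > currentTerm) {
                currentTerm = term;
                role = Role.FOLLOWER;
            }
        }

        /**
         * Returns true if the reply was processed, false if it was ignored.
         * A node that has already stepped down treats the reply as stale
         * rather than throwing IllegalStateException.
         */
        boolean onAppendEntriesReply(long replyTerm) {
            onHigherTerm(replyTerm);
            if (role != Role.LEADER) {
                return false; // stale reply: drop it quietly
            }
            // A real leader would update its per-follower bookkeeping here.
            return true;
        }
    }

    public static void main(String[] args) {
        Node agent2 = new Node();
        agent2.onHigherTerm(2); // RequestVote with term 2 arrives first
        System.out.println(agent2.onAppendEntriesReply(2)); // reply is ignored
    }
}
```

This follows the order of events in the leader's log below: the RequestVote for term 2 demotes the node, and the AppendEntriesReply that arrives afterwards is dropped instead of reaching the uncaught-exception handler.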
Here is some evidence of the issue.
The exception on the current leader
[New I/O worker #4] TRACE io.libraft.algorithm.RaftAlgorithm - agent2: RequestVote from agent1: term:2 lastLogIndex:15 lastLogTerm:1
[New I/O worker #4] INFO io.libraft.algorithm.RaftAlgorithm - agent2: changing role LEADER->FOLLOWER in term 2
[New I/O worker #4] INFO io.libraft.algorithm.RaftAlgorithm - agent2: leader changed from agent2 to null
[New I/O worker #4] TRACE io.libraft.algorithm.RaftAlgorithm - agent2: AppendEntriesReply from agent1: term:2 prevLogIndex:6 entryCount:9 applied:false
[New I/O worker #4] ERROR org.jboss.netty.channel.SimpleChannelUpstreamHandler - agent2: uncaught exception processing rpc:AppendEntriesReply{source=agent1, destination=agent2, term=2, prevLogIndex=6, entryCount=9, applied=false} from agent1
java.lang.IllegalStateException: role:FOLLOWER
The follower timing out while processing log entries
[New I/O worker #3] TRACE io.libraft.algorithm.RaftAlgorithm - agent1: AppendEntries from agent2: term:1 commitIndex:15 prevLogIndex:6 prevLogTerm:1 entryCount:9
[New I/O worker #3] INFO io.libraft.algorithm.RaftAlgorithm - agent1: leader changed from null to agent2
[New I/O worker #3] TRACE io.libraft.algorithm.RaftAlgorithm - agent1: add entry:ClientEntry{type=CLIENT, index=7, term=1, command=PrintCommand{commandId=7613830809909165162, toPrint=a string 6}}
[New I/O worker #3] TRACE io.libraft.algorithm.RaftAlgorithm - agent1: add entry:ClientEntry{type=CLIENT, index=8, term=1, command=PrintCommand{commandId=5002286235719145647, toPrint=a string 7}}
[New I/O worker #3] TRACE io.libraft.algorithm.RaftAlgorithm - agent1: add entry:ClientEntry{type=CLIENT, index=9, term=1, command=PrintCommand{commandId=-1668250811023977474, toPrint=a string 8}}
[New I/O worker #3] TRACE io.libraft.algorithm.RaftAlgorithm - agent1: add entry:ClientEntry{type=CLIENT, index=10, term=1, command=PrintCommand{commandId=3591929921586279742, toPrint=a string 9}}
[New I/O worker #3] TRACE io.libraft.algorithm.RaftAlgorithm - agent1: add entry:ClientEntry{type=CLIENT, index=11, term=1, command=PrintCommand{commandId=7008605882065303194, toPrint=a string 10}}
[New I/O worker #3] TRACE io.libraft.algorithm.RaftAlgorithm - agent1: add entry:ClientEntry{type=CLIENT, index=12, term=1, command=PrintCommand{commandId=13354652848121111, toPrint=a string 11}}
[New I/O worker #3] TRACE io.libraft.algorithm.RaftAlgorithm - agent1: add entry:ClientEntry{type=CLIENT, index=13, term=1, command=PrintCommand{commandId=-6877329815755806333, toPrint=a string 12}}
[New I/O worker #3] TRACE io.libraft.algorithm.RaftAlgorithm - agent1: add entry:ClientEntry{type=CLIENT, index=14, term=1, command=PrintCommand{commandId=-5953431672926618152, toPrint=a string 13}}
[New I/O worker #3] TRACE io.libraft.algorithm.RaftAlgorithm - agent1: add entry:ClientEntry{type=CLIENT, index=15, term=1, command=PrintCommand{commandId=7768522539884141905, toPrint=a string 14}}
At index 7 type PRINT
At index 8 type PRINT
At index 9 type PRINT
At index 10 type PRINT
At index 11 type PRINT
At index 12 type PRINT
At index 13 type PRINT
At index 14 type PRINT
At index 15 type PRINT
[Timer-0] INFO io.libraft.algorithm.RaftAlgorithm - agent1: handle election timeout
[Timer-0] INFO io.libraft.algorithm.RaftAlgorithm - agent1: changing role FOLLOWER->CANDIDATE in term 2
[Timer-0] INFO io.libraft.algorithm.RaftAlgorithm - agent1: leader changed from agent2 to null