This repository was archived by the owner on Dec 17, 2018. It is now read-only.

Processing a large incoming AppendEntriesReply in I/O thread can trigger an election timeout on the Receiver #40

@ZymoticB

I'm not sure of the best way to put trace logging into a GitHub issue, so I'll leave that messiness until the end.

The situation I have encountered is this: with a small enough election timeout (I am using 300 ms), when a node tries to re-enter the cluster, the first AppendEntries message it receives can contain enough entries that the I/O thread blocks long enough for an election timeout to fire. The node then seems to send an AppendEntriesReply for every AppendEntries that backed up in the meantime (due to heartbeats), but with the new term, since the node has started an election.

I was able to "fix" this by adding a call to scheduleElectionTimeout() at the end of each iteration of the for (LogEntry entry : entries) loop in onAppendEntries. It is not a particularly elegant solution. Changing the config params also fixes it, but I thought it was worth reporting.
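To make the timing argument concrete, here is a minimal, deterministic sketch of the workaround. It is a toy model with a simulated clock, not libraft's actual code: the class FollowerSketch, the resetPerEntry flag, and the 40 ms per-entry cost are assumptions made up for illustration. Only the names scheduleElectionTimeout and onAppendEntries and the 300 ms timeout come from the report above.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of a follower. The method names mirror the issue, but this is
// a hypothetical stand-in, not the libraft implementation.
class FollowerSketch {
    private final long electionTimeoutMillis;
    private long nowMillis = 0;   // simulated clock, advanced manually
    private long deadlineMillis;  // instant at which the election timeout would fire

    FollowerSketch(long electionTimeoutMillis) {
        this.electionTimeoutMillis = electionTimeoutMillis;
        scheduleElectionTimeout();
    }

    // Push the election deadline one full timeout into the future.
    void scheduleElectionTimeout() {
        deadlineMillis = nowMillis + electionTimeoutMillis;
    }

    // Processes a batch and reports whether the election deadline passed
    // while the (simulated) I/O thread was still busy with it.
    boolean onAppendEntries(List<String> entries, long costPerEntryMillis, boolean resetPerEntry) {
        scheduleElectionTimeout(); // the timer is reset once when the message arrives
        boolean timedOut = false;
        for (String entry : entries) {
            nowMillis += costPerEntryMillis; // assumed cost of appending one entry
            if (nowMillis >= deadlineMillis) {
                timedOut = true;
            }
            if (resetPerEntry) {
                scheduleElectionTimeout(); // the workaround described in the issue
            }
        }
        return timedOut;
    }

    public static void main(String[] args) {
        List<String> batch = Arrays.asList("e7", "e8", "e9", "e10", "e11", "e12", "e13", "e14", "e15");
        System.out.println(new FollowerSketch(300).onAppendEntries(batch, 40, false));
        System.out.println(new FollowerSketch(300).onAppendEntries(batch, 40, true));
    }
}
```

With nine entries at an assumed 40 ms each, the batch takes 360 ms in total, which exceeds the 300 ms timeout unless the deadline is pushed back after each entry. The trade-off is that the follower keeps deferring an election for as long as it is making progress on a batch, which is probably why the workaround feels inelegant.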

I think a follower could also just ignore AppendEntriesReply RPCs instead of failing on the precondition of being a leader. However, I'm sure you have spent more time with the algorithm than I have and may be able to think of a reason why that would be a bad idea.
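The difference between the two behaviours can be sketched as follows. This is an illustration under assumptions, not libraft's API: the Role enum, ReplyHandlerSketch, and both handler methods are hypothetical names. The strict variant only reproduces the style of the precondition failure seen in the log below (IllegalStateException: role:FOLLOWER); the lenient variant drops the reply instead.

```java
// Hypothetical roles and handler; illustrative names, not libraft's classes.
enum Role { FOLLOWER, CANDIDATE, LEADER }

class ReplyHandlerSketch {
    private Role role;
    private int processedReplies = 0;

    ReplyHandlerSketch(Role role) {
        this.role = role;
    }

    void setRole(Role role) {
        this.role = role;
    }

    int processedReplies() {
        return processedReplies;
    }

    // Strict variant: fails when the receiver is no longer the leader,
    // in the same style as the exception shown in the log.
    void onAppendEntriesReplyStrict(String source, long term) {
        if (role != Role.LEADER) {
            throw new IllegalStateException("role:" + role);
        }
        processedReplies++;
    }

    // Lenient variant: a non-leader silently drops the reply.
    // Returns true only if the reply was actually processed.
    boolean onAppendEntriesReplyLenient(String source, long term) {
        if (role != Role.LEADER) {
            return false; // stale reply addressed to a former leader
        }
        processedReplies++;
        return true;
    }

    public static void main(String[] args) {
        ReplyHandlerSketch agent2 = new ReplyHandlerSketch(Role.LEADER);
        agent2.setRole(Role.FOLLOWER); // stepped down after seeing a newer term
        System.out.println(agent2.onAppendEntriesReplyLenient("agent1", 2));
    }
}
```

A real handler would presumably still need to inspect the reply's term before dropping it, so this sketch only covers the role check and says nothing about term handling.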

Here is some evidence of the issue.

The exception on the current leader

[New I/O worker #4] TRACE io.libraft.algorithm.RaftAlgorithm - agent2: RequestVote from agent1: term:2 lastLogIndex:15 lastLogTerm:1
[New I/O worker #4] INFO io.libraft.algorithm.RaftAlgorithm - agent2: changing role LEADER->FOLLOWER in term 2
[New I/O worker #4] INFO io.libraft.algorithm.RaftAlgorithm - agent2: leader changed from agent2 to null
[New I/O worker #4] TRACE io.libraft.algorithm.RaftAlgorithm - agent2: AppendEntriesReply from agent1: term:2 prevLogIndex:6 entryCount:9 applied:false
[New I/O worker #4] ERROR org.jboss.netty.channel.SimpleChannelUpstreamHandler - agent2: uncaught exception processing rpc:AppendEntriesReply{source=agent1, destination=agent2, term=2, prevLogIndex=6, entryCount=9, applied=false} from agent1
java.lang.IllegalStateException: role:FOLLOWER

The follower timing out while processing log entries

[New I/O worker #3] TRACE io.libraft.algorithm.RaftAlgorithm - agent1: AppendEntries from agent2: term:1 commitIndex:15 prevLogIndex:6 prevLogTerm:1 entryCount:9
[New I/O worker #3] INFO io.libraft.algorithm.RaftAlgorithm - agent1: leader changed from null to agent2
[New I/O worker #3] TRACE io.libraft.algorithm.RaftAlgorithm - agent1: add entry:ClientEntry{type=CLIENT, index=7, term=1, command=PrintCommand{commandId=7613830809909165162, toPrint=a string 6}}
[New I/O worker #3] TRACE io.libraft.algorithm.RaftAlgorithm - agent1: add entry:ClientEntry{type=CLIENT, index=8, term=1, command=PrintCommand{commandId=5002286235719145647, toPrint=a string 7}}
[New I/O worker #3] TRACE io.libraft.algorithm.RaftAlgorithm - agent1: add entry:ClientEntry{type=CLIENT, index=9, term=1, command=PrintCommand{commandId=-1668250811023977474, toPrint=a string 8}}
[New I/O worker #3] TRACE io.libraft.algorithm.RaftAlgorithm - agent1: add entry:ClientEntry{type=CLIENT, index=10, term=1, command=PrintCommand{commandId=3591929921586279742, toPrint=a string 9}}
[New I/O worker #3] TRACE io.libraft.algorithm.RaftAlgorithm - agent1: add entry:ClientEntry{type=CLIENT, index=11, term=1, command=PrintCommand{commandId=7008605882065303194, toPrint=a string 10}}
[New I/O worker #3] TRACE io.libraft.algorithm.RaftAlgorithm - agent1: add entry:ClientEntry{type=CLIENT, index=12, term=1, command=PrintCommand{commandId=13354652848121111, toPrint=a string 11}}
[New I/O worker #3] TRACE io.libraft.algorithm.RaftAlgorithm - agent1: add entry:ClientEntry{type=CLIENT, index=13, term=1, command=PrintCommand{commandId=-6877329815755806333, toPrint=a string 12}}
[New I/O worker #3] TRACE io.libraft.algorithm.RaftAlgorithm - agent1: add entry:ClientEntry{type=CLIENT, index=14, term=1, command=PrintCommand{commandId=-5953431672926618152, toPrint=a string 13}}
[New I/O worker #3] TRACE io.libraft.algorithm.RaftAlgorithm - agent1: add entry:ClientEntry{type=CLIENT, index=15, term=1, command=PrintCommand{commandId=7768522539884141905, toPrint=a string 14}}
At index 7 type PRINT
At index 8 type PRINT
At index 9 type PRINT
At index 10 type PRINT
At index 11 type PRINT
At index 12 type PRINT
At index 13 type PRINT
At index 14 type PRINT
At index 15 type PRINT
[Timer-0] INFO io.libraft.algorithm.RaftAlgorithm - agent1: handle election timeout
[Timer-0] INFO io.libraft.algorithm.RaftAlgorithm - agent1: changing role FOLLOWER->CANDIDATE in term 2
[Timer-0] INFO io.libraft.algorithm.RaftAlgorithm - agent1: leader changed from agent2 to null
