Description
Version & Environment
master
What went wrong?
Raft has no available leader: each newly elected leader steps down almost immediately because a commit fails.
[2024-10-09 18:39:02,084] INFO [RaftManager id=2] Become candidate due to fetch timeout (org.apache.kafka.raft.KafkaRaftClient)
[2024-10-09 18:39:02,085] INFO [RaftManager id=2] Completed transition to CandidateState(localId=2, localDirectoryId=ZopiWufpyFjYOJ-CzWKiMQ,epoch=951, retries=1, voteStates={1=UNRECORDED, 2=GRANTED, 3=UNRECORDED}, highWatermark=Optional[LogOffsetMetadata(offset=3186751955, metadata=Optional.empty)], electionTimeoutMs=1495) from FollowerState(fetchTimeoutMs=2000, epoch=950, leaderId=1, voters=[1, 2, 3], highWatermark=Optional[LogOffsetMetadata(offset=3186751955, metadata=Optional.empty)], fetchingSnapshot=Optional.empty) (org.apache.kafka.raft.QuorumState)
[2024-10-09 18:39:02,087] INFO [RaftManager id=2] Completed transition to Leader(localId=2, epoch=951, epochStartOffset=3186751957, highWatermark=Optional.empty, voterStates={1=ReplicaState(nodeId=1, endOffset=Optional.empty, lastFetchTimestamp=-1, lastCaughtUpTimestamp=-1, hasAcknowledgedLeader=false), 2=ReplicaState(nodeId=2, endOffset=Optional.empty, lastFetchTimestamp=-1, lastCaughtUpTimestamp=-1, hasAcknowledgedLeader=true), 3=ReplicaState(nodeId=3, endOffset=Optional.empty, lastFetchTimestamp=-1, lastCaughtUpTimestamp=-1, hasAcknowledgedLeader=false)}) from CandidateState(localId=2, localDirectoryId=ZopiWufpyFjYOJ-CzWKiMQ,epoch=951, retries=1, voteStates={1=GRANTED, 2=GRANTED, 3=UNRECORDED}, highWatermark=Optional[LogOffsetMetadata(offset=3186751955, metadata=Optional.empty)], electionTimeoutMs=1495) (org.apache.kafka.raft.QuorumState)
[2024-10-09 18:39:02,090] INFO [RaftManager id=2] High watermark set to LogOffsetMetadata(offset=3186751958, metadata=Optional[(segmentBaseOffset=3182036153,relativePositionInSegment=260993522)]) for the first time for epoch 951 based on indexOfHw 1 and voters [ReplicaState(nodeId=1, endOffset=Optional[LogOffsetMetadata(offset=3186751958, metadata=Optional[(segmentBaseOffset=3182036153,relativePositionInSegment=260993522)])], lastFetchTimestamp=1728470342090, lastCaughtUpTimestamp=1728470342090, hasAcknowledgedLeader=true), ReplicaState(nodeId=2, endOffset=Optional[LogOffsetMetadata(offset=3186751958, metadata=Optional[(segmentBaseOffset=3182036153,relativePositionInSegment=260993522)])], lastFetchTimestamp=-1, lastCaughtUpTimestamp=-1, hasAcknowledgedLeader=true), ReplicaState(nodeId=3, endOffset=Optional[LogOffsetMetadata(offset=3186751957, metadata=Optional[(segmentBaseOffset=3182036153,relativePositionInSegment=260993416)])], lastFetchTimestamp=1728470342089, lastCaughtUpTimestamp=-1, hasAcknowledgedLeader=true)] (org.apache.kafka.raft.LeaderState)
[2024-10-09 18:39:05,994] ERROR Encountered quorum controller fault: commitStreamSetObject: event failed with IllegalStateException (treated as UnknownServerException) at epoch 951 in 12120 microseconds. Renouncing leadership and reverting to the last committed offset 3186752114. (org.apache.kafka.server.fault.LoggingFaultHandler)
java.lang.IllegalStateException: Attempted to atomically commit 38457 records, but maxRecordsPerBatch is 25000
    at org.apache.kafka.controller.QuorumController.appendRecords(QuorumController.java:1034)
    at org.apache.kafka.controller.QuorumController$ControllerWriteEvent.run(QuorumController.java:936)
    at org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:131)
    at org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:214)
    at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:185)
    at java.base/java.lang.Thread.run(Thread.java:833)
[2024-10-09 18:39:05,994] INFO [RaftManager id=2] Received user request to resign from the current epoch 951 (org.apache.kafka.raft.KafkaRaftClient)
[2024-10-09 18:39:05,994] INFO [RaftManager id=2] Failed to handle fetch from 3 at 3186752115 due to NOT_LEADER_OR_FOLLOWER (org.apache.kafka.raft.KafkaRaftClient)
[2024-10-09 18:39:05,994] INFO [RaftManager id=2] Failed to handle fetch from 1 at 3186752115 due to NOT_LEADER_OR_FOLLOWER (org.apache.kafka.raft.KafkaRaftClient)
[2024-10-09 18:39:05,994] INFO [RaftManager id=2] Failed to handle fetch from 1001 at 3186752115 due to NOT_LEADER_OR_FOLLOWER (org.apache.kafka.raft.KafkaRaftClient)
[2024-10-09 18:39:05,994] INFO [RaftManager id=2] Failed to handle fetch from 1004 at 3186752115 due to NOT_LEADER_OR_FOLLOWER (org.apache.kafka.raft.KafkaRaftClient)
[2024-10-09 18:39:05,994] INFO [RaftManager id=2] Failed to handle fetch from 1003 at 3186752115 due to NOT_LEADER_OR_FOLLOWER (org.apache.kafka.raft.KafkaRaftClient)
[2024-10-09 18:39:05,994] INFO [RaftManager id=2] Failed to handle fetch from 1002 at 3186752115 due to NOT_LEADER_OR_FOLLOWER (org.apache.kafka.raft.KafkaRaftClient)
[2024-10-09 18:39:05,994] INFO [RaftManager id=2] Completed transition to ResignedState(localId=2, epoch=951, voters=[1, 2, 3], electionTimeoutMs=1366, unackedVoters=[1, 3], preferredSuccessors=[1, 3]) from Leader(localId=2, epoch=951, epochStartOffset=3186751957, highWatermark=Optional[LogOffsetMetadata(offset=3186752115, metadata=Optional[(segmentBaseOffset=3182036153,relativePositionInSegment=261000559)])], voterStates={1=ReplicaState(nodeId=1, endOffset=Optional[LogOffsetMetadata(offset=3186752115, metadata=Optional[(segmentBaseOffset=3182036153,relativePositionInSegment=261000559)])], lastFetchTimestamp=1728470345961, lastCaughtUpTimestamp=1728470345961, hasAcknowledgedLeader=true), 2=ReplicaState(nodeId=2, endOffset=Optional[LogOffsetMetadata(offset=3186752115, metadata=Optional[(segmentBaseOffset=3182036153,relativePositionInSegment=261000559)])], lastFetchTimestamp=-1, lastCaughtUpTimestamp=-1, hasAcknowledgedLeader=true), 3=ReplicaState(nodeId=3, endOffset=Optional[LogOffsetMetadata(offset=3186752115, metadata=Optional[(segmentBaseOffset=3182036153,relativePositionInSegment=261000559)])], lastFetchTimestamp=1728470345961, lastCaughtUpTimestamp=1728470345961, hasAcknowledgedLeader=true)}) (org.apache.kafka.raft.QuorumState)
[2024-10-09 18:39:05,998] INFO [RaftManager id=2] Completed transition to Unattached(epoch=952, voters=[1, 2, 3], electionTimeoutMs=1238) from ResignedState(localId=2, epoch=951, voters=[1, 2, 3], electionTimeoutMs=1366, unackedVoters=[], preferredSuccessors=[1, 3]) (org.apache.kafka.raft.QuorumState)
[2024-10-09 18:39:05,998] INFO [RaftManager id=2] Completed transition to Voted(epoch=952, votedKey=ReplicaKey(id=1, directoryId=Optional.empty), voters=[1, 2, 3], electionTimeoutMs=1223, highWatermark=Optional.empty) from Unattached(epoch=952, voters=[1, 2, 3], electionTimeoutMs=1238) (org.apache.kafka.raft.QuorumState)
[2024-10-09 18:39:05,998] INFO [RaftManager id=2] Vote request VoteRequestData(clusterId='DMSoJVXo9Q', topics=[TopicData(topicName='__cluster_metadata', partitions=[PartitionData(partitionIndex=0, candidateEpoch=952, candidateId=1, lastOffsetEpoch=951, lastOffset=3186752115)])]) with epoch 952 is granted (org.apache.kafka.raft.KafkaRaftClient)
[2024-10-09 18:39:06,002] INFO [RaftManager id=2] Completed transition to FollowerState(fetchTimeoutMs=2000, epoch=952, leaderId=1, voters=[1, 2, 3], highWatermark=Optional.empty, fetchingSnapshot=Optional.empty) from Voted(epoch=952, votedKey=ReplicaKey(id=1, directoryId=Optional.empty), voters=[1, 2, 3], electionTimeoutMs=1223, highWatermark=Optional.empty) (org.apache.kafka.raft.QuorumState)
[2024-10-09 18:39:06,100] INFO [RaftManager id=2] High watermark set to Optional[LogOffsetMetadata(offset=3186752116, metadata=Optional.empty)] for the first time for epoch 952 (org.apache.kafka.raft.FollowerState)
[2024-10-09 18:39:07,177] INFO [RaftManager id=2] Become candidate due to fetch timeout (org.apache.kafka.raft.KafkaRaftClient)
[2024-10-09 18:39:07,179] INFO [RaftManager id=2] Completed transition to CandidateState(localId=2, localDirectoryId=ZopiWufpyFjYOJ-CzWKiMQ,epoch=953, retries=1, voteStates={1=UNRECORDED, 2=GRANTED, 3=UNRECORDED}, highWatermark=Optional[LogOffsetMetadata(offset=3186752217, metadata=Optional.empty)], electionTimeoutMs=1295) from FollowerState(fetchTimeoutMs=2000, epoch=952, leaderId=1, voters=[1, 2, 3], highWatermark=Optional[LogOffsetMetadata(offset=3186752217, metadata=Optional.empty)], fetchingSnapshot=Optional.empty) (org.apache.kafka.raft.QuorumState)
What should have happened instead?
How to reproduce the issue?
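Create 50,000+ (5w+) partitions and delete them at a concurrency of 100, in a cluster that has only 5 nodes.
Stop one node and start it again.
As a result, one node ends up with a large number of partitions. Once an upload is triggered, there may be a lot of StreamObjects and Streams to upload.
I think this happens during commitStreamSetObject (commitSSO): the commit produces more records than maxRecordsPerBatch allows (38457 vs. 25000 in the log above), so the controller renounces leadership. This may leave the whole cluster unable to function.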
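One possible direction, shown only as a minimal sketch: split an oversized record list into chunks that each stay within the limit before appending. `RecordBatchSplitter` and `split` are hypothetical names, not part of Kafka or AutoMQ. Whether the records of a single commitStreamSetObject can safely be appended non-atomically is an assumption that the logs above do not confirm.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Kafka or AutoMQ.
final class RecordBatchSplitter {
    private RecordBatchSplitter() {
    }

    // Splits records into consecutive chunks of at most maxRecordsPerBatch elements.
    // Assumption: the caller can tolerate the chunks being committed non-atomically.
    static <T> List<List<T>> split(List<T> records, int maxRecordsPerBatch) {
        if (maxRecordsPerBatch <= 0) {
            throw new IllegalArgumentException("maxRecordsPerBatch must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < records.size(); start += maxRecordsPerBatch) {
            int end = Math.min(start + maxRecordsPerBatch, records.size());
            // Copy the sublist so each batch is independent of the source list.
            batches.add(new ArrayList<>(records.subList(start, end)));
        }
        return batches;
    }
}
```

With the numbers from the stack trace, `RecordBatchSplitter.split(records, 25000)` on 38457 records would yield two chunks of 25000 and 13457 records. If the commit must remain atomic, splitting is not an option and the request would have to be bounded before it reaches the controller.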
Additional information