Open
Description
Search before asking
- I searched in the issues and found nothing similar.
Fluss version
main (development)
Please describe the bug 🐞
When we kill multiple TabletServers, the client throws a large number of OutOfOrderSequenceException errors.
The root cause is the current order of highWatermark updates: the follower's highWatermark is updated first, and the leader's highWatermark is updated later, in the next round (see #676). This approach introduces a problem:
During recovery, the follower's highWatermark might already cover batches beyond the ackedBatchSequence held by the client. For instance:
- The follower has written batches with batchSequence 3 and 4 and updated its highWatermark.
- Meanwhile, the client has not yet received acks for batches 3 and 4, so it still considers its ackedBatchSequence to be 2.
- If the leader and follower crash at this point, the client assumes the write operation has failed.
- The follower is then elected as the new leader.
- When handling a new request for batch 3, the new leader detects an inconsistency, triggering an OutOfOrderSequenceException.
- The client identifies this scenario as unrecoverable and cannot retry the request, ultimately causing a failover.
This mismatch between the follower’s writerState and the client’s ackedBatchSequence directly leads to the failure.
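
The scenario described above can be replayed with a small, self-contained model. This is only a sketch under stated assumptions: the classes `WriterStateSketch`, `HighWatermarkRaceSketch`, and the local `OutOfOrderSequenceException` are hypothetical stand-ins, not Fluss classes, and the "next sequence must be last + 1" check is a deliberate simplification of whatever validation the server really performs.

```java
// Hypothetical, simplified model of the race; none of these types are Fluss APIs.
class OutOfOrderSequenceException extends RuntimeException {
    OutOfOrderSequenceException(String message) {
        super(message);
    }
}

/** Server-side writer state for one writer on one replica (assumed shape). */
class WriterStateSketch {
    private int lastBatchSequence;

    WriterStateSketch(int lastBatchSequence) {
        this.lastBatchSequence = lastBatchSequence;
    }

    /** Accepts only the batch that directly follows the last recorded one. */
    void append(int batchSequence) {
        int expected = lastBatchSequence + 1;
        if (batchSequence != expected) {
            throw new OutOfOrderSequenceException(
                    "expected batchSequence " + expected + " but got " + batchSequence);
        }
        lastBatchSequence = batchSequence;
    }
}

class HighWatermarkRaceSketch {
    /**
     * Replays the failover: the promoted follower already recorded batches up to
     * followerLastSequence, while the client only saw acks up to clientAckedSequence.
     * Returns the error message, or null if the client's retry is accepted.
     */
    static String replay(int followerLastSequence, int clientAckedSequence) {
        WriterStateSketch newLeader = new WriterStateSketch(followerLastSequence);
        int retrySequence = clientAckedSequence + 1; // client resends the first unacked batch
        try {
            newLeader.append(retrySequence);
            return null;
        } catch (OutOfOrderSequenceException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Follower recorded batches 3 and 4; client still believes only 2 was acked.
        System.out.println(replay(4, 2));
        // Both sides agree on batch 2, so the retry of batch 3 is accepted.
        System.out.println(replay(2, 2));
    }
}
```

In this model the first call fails because the promoted follower is two batches ahead of what the client believes was acknowledged, while the second call succeeds because both sides agree on batch 2.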
Solution
No response
Are you willing to submit a PR?
- I'm willing to submit a PR!