
When a TabletServer shuts down cleanly, the client throws OutOfOrderSequenceException #709

Open
@swuferhong

Description


Search before asking

  • I searched in the issues and found nothing similar.

Fluss version

main (development)

Please describe the bug 🐞

When we kill multiple TabletServers, the client throws a large number of OutOfOrderSequenceException errors.

The root cause is the current way highWatermark is updated: the follower's highWatermark is updated first, and the leader's highWatermark is updated only in the next round (see #676). This introduces a problem:
during recovery, the follower's highWatermark might be ahead of the ackedBatchSequence known to the client. For instance:

  1. The follower has written batches with batchSequence 3, 4 and updated its highWatermark.
  2. Meanwhile, the client has not yet received acks for batches 3 and 4, so it still considers its ackedBatchSequence to be 2.
  3. If the leader and the follower crash at this point, the client assumes the write operation has failed.
  4. The follower is then elected as the new leader.
  5. When handling a new request for batch 3, the new leader detects an inconsistency, triggering an OutOfOrderSequenceException.
  6. The client identifies this scenario as unrecoverable and cannot retry the request, ultimately causing a failover.
    This mismatch between the follower’s writerState and the client’s ackedBatchSequence directly leads to the failure.
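
The steps above can be replayed with a small, self-contained sketch. It is not Fluss code: `SequenceMismatchSketch`, `ReplicaWriterState`, and `OutOfOrderSequence` are hypothetical names, and the rule "a batch is accepted only if its sequence is exactly the last appended sequence plus one" is an assumption made for illustration.

```java
/** Hypothetical model of the scenario; none of these names are Fluss APIs. */
final class SequenceMismatchSketch {

    /** Stand-in for the error the server reports to the client. */
    static final class OutOfOrderSequence extends RuntimeException {
        OutOfOrderSequence(String message) {
            super(message);
        }
    }

    /** Per-writer state kept by a replica: the last batch sequence it appended. */
    static final class ReplicaWriterState {
        private int lastBatchSequence;

        ReplicaWriterState(int lastBatchSequence) {
            this.lastBatchSequence = lastBatchSequence;
        }

        /** Assumed rule: accept a batch only if it directly follows the last one. */
        void append(int batchSequence) {
            int expected = lastBatchSequence + 1;
            if (batchSequence != expected) {
                throw new OutOfOrderSequence(
                        "expected " + expected + " but got " + batchSequence);
            }
            lastBatchSequence = batchSequence;
        }
    }

    /** Replays steps 1 to 5; returns the error message, or null if the retry is accepted. */
    static String replay() {
        // Step 1: the follower has appended batches 3 and 4 on top of batch 2.
        ReplicaWriterState follower = new ReplicaWriterState(2);
        follower.append(3);
        follower.append(4);

        // Step 2: the client never saw acks for 3 and 4.
        int ackedBatchSequence = 2;

        // Steps 3 and 4: after the crash, the follower becomes the new leader
        // and keeps the writer state it recovered.
        ReplicaWriterState newLeader = follower;

        // Step 5: the client retries the first unacked batch, which is batch 3.
        try {
            newLeader.append(ackedBatchSequence + 1);
            return null;
        } catch (OutOfOrderSequence e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(replay()); // prints "expected 5 but got 3"
    }
}
```

Under this assumed rule the new leader expects batch 5, while the client, which only knows that batch 2 was acked, resends batch 3.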

Solution

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!
