[Server] ReplicaFetcher busy loop retry storm during leader election or bucket migration #2073

@platinumhamburg


Search before asking

  • I searched in the issues and found nothing similar.

Fluss version

0.8.0 (latest release)

Please describe the bug 🐞

When ReplicaManager.readFromLog() processes a fetch request and any bucket returns an error (e.g., NOT_LEADER_OR_FOLLOWER, UNKNOWN_TABLE_OR_BUCKET_EXCEPTION), the current implementation immediately short-circuits the entire fetch request.
The short-circuit bypasses the DelayedFetch mechanism, so the fetch response is returned at once. ReplicaFetcherThread therefore receives the response without any delay and retries immediately. During leader election or bucket migration these errors persist for a while, so the fetcher falls into a tight retry loop with no backoff.
Additionally, when ReplicaFetcherThread handles a NOT_LEADER_OR_FOLLOWER error, the replica is not added to replicasWithError, which prevents proper error tracking and handling.

Solution

  1. Classify fetch errors into critical and non-critical categories:
     • Non-critical (expected) errors: NOT_LEADER_OR_FOLLOWER, UNKNOWN_TABLE_OR_BUCKET_EXCEPTION
     • Critical errors: all other errors
  2. Avoid short-circuiting for non-critical errors:
     • Collect non-critical error buckets separately instead of breaking immediately
     • Allow the fetch request to continue processing other buckets and enter the DelayedFetch flow normally
     • Merge the error buckets into the delayed response callback
  3. Fix error tracking in ReplicaFetcherThread:
     • Add the replica to replicasWithError when a NOT_LEADER_OR_FOLLOWER error occurs
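
The classification and collection steps can be illustrated with a minimal, self-contained sketch. It is not the Fluss implementation: `FetchErrorSketch`, `ReadOutcome`, the integer bucket ids, and the `STORAGE_EXCEPTION` code are all hypothetical names introduced for illustration, and the real `readFromLog()` signature is not reproduced here. The sketch only shows the intended control flow, assuming that a critical error still ends the read early while a non-critical error is set aside and merged into the delayed response later.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; none of these types are actual Fluss classes.
final class FetchErrorSketch {

    // STORAGE_EXCEPTION is an invented stand-in for "any other error".
    enum ErrorCode {
        NONE, NOT_LEADER_OR_FOLLOWER, UNKNOWN_TABLE_OR_BUCKET_EXCEPTION, STORAGE_EXCEPTION
    }

    // Errors expected during leader election or bucket migration.
    static final Set<ErrorCode> NON_CRITICAL = EnumSet.of(
            ErrorCode.NOT_LEADER_OR_FOLLOWER,
            ErrorCode.UNKNOWN_TABLE_OR_BUCKET_EXCEPTION);

    static final class ReadOutcome {
        // Buckets that were read without error.
        final List<Integer> readableBuckets = new ArrayList<>();
        // Non-critical errors, to be merged into the delayed response callback.
        final Map<Integer, ErrorCode> deferredErrors = new LinkedHashMap<>();
        // Critical errors, which still force an immediate response.
        final Map<Integer, ErrorCode> criticalErrors = new LinkedHashMap<>();
        boolean respondImmediately = false;
    }

    // perBucketStatus maps a bucket id to the status of reading that bucket.
    static ReadOutcome readFromLog(Map<Integer, ErrorCode> perBucketStatus) {
        ReadOutcome outcome = new ReadOutcome();
        for (Map.Entry<Integer, ErrorCode> entry : perBucketStatus.entrySet()) {
            ErrorCode code = entry.getValue();
            if (code == ErrorCode.NONE) {
                outcome.readableBuckets.add(entry.getKey());
            } else if (NON_CRITICAL.contains(code)) {
                // Do not break: remember the error and keep processing other buckets.
                outcome.deferredErrors.put(entry.getKey(), code);
            } else {
                // Critical error: short-circuit as before.
                outcome.criticalErrors.put(entry.getKey(), code);
                outcome.respondImmediately = true;
                break;
            }
        }
        return outcome;
    }
}
```

With this shape, a request whose only failures are non-critical leaves `respondImmediately` false, so the caller can still hand the request to the delay mechanism and attach `deferredErrors` when the delayed response completes.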

This ensures that even during leader election or bucket migration, fetch requests still go through the normal delay mechanism, preventing busy loop retry storms.
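
The error-tracking fix on the fetcher side can be sketched in the same spirit. The class and method names below are hypothetical and the real ReplicaFetcherThread response handling is not reproduced; the sketch assumes only what the issue states, namely that a bucket answering NOT_LEADER_OR_FOLLOWER should end up in replicasWithError. Treating every other error code the same way is an assumption of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; not the actual ReplicaFetcherThread code.
final class FetcherErrorTrackingSketch {

    enum ErrorCode { NONE, NOT_LEADER_OR_FOLLOWER, UNKNOWN_TABLE_OR_BUCKET_EXCEPTION }

    // Returns the bucket ids to record in replicasWithError for one fetch response.
    static Set<Integer> collectReplicasWithError(Map<Integer, ErrorCode> response) {
        Set<Integer> replicasWithError = new LinkedHashSet<>();
        for (Map.Entry<Integer, ErrorCode> entry : response.entrySet()) {
            switch (entry.getValue()) {
                case NONE:
                    break; // successful fetch, nothing to track
                case NOT_LEADER_OR_FOLLOWER:
                    // The case the issue reports as missing: track the replica.
                    replicasWithError.add(entry.getKey());
                    break;
                default:
                    // Assumption: other errors are tracked as well.
                    replicasWithError.add(entry.getKey());
                    break;
            }
        }
        return replicasWithError;
    }
}
```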

Are you willing to submit a PR?

  • I'm willing to submit a PR!
