Description
Search before asking
- I searched in the issues and found nothing similar.
Fluss version
0.8.0 (latest release)
Please describe the bug 🐞
When processing fetch requests in ReplicaManager.readFromLog(), if any bucket encounters an error (e.g., NOT_LEADER_OR_FOLLOWER, UNKNOWN_TABLE_OR_BUCKET_EXCEPTION), the current implementation immediately short-circuits the entire fetch request.
This short-circuit behavior bypasses the DelayedFetch mechanism, causing the fetch response to be returned immediately. As a result, ReplicaFetcherThread receives the response without any delay and retries immediately. During leader election or bucket migration, these errors persist temporarily, leading to a tight retry loop without any backoff.
Additionally, when ReplicaFetcherThread handles a NOT_LEADER_OR_FOLLOWER error, it does not add the replica to replicasWithError, which prevents proper error tracking and handling.
Solution
- Classify fetch errors into critical and non-critical categories:
  - Non-critical (expected) errors: NOT_LEADER_OR_FOLLOWER, UNKNOWN_TABLE_OR_BUCKET_EXCEPTION
  - Critical errors: all other errors
- Avoid short-circuiting for non-critical errors:
  - Collect non-critical error buckets separately instead of breaking immediately
  - Allow the fetch request to continue processing other buckets and enter the DelayedFetch flow normally
  - Merge the error buckets into the delayed response callback
- Fix error tracking in ReplicaFetcherThread:
  - Add the replica to replicasWithError when a NOT_LEADER_OR_FOLLOWER error occurs
This ensures that, even during leader election or bucket migration, fetch requests still go through the normal delay mechanism, which prevents busy-loop retry storms.
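To make the proposal concrete, here is a minimal, self-contained sketch of the classification and collection steps. It is not the actual Fluss code: the class `FetchErrorSketch`, the `FetchError` enum, `ReadOutcome`, the string bucket keys, and both method signatures are hypothetical stand-ins chosen for illustration. Only the two error names and the overall flow come from this issue.

```java
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; names and types do not mirror the real ReplicaManager.
final class FetchErrorSketch {

    // Stand-in for the server's error codes. OTHER represents any critical error.
    enum FetchError { NONE, NOT_LEADER_OR_FOLLOWER, UNKNOWN_TABLE_OR_BUCKET_EXCEPTION, OTHER }

    // Errors that are expected during leader election or bucket migration.
    static final Set<FetchError> NON_CRITICAL =
            EnumSet.of(FetchError.NOT_LEADER_OR_FOLLOWER,
                       FetchError.UNKNOWN_TABLE_OR_BUCKET_EXCEPTION);

    static boolean isCritical(FetchError error) {
        return error != FetchError.NONE && !NON_CRITICAL.contains(error);
    }

    // Outcome of reading every bucket in one fetch request.
    static final class ReadOutcome {
        final Map<String, FetchError> nonCriticalErrors = new LinkedHashMap<>();
        boolean respondImmediately; // true only when a critical error was seen
    }

    // Walks all buckets without breaking out on non-critical errors.
    static ReadOutcome readFromLog(Map<String, FetchError> perBucketResult) {
        ReadOutcome outcome = new ReadOutcome();
        for (Map.Entry<String, FetchError> entry : perBucketResult.entrySet()) {
            FetchError error = entry.getValue();
            if (isCritical(error)) {
                // Critical errors keep the existing short-circuit behaviour.
                outcome.respondImmediately = true;
            } else if (error != FetchError.NONE) {
                // Expected errors are set aside so the request can still be delayed.
                outcome.nonCriticalErrors.put(entry.getKey(), error);
            }
        }
        return outcome;
    }

    // Called from the delayed response callback: adds the set-aside error
    // buckets to whatever the delayed fetch produced.
    static Map<String, FetchError> mergeIntoDelayedResponse(
            Map<String, FetchError> delayedResults,
            Map<String, FetchError> nonCriticalErrors) {
        Map<String, FetchError> merged = new LinkedHashMap<>(delayedResults);
        merged.putAll(nonCriticalErrors);
        return merged;
    }
}
```

In this sketch a request that contains only healthy buckets and NOT_LEADER_OR_FOLLOWER buckets leaves `respondImmediately` false, so it would enter the delayed path and the error buckets would be reported once the delay completes.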
Are you willing to submit a PR?
- I'm willing to submit a PR!