Description
Search before asking
- I searched in the issues and found nothing similar.
Fluss version
0.8.0 (latest release)
Please describe the bug 🐞
The client can deadlock when multiple concurrent requests trigger synchronous metadata updates on Netty IO threads. The synchronous getPartitionId() call in lookup operations blocks the IO thread while waiting for metadata responses, which can cause a deadlock when many concurrent requests need metadata updates simultaneously.
Root Cause:
- PrimaryKeyLookuper.lookup() synchronously calls getPartitionId() on Netty IO threads
- This blocks IO threads while waiting for metadata updates to complete
- Under high concurrency, all IO threads can become blocked waiting for metadata responses
- No available threads to process incoming metadata responses → deadlock
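The starvation pattern described above can be reproduced in isolation. The following is a minimal sketch, not Fluss code: a single-threaded executor stands in for a Netty IO thread, and the class name, method names, and the value 42 are all hypothetical. The blocking variant waits on a future whose completion is queued behind it on the same thread, so it can only time out. The callback variant returns immediately and lets the queued "response" run.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical stand-in for the client; illustrates the pattern only. */
class IoThreadStarvationDemo {

    /** Blocks the only "IO thread" on a future that the same thread must complete. */
    static boolean blockingLookupTimesOut(long timeoutMillis) throws Exception {
        ExecutorService ioThread = Executors.newSingleThreadExecutor();
        try {
            return ioThread.submit(() -> {
                CompletableFuture<Long> partitionId = new CompletableFuture<>();
                // The "metadata response" handler is queued behind the current task.
                ioThread.execute(() -> partitionId.complete(42L));
                try {
                    // Synchronous wait: the handler above can never run meanwhile.
                    partitionId.get(timeoutMillis, TimeUnit.MILLISECONDS);
                    return false;
                } catch (TimeoutException e) {
                    return true;
                }
            }).get();
        } finally {
            ioThread.shutdownNow();
        }
    }

    /** Registers a callback instead of blocking, so the response can be processed. */
    static long asyncLookup() throws Exception {
        ExecutorService ioThread = Executors.newSingleThreadExecutor();
        try {
            CompletableFuture<Long> result = new CompletableFuture<>();
            ioThread.execute(() -> {
                CompletableFuture<Long> partitionId = new CompletableFuture<>();
                ioThread.execute(() -> partitionId.complete(42L));
                // Return right away; the thread is free to run the queued handler.
                partitionId.thenAccept(result::complete);
            });
            return result.get(5, TimeUnit.SECONDS);
        } finally {
            ioThread.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("blocking lookup timed out: " + blockingLookupTimesOut(200));
        System.out.println("async lookup result: " + asyncLookup());
    }
}
```

In the real client the wait has no timeout, so the blocked thread never recovers once every IO thread is in this state.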
Impact:
- Client hangs indefinitely when lookup operations require partition metadata
- Affects all operations using primary key or secondary index lookups
- More likely to occur with partitioned tables under high concurrency
Solution
Refactor the metadata update mechanism to be fully asynchronous, with request batching and deduplication:
- Async Metadata Updates: Provide CompletableFuture-based APIs (checkAndUpdatePartitionMetadataAsync(), updateMetadataAsync()) to avoid blocking Netty IO threads
- Request Batching: Aggregate multiple concurrent metadata requests into a single RPC call to reduce network overhead and contention
- Request Deduplication: Multiple concurrent requests for the same resource (table/partition) share the same update future, preventing duplicate RPC calls
- Atomic Resource Keys: Each metadata resource (table, partition, partition ID) is treated as an atomic unit for deduplication
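One possible shape for the batching and deduplication described above is sketched here. It is an illustration under assumptions, not the proposed implementation: the class name, the use of plain strings as resource keys, the `batchRpc` function, the explicit `flush()` trigger, and the `updateMetadataAsync(String)` signature are all hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

/** Hypothetical sketch of deduplicated, batched, non-blocking metadata updates. */
class AsyncMetadataUpdaterSketch {

    /** Assumed RPC hook: fetches metadata for a whole set of resource keys in one call. */
    private final Function<Set<String>, CompletableFuture<Void>> batchRpc;

    /** One shared future per resource key that is waiting for the next batch. */
    private final Map<String, CompletableFuture<Void>> pending = new HashMap<>();

    AsyncMetadataUpdaterSketch(Function<Set<String>, CompletableFuture<Void>> batchRpc) {
        this.batchRpc = batchRpc;
    }

    /** Deduplication: concurrent callers asking for the same key get the same future. */
    synchronized CompletableFuture<Void> updateMetadataAsync(String resourceKey) {
        return pending.computeIfAbsent(resourceKey, k -> new CompletableFuture<>());
    }

    /** Batching: sends a single RPC covering every key requested since the last flush. */
    CompletableFuture<Void> flush() {
        Map<String, CompletableFuture<Void>> batch;
        synchronized (this) {
            if (pending.isEmpty()) {
                return CompletableFuture.completedFuture(null);
            }
            batch = new HashMap<>(pending);
            pending.clear();
        }
        return batchRpc.apply(new HashSet<>(batch.keySet()))
                .whenComplete((ignored, error) -> {
                    // Propagate the RPC outcome to every caller waiting on this batch.
                    for (CompletableFuture<Void> waiter : batch.values()) {
                        if (error != null) {
                            waiter.completeExceptionally(error);
                        } else {
                            waiter.complete(null);
                        }
                    }
                });
    }

    public static void main(String[] args) {
        AsyncMetadataUpdaterSketch updater = new AsyncMetadataUpdaterSketch(keys -> {
            System.out.println("one RPC for keys: " + keys);
            return CompletableFuture.completedFuture(null);
        });
        CompletableFuture<Void> first = updater.updateMetadataAsync("table:db.t1");
        CompletableFuture<Void> second = updater.updateMetadataAsync("table:db.t1");
        updater.updateMetadataAsync("partition:db.t1/p=1");
        System.out.println("same future shared: " + (first == second));
        updater.flush();
    }
}
```

Callers chain their follow-up work onto the returned future (for example with `thenCompose`) rather than calling `get()`, so no IO thread waits. In this simplified version a key requested while its batch is already in flight joins the next batch; a fuller design might keep the in-flight future registered until the response arrives.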
Are you willing to submit a PR?
- I'm willing to submit a PR!