Description
Search before asking
- I searched in the issues and found nothing similar.
Fluss version
0.8.0 (latest release)
Please describe the bug 🐞
There is a situation where the ServerNode info of all TabletServers may become invalid in the client metadata during a cluster upgrade, which causes write/read/lookup operations to block forever. Imagine this case:
There is a Fluss cluster with 3 TabletServers and a Fluss write job running with high parallelism. The cluster undergoes a rolling upgrade under the following conditions:
- Pods are not upgraded in-place; their IP addresses change after restart.
- Fluss networking has no built-in timeout mechanism of its own and relies solely on the Netty client's connection timeout. If a request is sent to a disconnected server, the client waits for the server's response (sync acknowledgment) until the 120-second Netty connection timeout is reached (see the sketch after this list).
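To make the blocking behavior concrete, here is a minimal sketch of a Netty client whose only guard is `ChannelOption.CONNECT_TIMEOUT_MILLIS`; the 120-second value, the port, and the stale address are illustrative assumptions, not Fluss's actual configuration:

```java
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.ChannelFuture;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.ChannelOption;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.NioSocketChannel;
import io.netty.channel.socket.SocketChannel;

public class ConnectTimeoutDemo {
    public static void main(String[] args) throws Exception {
        EventLoopGroup group = new NioEventLoopGroup();
        try {
            Bootstrap bootstrap = new Bootstrap()
                    .group(group)
                    .channel(NioSocketChannel.class)
                    // The only guard in place: the connect attempt itself times out
                    // after 120s; there is no per-request timeout on top of it.
                    .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 120_000)
                    .handler(new ChannelInitializer<SocketChannel>() {
                        @Override
                        protected void initChannel(SocketChannel ch) {
                            // no handlers needed for this demonstration
                        }
                    });

            // 192.108.0.1 stands in for a stale TabletServer address that no
            // longer exists; the caller blocks here until the 120s connect
            // timeout fires before it can try another server.
            ChannelFuture future = bootstrap.connect("192.108.0.1", 9123);
            future.awaitUninterruptibly();
            if (!future.isSuccess()) {
                System.out.println("connect failed: " + future.cause());
            }
        } finally {
            group.shutdownGracefully();
        }
    }
}
```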
Below is the failure scenario:
- Initial state:
  - ts-0 → 192.108.0.1
  - ts-1 → 192.108.0.2
  - ts-2 → 192.108.0.3
- Upgrade starts: ts-0 becomes unreachable. The client attempts to `updateMetadata` by sending the request to ts-0, fails to connect, and waits 120 seconds.
- ts-0 restarts with a new IP: 192.108.0.4.
- The client retries `updateMetadata`, again targeting ts-0 (based on stale metadata), and waits another 120 seconds. Meanwhile, ts-1 finishes its upgrade and gets a new IP: 192.108.0.5.
- Another `updateMetadata` attempt is made, possibly to ts-0 or ts-1 (both now with outdated IPs in the client's cache), and the client waits yet another 120 seconds. At this point, ts-2 also completes its upgrade and changes its IP to 192.108.0.6.
- After this, all TabletServer IPs in the client's metadata cache are stale. No matter which server the client tries to contact, it uses an incorrect IP, causing all subsequent requests to time out. As a result, the job cannot recover automatically.
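The end state can be illustrated with a hypothetical version of the client's metadata-update loop (`ServerNode`, `sendMetadataRequest`, and the constants are illustrative names, not Fluss's real classes): every cached address is stale, so each attempt costs the full connection timeout and the loop never obtains fresh metadata.

```java
import java.util.List;

// Hypothetical sketch of the client-side loop implied by the scenario above.
public class StaleMetadataLoop {

    static class ServerNode {
        final String host; // may be a stale IP after a rolling upgrade
        final int port;
        ServerNode(String host, int port) { this.host = host; this.port = port; }
    }

    static final long CONNECT_TIMEOUT_MS = 120_000; // Netty connect timeout

    // Every cached address is stale, so each attempt burns the full connect
    // timeout and the loop never reaches a live TabletServer.
    static void updateMetadata(List<ServerNode> cachedNodes) {
        for (ServerNode node : cachedNodes) {
            try {
                // blocks for up to CONNECT_TIMEOUT_MS because node.host no
                // longer points at a live TabletServer
                sendMetadataRequest(node, CONNECT_TIMEOUT_MS);
                return; // never reached once all cached IPs are outdated
            } catch (Exception connectFailure) {
                // fall through and try the next (also stale) cached node
            }
        }
        // all cached nodes failed: without falling back to fresh bootstrap
        // addresses, the next attempt repeats the same 3 x 120s wait
    }

    static void sendMetadataRequest(ServerNode node, long timeoutMs) throws Exception {
        throw new java.net.ConnectException("stale address " + node.host + ":" + node.port);
    }
}
```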
Solution
No response
Are you willing to submit a PR?
- I'm willing to submit a PR!