Description
Search before asking
- I searched in the issues and found nothing similar.
Fluss version
0.8.0 (latest release)
Please describe the bug 🐞
There is a situation where the ServerNode info of all TabletServers may become invalid in the client metadata during a cluster upgrade, which causes write/read/lookup operations to block forever. Imagine this case:
There is a Fluss cluster with 3 TabletServers and a Fluss write job running with high parallelism. The cluster undergoes a rolling upgrade under the following conditions:
- Pods are not upgraded in-place; their IP addresses change after restart.
- Fluss networking has no built-in timeout mechanism of its own and relies solely on the Netty client's connection timeout. If a request is sent to a disconnected server, the client waits for the server's response (sync acknowledgment) until the 120-second Netty connection timeout is reached (see the sketch after this list).
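To make the blocking behavior concrete, here is a minimal sketch of a Netty client whose only guard is `ChannelOption.CONNECT_TIMEOUT_MILLIS`; the 120-second value, the port, and the stale address are illustrative assumptions, not Fluss's actual configuration:

```java
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.ChannelFuture;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.ChannelOption;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.NioSocketChannel;
import io.netty.channel.socket.SocketChannel;

public class ConnectTimeoutDemo {
    public static void main(String[] args) throws Exception {
        EventLoopGroup group = new NioEventLoopGroup();
        try {
            Bootstrap bootstrap = new Bootstrap()
                    .group(group)
                    .channel(NioSocketChannel.class)
                    // The only guard in place: the connect attempt itself times out
                    // after 120s; there is no per-request timeout on top of it.
                    .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 120_000)
                    .handler(new ChannelInitializer<SocketChannel>() {
                        @Override
                        protected void initChannel(SocketChannel ch) {
                            // no handlers needed for this demonstration
                        }
                    });

            // 192.108.0.1 stands in for a stale TabletServer address that no
            // longer exists; the caller blocks here until the 120s connect
            // timeout fires before it can try another server.
            ChannelFuture future = bootstrap.connect("192.108.0.1", 9123);
            future.awaitUninterruptibly();
            if (!future.isSuccess()) {
                System.out.println("connect failed: " + future.cause());
            }
        } finally {
            group.shutdownGracefully();
        }
    }
}
```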
Below is the failure scenario:
- Initial state:
  - ts-0 → 192.108.0.1
  - ts-1 → 192.108.0.2
  - ts-2 → 192.108.0.3
- Upgrade starts: ts-0 becomes unreachable. The client attempts to `updateMetadata` by sending the request to ts-0, fails to connect, and waits 120 seconds.
- ts-0 restarts with a new IP: 192.108.0.4.
- The client retries `updateMetadata`, again targeting ts-0 (based on stale metadata), and waits another 120 seconds. Meanwhile, ts-1 finishes its upgrade and gets a new IP: 192.108.0.5.
- Another `updateMetadata` attempt is made, possibly to ts-0 or ts-1 (both now with outdated IPs in the client's cache), and the client waits yet another 120 seconds. At this point, ts-2 also completes its upgrade and changes its IP to 192.108.0.6.
- After this, all TabletServer IPs in the client's metadata cache are stale. No matter which server the client tries to contact, it uses an incorrect IP, causing all subsequent requests to time out. As a result, the job cannot recover automatically.
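The end state can be illustrated with a hypothetical version of the client's metadata-update loop (`ServerNode`, `sendMetadataRequest`, and the constants are illustrative names, not Fluss's real classes): every cached address is stale, so each attempt costs the full connection timeout and the loop never obtains fresh metadata.

```java
import java.util.List;

// Hypothetical sketch of the client-side loop implied by the scenario above.
public class StaleMetadataLoop {

    static class ServerNode {
        final String host; // may be a stale IP after a rolling upgrade
        final int port;
        ServerNode(String host, int port) { this.host = host; this.port = port; }
    }

    static final long CONNECT_TIMEOUT_MS = 120_000; // Netty connect timeout

    // Every cached address is stale, so each attempt burns the full connect
    // timeout and the loop never reaches a live TabletServer.
    static void updateMetadata(List<ServerNode> cachedNodes) {
        for (ServerNode node : cachedNodes) {
            try {
                // blocks for up to CONNECT_TIMEOUT_MS because node.host no
                // longer points at a live TabletServer
                sendMetadataRequest(node, CONNECT_TIMEOUT_MS);
                return; // never reached once all cached IPs are outdated
            } catch (Exception connectFailure) {
                // fall through and try the next (also stale) cached node
            }
        }
        // all cached nodes failed: without falling back to fresh bootstrap
        // addresses, the next attempt repeats the same 3 x 120s wait
    }

    static void sendMetadataRequest(ServerNode node, long timeoutMs) throws Exception {
        throw new java.net.ConnectException("stale address " + node.host + ":" + node.port);
    }
}
```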
Solution
No response
Are you willing to submit a PR?
- I'm willing to submit a PR!