ServerNode info of all tabletServers may be invalid in client metadata when cluster upgrading #2097

@swuferhong

Description

Search before asking

  • I searched in the issues and found nothing similar.

Fluss version

0.8.0 (latest release)

Please describe the bug 🐞

There is a situation in which the ServerNode info of all TabletServers in the client metadata may become invalid during a cluster upgrade, which causes write/read/lookup operations to block forever. Imagine this case:

There is a Fluss cluster with 3 TabletServers and a Fluss write job running with high parallelism. The cluster undergoes a rolling upgrade under the following conditions:

  1. Pods are not upgraded in-place—their IP addresses change after restart.
  2. Fluss networking has no built-in timeout mechanism and relies solely on the Netty client’s connection timeout. If a request is sent to a disconnected server, the client blocks waiting for the server’s response (sync acknowledgment) until the Netty timeout of 120 seconds is reached.
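
To illustrate the second condition, here is a minimal sketch of what a request-level timeout could look like, using only the JDK's `CompletableFuture.orTimeout` (Java 9+). This is not Fluss code: the class `RequestTimeoutSketch` and the helpers `withRequestTimeout` and `timedOut` are hypothetical names, and a never-completed future stands in for a request sent to an address that no longer answers.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class RequestTimeoutSketch {

    /** Hypothetical helper: fail the pending response if it does not arrive within timeoutMs. */
    public static <T> CompletableFuture<T> withRequestTimeout(
            CompletableFuture<T> pending, long timeoutMs) {
        // orTimeout completes the future exceptionally with TimeoutException on expiry.
        return pending.orTimeout(timeoutMs, TimeUnit.MILLISECONDS);
    }

    /** Returns true if the future failed because the timeout expired. */
    public static boolean timedOut(CompletableFuture<?> future) {
        try {
            future.get();
            return false;
        } catch (ExecutionException e) {
            return e.getCause() instanceof TimeoutException;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // A response future that nobody completes models a request to a stale IP.
        CompletableFuture<String> neverAnswered = new CompletableFuture<>();
        CompletableFuture<String> bounded = withRequestTimeout(neverAnswered, 200);
        System.out.println("timed out: " + timedOut(bounded));
    }
}
```

With a bound like this, a request to an unreachable server would fail after a configurable interval instead of depending on the 120-second Netty timeout. Whether Fluss should adopt this approach is left open here.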

Below is the failure scenario:

  1. Initial state:
      ts-0 → 192.108.0.1
      ts-1 → 192.108.0.2
      ts-2 → 192.108.0.3

  2. Upgrade starts: ts-0 becomes unreachable. The client attempts to updateMetadata by sending the request to ts-0, fails to connect, and waits 120 seconds.

  3. ts-0 restarts with a new IP: 192.108.0.4.

  4. The client retries updateMetadata, again targeting ts-0 (based on stale metadata), and waits another 120 seconds. Meanwhile, ts-1 finishes its upgrade and gets a new IP: 192.108.0.5.

  5. Another updateMetadata attempt is made—possibly to ts-0 or ts-1 (both now with outdated IPs in the client’s cache)—and the client waits yet another 120 seconds. At this point, ts-2 also completes its upgrade and changes its IP to 192.108.0.6.

  6. After this, all TabletServer IPs in the client’s metadata cache are stale. No matter which server the client tries to contact, it uses an incorrect IP, causing all subsequent requests to time out. As a result, the job cannot recover automatically.
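
The scenario above can be replayed with a small, self-contained simulation. It is a sketch only and does not use any Fluss API: `StaleMetadataSimulation` and its methods are hypothetical names, and the assumption (taken from the steps above) is that every updateMetadata attempt targets an already-stale address and times out, so the client cache is never refreshed.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class StaleMetadataSimulation {

    /** Addresses cached by the client at the initial state, keyed by server name. */
    public static Map<String, String> initialCache() {
        Map<String, String> cache = new LinkedHashMap<>();
        cache.put("ts-0", "192.108.0.1");
        cache.put("ts-1", "192.108.0.2");
        cache.put("ts-2", "192.108.0.3");
        return cache;
    }

    /** Counts cached addresses that still point at a live server. */
    public static int reachableCount(Map<String, String> cache, Set<String> liveAddresses) {
        int reachable = 0;
        for (String address : cache.values()) {
            if (liveAddresses.contains(address)) {
                reachable++;
            }
        }
        return reachable;
    }

    /** Replays the rolling upgrade; returns the reachable count after each restart. */
    public static int[] replayRollingUpgrade() {
        Map<String, String> cache = initialCache();
        Set<String> live = new HashSet<>(cache.values());
        // {old IP, new IP} for ts-0, ts-1, ts-2, in upgrade order.
        String[][] restarts = {
            {"192.108.0.1", "192.108.0.4"},
            {"192.108.0.2", "192.108.0.5"},
            {"192.108.0.3", "192.108.0.6"},
        };
        int[] reachableAfterEachRestart = new int[restarts.length];
        for (int i = 0; i < restarts.length; i++) {
            live.remove(restarts[i][0]);
            live.add(restarts[i][1]);
            // Assumption: the metadata update in this window hits a stale
            // address and times out, so the cache is left unchanged.
            reachableAfterEachRestart[i] = reachableCount(cache, live);
        }
        return reachableAfterEachRestart;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(replayRollingUpgrade())); // prints [2, 1, 0]
    }
}
```

The count drops to zero after the third restart, which matches step 6: from that point no cached address can serve a metadata refresh, so the client has no way to learn the new IPs from its cache alone.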

Solution

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!
