
RFC: Random Tablet Balancer Mode #18786

@nickvanw


Feature Description

The flow-based balancer introduced in #16351 works well for deployments where query traffic is evenly distributed across all cells containing VTGates. However, it makes a fundamental assumption that may not hold in all topologies: that each cell with a VTGate receives an equal share of inbound application traffic.

In practice, many deployments have uneven traffic distribution across cells.

In these cases the calculated flows no longer match the real load. The balancer computes flows assuming equal input from each VTGate cell, so if most queries actually originate from VTGates in a single cell, the tablets in the other cells are underutilized even though every VTGate is following the calculated global balance. Consider this example:

Application deployment:

  • Cell A: 90% of application traffic --> VTGate A
  • Cell B: 10% of application traffic --> VTGate B
  • Cell C: 0% of application traffic (no VTGate, or unused)

Database deployment:

  • Cell A: 1 replica
  • Cell B: 1 replica
  • Cell C: 1 replica
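
To make the example concrete, here is a rough sketch of the arithmetic. The routing table `assumed_flow` is an assumption made for illustration: it is the split that would give every tablet one third of the load if VTGate A and VTGate B each carried half of the traffic. It is not taken from the balancer's actual output, and `tablet_load` is a hypothetical helper.

```python
from fractions import Fraction as F

# Share of application traffic entering through each VTGate cell (from the example).
traffic = {"A": F(9, 10), "B": F(1, 10)}

# ASSUMED routing (illustration only): fraction of each VTGate's queries sent
# to the tablet in each cell. It is balanced only if A and B carry equal traffic.
assumed_flow = {
    "A": {"A": F(2, 3), "B": F(0), "C": F(1, 3)},
    "B": {"A": F(0), "B": F(2, 3), "C": F(1, 3)},
}


def tablet_load(traffic, routing):
    """Return each tablet's share of total queries for the given traffic split."""
    load = {}
    for gate, share in traffic.items():
        for tablet, fraction in routing[gate].items():
            load[tablet] = load.get(tablet, F(0)) + share * fraction
    return load


# What the routing achieves under the equal-traffic assumption.
planned = tablet_load({"A": F(1, 2), "B": F(1, 2)}, assumed_flow)
# What the same routing produces with the 90/10 split from the example.
actual = tablet_load(traffic, assumed_flow)

for name, load in (("planned", planned), ("actual", actual)):
    print(name, {tablet: str(share) for tablet, share in sorted(load.items())})
```

Under these assumptions the planned load is 1/3 per tablet, while the actual load works out to 3/5 for the tablet in Cell A, 1/15 for Cell B, and 1/3 for Cell C.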

With the flow-based balancer configured with all three cells, each VTGate attempts to balance load equally across all tablets. However, since 90% of queries come through Cell A's VTGate, the tablet in Cell A still receives the majority of the load despite the balancer's efforts.

A random strategy:

  • Ignores cell affinity completely
  • Selects tablets with uniform probability (1/N for N available tablets)
  • Makes no assumptions about traffic distribution across VTGate cells
  • Provides predictable load distribution regardless of application deployment topology

The random mode trades the potential latency benefit of cell affinity for load that is even in expectation across all tablets. This may be the right tradeoff when:

  • Application traffic is concentrated in fewer cells than database replicas
  • Cross-cell latency is acceptable for the workload
  • Avoiding replica hotspots is more important than minimizing latency
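
A minimal sketch of the selection rule described above, assuming a flat list of available tablets. The function name `pick_tablet` and the tablet aliases are hypothetical and do not come from the Vitess codebase. The short simulation replays the 90/10 VTGate split from the example to show that the originating cell has no effect on the choice.

```python
import random
from collections import Counter


def pick_tablet(tablets, rng=random):
    """Pick one tablet uniformly at random (1/N), ignoring cell affinity."""
    if not tablets:
        raise ValueError("no available tablets")
    return rng.choice(tablets)


# Hypothetical aliases: one replica per cell, as in the example.
tablets = ["cell-a/replica", "cell-b/replica", "cell-c/replica"]

rng = random.Random(42)  # fixed seed so the run is repeatable
counts = Counter()
for _ in range(30_000):
    # 90% of queries arrive via VTGate A, 10% via VTGate B.
    gate = "A" if rng.random() < 0.9 else "B"
    # The gate is deliberately unused: random mode does not consider it.
    counts[pick_tablet(tablets, rng)] += 1

for tablet in tablets:
    print(tablet, counts[tablet])
```

Each tablet ends up with roughly one third of the queries regardless of how traffic is split across VTGates. The balance is statistical rather than exact, so short windows can still show some skew.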
