Bug Report: Throttler doesn't ignore connection errors #18022

Open
@mhamza15

Overview of the Issue

When a tablet is shut down, its state in the topology is cleared out and ends up looking like this:

{
  "alias": {
    "cell": "...",
    "uid": 171496207
  },
  "hostname": "",
  "port_map": {},
  "keyspace": "...",
  "shard": "0",
  "key_range": null,
  "type": "REPLICA",
  "db_name_override": "...",
  "tags": {},
  "mysql_hostname": "",
  "mysql_port": 3306,
  "primary_term_start_time": null,
  "default_conn_collation": 45
}

Because the hostname and port_map are empty, the throttler's attempt to connect to this tablet fails, and it reports the shard as unhealthy:

"mysql/shard": {
  "LastHealthyAt": "2025-03-20T12:24:14.045597766-07:00",
  "SecondsSinceLastHealthy": 5083
}

In the case of VReplication, this would stop VReplication entirely. Ideally, the throttler should ignore connection errors. See this Slack thread for more context.

Reproduction Steps

make build

cd examples/local

./101_initial_cluster.sh; ./201_customer_tablets.sh; ./202_move_tables.sh

vtctldclient UpdateThrottlerConfig --enable --throttle-app="all" --throttle-app-ratio 0 --throttle-app-duration 4h customer

primaryuid=$(vtctldclient GetTablets --keyspace customer --tablet-type primary --shard "0" | awk '{print $1}' | cut -d- -f2 | bc)

Observe that the shard is healthy:

$ vtctldclient GetThrottlerStatus zone1-0000000${primaryuid} | jq .status.metrics_health

{
  "last_healthy_at": {
    "seconds": "1742838360",
    "nanoseconds": 529167126
  },
  "seconds_since_last_healthy": "0"
}
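
The `primaryuid` pipeline above reduces the alias to a number, and the command then rebuilds it with a hard-coded `zone1-0000000` prefix, which only works for three-digit uids in `zone1`. A minimal sketch of an alternative, assuming the alias is the first whitespace-separated column of `GetTablets` output as in the listings in this report (`first_tablet_alias` is a hypothetical helper name, not part of Vitess):

```shell
# Hypothetical helper: print the alias (first column) of the first tablet
# listed on stdin. Assumes GetTablets-style, whitespace-separated output.
first_tablet_alias() {
  awk 'NF > 0 { print $1; exit }'
}

# Usage (requires a running cluster):
#   primaryalias=$(vtctldclient GetTablets --keyspace customer --tablet-type primary --shard "0" | first_tablet_alias)
#   vtctldclient GetThrottlerStatus "$primaryalias" | jq .status.metrics_health
```

This keeps the alias intact, so the same steps should work unchanged with other cells or uid widths.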

Then kill one of the tablets:

$ vtctldclient --server localhost:15999 GetTablets
zone1-0000000100 commerce 0 replica localhost:15100 localhost:17100 [] <null>
zone1-0000000101 commerce 0 primary localhost:15101 localhost:17101 [] 2025-03-24T17:02:42Z
zone1-0000000102 commerce 0 rdonly localhost:15102 localhost:17102 [] <null>
zone1-0000000200 customer 0 replica :0 :17200 [] <null>
zone1-0000000201 customer 0 primary localhost:15201 localhost:17201 [] 2025-03-24T17:05:27Z
zone1-0000000202 customer 0 rdonly localhost:15202 localhost:17202 [] <null>

Note tablet 200 with the empty hostname and port. Now check the throttler again and observe that it is unhealthy:

$ vtctldclient --server localhost:15999 GetThrottlerStatus zone1-0000000${primaryuid} | jq .status.metrics_health.shard
{
  "last_healthy_at": {
    "seconds": "1742838425",
    "nanoseconds": 778510514
  },
  "seconds_since_last_healthy": "162"
}
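
To confirm which tablets are in the cleared-out state without reading the listing by eye, the `GetTablets` output can be filtered for rows whose address column has no hostname. This is a rough sketch that assumes the column layout shown above (address in the fifth column, printed as `:0` when the hostname is empty); `empty_host_tablets` is a hypothetical helper name:

```shell
# Hypothetical helper: print the aliases of tablets whose address column
# (5th field) starts with ":", i.e. has an empty hostname such as ":0".
# Reads GetTablets-style output on stdin.
empty_host_tablets() {
  awk '$5 ~ /^:/ { print $1 }'
}

# Usage (requires a running cluster):
#   vtctldclient --server localhost:15999 GetTablets | empty_host_tablets
```

Against the listing above, this would print only `zone1-0000000200`, the tablet that was killed.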

Binary Version

main

Operating System and Environment details

All

Log Fragments
