Description
Overview of the Issue
When a tablet is shut down, its state in the topology is cleared out and ends up looking like this:
{
"alias": {
"cell": "...",
"uid": 171496207
},
"hostname": "",
"port_map": {},
"keyspace": "...",
"shard": "0",
"key_range": null,
"type": "REPLICA",
"db_name_override": "...",
"tags": {},
"mysql_hostname": "",
"mysql_port": 3306,
"primary_term_start_time": null,
"default_conn_collation": 45
}
Because hostname and port_map are empty, the throttler's attempt to connect to this tablet fails, and it reports the shard as unhealthy:
"mysql/shard": {
"LastHealthyAt": "2025-03-20T12:24:14.045597766-07:00",
"SecondsSinceLastHealthy": 5083
}
...which, in the case of VReplication, would stop VReplication entirely. Ideally, the throttler should ignore connection errors. See this Slack thread for more context.
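One possible mitigation, sketched below, is to skip tablets whose topology record carries no address before probing them at all. This is narrower than ignoring connection errors in general, and it is not the throttler's actual code: the Tablet struct, isReachable, and probeTargets are hypothetical names invented for illustration, and the struct only mirrors the hostname and port_map fields from the record above.

```go
package main

import "fmt"

// Tablet is a hypothetical, trimmed-down stand-in for the topology record
// shown above. It is not the real Vitess tablet type.
type Tablet struct {
	Alias    string
	Hostname string
	PortMap  map[string]int32
}

// isReachable reports whether the record has enough address information
// to attempt a connection. A shut-down tablet has an empty hostname and
// an empty port map, so it is rejected here.
func isReachable(t Tablet) bool {
	return t.Hostname != "" && len(t.PortMap) > 0
}

// probeTargets (hypothetical helper) keeps only the tablets worth probing,
// so that shut-down tablets cannot make the shard look unhealthy.
func probeTargets(tablets []Tablet) []Tablet {
	var targets []Tablet
	for _, t := range tablets {
		if isReachable(t) {
			targets = append(targets, t)
		}
	}
	return targets
}

func main() {
	tablets := []Tablet{
		// Shut-down replica: hostname and port map were cleared.
		{Alias: "zone1-0000000200"},
		// Running primary with a populated address.
		{Alias: "zone1-0000000201", Hostname: "localhost", PortMap: map[string]int32{"vt": 15201}},
	}
	for _, t := range probeTargets(tablets) {
		fmt.Println(t.Alias)
	}
}
```

With the two example records, only zone1-0000000201 is selected for probing. Whether to filter up front like this or to tolerate connection errors after the fact is a design choice left open here.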
Reproduction Steps
make build
cd examples/local
./101_initial_cluster.sh; ./201_customer_tablets.sh; ./202_move_tables.sh
vtctldclient UpdateThrottlerConfig --enable --throttle-app="all" --throttle-app-ratio 0 --throttle-app-duration 4h customer
primaryuid=$(vtctldclient GetTablets --keyspace customer --tablet-type primary --shard "0" | awk '{print $1}' | cut -d- -f2 | bc)
Observe that the shard is healthy:
$ vtctldclient GetThrottlerStatus zone1-0000000${primaryuid} | jq .status.metrics_health
{
"last_healthy_at": {
"seconds": "1742838360",
"nanoseconds": 529167126
},
"seconds_since_last_healthy": "0"
}
Then kill one of the tablets (tablet 200 in this example) and list the tablets:
$ vtctldclient --server localhost:15999 GetTablets
zone1-0000000100 commerce 0 replica localhost:15100 localhost:17100 [] <null>
zone1-0000000101 commerce 0 primary localhost:15101 localhost:17101 [] 2025-03-24T17:02:42Z
zone1-0000000102 commerce 0 rdonly localhost:15102 localhost:17102 [] <null>
zone1-0000000200 customer 0 replica :0 :17200 [] <null>
zone1-0000000201 customer 0 primary localhost:15201 localhost:17201 [] 2025-03-24T17:05:27Z
zone1-0000000202 customer 0 rdonly localhost:15202 localhost:17202 [] <null>
Note tablet 200 with the empty hostname and port. Now check the throttler again and observe that it is unhealthy:
$ vtctldclient --server localhost:15999 GetThrottlerStatus zone1-0000000${primaryuid} | jq .status.metrics_health.shard
{
"last_healthy_at": {
"seconds": "1742838425",
"nanoseconds": 778510514
},
"seconds_since_last_healthy": "162"
}
Binary Version
main
Operating System and Environment details
All