
Bug Report: VTTablet & VTGate routing traffic to hung MySQL server #11884

@latentflip

Description

Overview of the Issue

We had an issue today where:

  1. MySQL (5.7) crashed due to a corrupt memory error on the machine
  2. This triggered a core dump, which due to the large amount of memory that needed to be dumped took around 19 minutes
  3. During that time VTGate continued to route traffic to the VTTablet, and those queries hung indefinitely.
  4. Every minute, VTGate would report that it was transitioning the VTTablet from serving true => false, but it would then immediately transition it back from false => true
  5. As soon as the core dump was complete (19 minutes later) and MySQL restarted, the VTTablet spotted that replication lag was now high and took the host offline until it caught up.
  6. Once replication lag caught up, the host was moved back to serving, and things were all 👍

Cause

I believe @dm-2 and I have tracked this down to being caused by a few things:

  1. The code that checks the replication status of the MySQL server passes a context.TODO() to getPoolReconnect:
     conn, err := getPoolReconnect(context.TODO(), mysqld.dbaPool)
     and
     func getPoolReconnect(ctx context.Context, pool *dbconnpool.ConnectionPool) (*dbconnpool.PooledDBConnection, error) {
         conn, err := pool.Get(ctx)
         if err != nil {
             return conn, err
         }
         // Run a test query to see if this connection is still good.
         if _, err := conn.ExecuteFetch("SELECT 1", 1, false); err != nil {
    • As far as I can tell there are no cancellations or timeouts on this code. I believe that means conn.ExecuteFetch will block indefinitely if the MySQL host is in a bad state after the crash, while it's writing the core dump.
    • Because this code blocks, repltracker.Status() never completes, which prevents the vttablet's health state from updating (lag, err := sm.rt.Status()) and prevents the updated state from being broadcast back to the vtgate via streamHealth.
  2. On its own, this bug shouldn't be a huge deal, as the vtgate tablet_health_check code will detect that it hasn't seen an updated serving status from the vttablet within its timeout (in our case 1 minute) and will transition the vtgate's own view of the tablet to serving false.
    • This does indeed happen, and we see in the logs that we called closeConnection to transition the tablet to serving false here:
      func (thc *tabletHealthCheck) closeConnection(ctx context.Context, err error) {
          log.Warningf("tablet %v healthcheck stream error: %v", thc.Tablet, err)
          thc.setServingState(false, err.Error())
    • However, after a short delay we will retry and create a new StreamHealth connection to the vttablet. When the connection is made, vttablet immediately sends back the "current" health state of the vttablet:
      // Send the current state immediately.
      ch <- proto.Clone(hs.state).(*querypb.StreamHealthResponse)
      This state will still say that the tablet is healthy because, as we learned in (1), the update to unhealthy is blocked on the unresponsive MySQL server.
    • The vtgate then sees this "healthy" response, transitions its view of the vttablet back to serving true, and starts routing traffic to it again (a toy model of this flip-flop is sketched just after this list).
  3. Another side effect we spotted is that the Replication() call from (1) holds sm.mu.Lock() the whole time, which indefinitely blocks any other code that needs the lock, such as /debug/vars, which cannot read the current IsServing state because of the lock (see the second sketch after this list):
    // IsServing returns true if TabletServer is in SERVING state.
    func (sm *stateManager) IsServing() bool {
        sm.mu.Lock()
        defer sm.mu.Unlock()
        return sm.isServingLocked()
    }
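
To make the flip-flop in (2) concrete, here is a minimal, self-contained toy model (not the actual Vitess code; the types and function names are illustrative): the cached health snapshot is only refreshed by a goroutine that is blocked on MySQL, so every new StreamHealth subscriber immediately receives a stale "healthy" state and the vtgate flips straight back to serving true.

    package main

    import "fmt"

    // healthState is a stand-in for querypb.StreamHealthResponse.
    type healthState struct{ Serving bool }

    // cached is the last snapshot published by the health loop; in the real
    // vttablet this is hs.state, which is only refreshed once the replication
    // check completes.
    var cached = healthState{Serving: true}

    // refresher stands in for the loop driving repltracker.Status(); here it
    // blocks forever, like ExecuteFetch("SELECT 1") against a core-dumping mysqld.
    func refresher() {
        select {}
    }

    // streamHealth mimics "Send the current state immediately" on a new
    // StreamHealth connection: it replays the cached snapshot as-is.
    func streamHealth(ch chan<- healthState) {
        ch <- cached
    }

    func main() {
        go refresher()

        // vtgate side: it just timed out and set serving=false, then retries.
        serving := false
        ch := make(chan healthState, 1)
        streamHealth(ch)
        if st := <-ch; st.Serving {
            serving = true // flips straight back to true, as in the log fragments below
        }
        fmt.Println("vtgate's view of the tablet after reconnect, serving:", serving)
    }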
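
For (3), a common way to keep IsServing() and /debug/vars responsive is to avoid holding sm.mu across the blocking replication check. This is a hedged sketch with toy types (not the real stateManager or repltracker): snapshot what's needed under the lock, release it, do the blocking call, then re-acquire the lock to publish the result.

    package main

    import (
        "errors"
        "fmt"
        "sync"
        "time"
    )

    // replTracker stands in for the real repltracker; Status simulates a call
    // that eventually fails against an unresponsive mysqld.
    type replTracker struct{}

    func (replTracker) Status() (time.Duration, error) {
        time.Sleep(50 * time.Millisecond) // imagine 19 minutes of core dump here
        return 0, errors.New("mysql unreachable")
    }

    type stateManager struct {
        mu      sync.Mutex
        rt      replTracker
        serving bool
    }

    // IsServing needs the lock, so the lock must never be held across blocking I/O.
    func (sm *stateManager) IsServing() bool {
        sm.mu.Lock()
        defer sm.mu.Unlock()
        return sm.serving
    }

    // refresh snapshots what it needs under the lock, releases it, performs the
    // blocking call, then re-locks to publish the result.
    func (sm *stateManager) refresh() {
        sm.mu.Lock()
        rt := sm.rt
        sm.mu.Unlock()

        _, err := rt.Status() // blocking call runs with the lock released

        sm.mu.Lock()
        defer sm.mu.Unlock()
        sm.serving = err == nil
    }

    func main() {
        sm := &stateManager{serving: true}
        go sm.refresh()
        time.Sleep(10 * time.Millisecond)
        // IsServing does not block even while refresh is stuck in Status().
        fmt.Println("serving:", sm.IsServing())
    }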

Possible fixes:

It seems like the connection setup and query execution to MySQL in getPoolReconnect need a timeout to prevent them from blocking indefinitely, and the vttablet's state should be set to not serving if it cannot fetch the replication state from MySQL. I think that would resolve all the issues here?
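
As a rough illustration of the shape of that fix, here is a hedged sketch using the standard database/sql package rather than Vitess's dbconnpool (whose ExecuteFetch, as quoted above, takes no context); the 5-second timeout and the function name are placeholders, not the real API:

    package main

    import (
        "context"
        "database/sql"
        "fmt"
        "time"

        _ "github.com/go-sql-driver/mysql"
    )

    // checkReplicationConn is a stand-in for the getPoolReconnect + "SELECT 1"
    // probe: every step is bounded by a deadline, so a hung mysqld makes it
    // return an error instead of blocking the health loop forever.
    func checkReplicationConn(ctx context.Context, db *sql.DB) error {
        ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
        defer cancel()

        conn, err := db.Conn(ctx) // connection acquisition respects the deadline
        if err != nil {
            return err
        }
        defer conn.Close()

        var one int
        // The probe query itself also respects the deadline.
        return conn.QueryRowContext(ctx, "SELECT 1").Scan(&one)
    }

    func main() {
        db, err := sql.Open("mysql", "user:pass@tcp(127.0.0.1:3306)/")
        if err != nil {
            panic(err)
        }
        defer db.Close()

        if err := checkReplicationConn(context.Background(), db); err != nil {
            // In vttablet terms, this is where the state would be set to
            // not serving rather than silently hanging.
            fmt.Println("replication check failed:", err)
        }
    }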

Reproduction Steps

I don't know how to reliably reproduce this with clear steps. I'm guessing the problem would be visible if MySQL were crashed in a non-graceful way such that it didn't fully terminate connections (or whatever happens while it's writing a core dump).
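
One untested idea (an assumption on my part, we did not verify it): freezing mysqld with SIGSTOP keeps its TCP connections established but leaves queries unanswered, which is roughly what a server stuck writing a core dump looks like to vttablet. Something like:

    package main

    import "syscall"

    func main() {
        const mysqldPID = 12345 // replace with the actual mysqld PID
        // Freeze mysqld: existing connections stay open, but queries hang.
        if err := syscall.Kill(mysqldPID, syscall.SIGSTOP); err != nil {
            panic(err)
        }
        // ...observe vtgate/vttablet behaviour, then resume mysqld with:
        // syscall.Kill(mysqldPID, syscall.SIGCONT)
    }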

I think the analysis above covers it though.

Binary Version

Server version: 5.7.32-vitess
Version: 14.0.1 (Git revision 631084ae79181ba816ba2d98bee07c16d8b2f7b4 branch 'master') built on Mon Nov 21 16:30:24 UTC 2022 by root@bcaa51ae028b using go1.18.4 linux/amd64

Operating System and Environment details

I don't know this off-hand, but I'm not sure how pertinent it is given the details above.

Log Fragments

W1205 10:17:52.962960       1 tablet_health_check.go:335] tablet alias:{cell:"ash1" uid:171471394} hostname:"<crashed host>" port_map:{key:"grpc" value:15991} port_map:{key:"vt" value:15101} keyspace:"x_ks" shard:"c0-" key_range:{start:"\xc0"} type:REPLICA db_name_override:"x" mysql_hostname:"<crashed host>" mysql_port:3306 db_server_version:"0.0.0" default_conn_collation:45 healthcheck stream error: Code: CANCELED
I1205 10:17:52.963050       1 tablet_health_check.go:111] HealthCheckUpdate(Serving State): tablet: ash1-171471394 (<crashed host>) serving true => false for x_ks/c0- (REPLICA) reason: vttablet: rpc error: code = Canceled desc = context canceled
I1205 10:17:53.051535       1 tablet_health_check.go:111] HealthCheckUpdate(Serving State): tablet: ash1-171471394 (<crashed host>) serving false => true for x_ks/c0- (REPLICA) reason: healthCheck update

We then see the exact same pattern repeated one minute later, and every minute after that, until the core dump completes and the host restarts.
