Description
Overview of the Issue
We had an issue today where:
- MySQL (5.7) crashed due to a memory corruption error on the machine
- This triggered a core dump, which, due to the large amount of memory that needed to be dumped, took around 19 minutes
- During that time VTGate continued to route traffic to the VTTablet, and those queries hung indefinitely.
- Every minute the VTGate would report that it was transitioning the VTTablet from `serving true => false`, but then it would immediately transition it back from `false => true`
- As soon as the core-dump was complete (19 minutes later), and MySQL restarted, the VTTablet spotted that replication lag was now high and took the host offline until it caught up.
- Once replication lag caught up, the host was moved back to serving, and things were all 👍
Cause
I believe @dm-2 and I have tracked this down to being caused by a few things:
- The code that checks the replication status of the MySQL server passes a `context.TODO()` to `getPoolReconnect`:
  - vitess/go/vt/mysqlctl/replication.go, line 326 in a36de2c
  - vitess/go/vt/mysqlctl/query.go, lines 34 to 40 in a36de2c
  - As far as I can tell there are no cancellations or timeouts on this code. I believe that means the `conn.ExecuteFetch` will block indefinitely if the MySQL host is in a bad state after the crash while it's doing a core dump.
  - The fact that this code blocks then prevents `repltracker.Status()` from completing, which prevents the health state of the vttablet from updating (`lag, err := sm.rt.Status()`) and prevents the updated state being broadcast back to the vtgate via `streamHealth`. (A simplified sketch of this blocking pattern follows this list.)
- On its own, this bug shouldn't be a huge deal, as the vtgate `tablet_health_check` code will detect that it hasn't seen an updated serving status from the vttablet within its timeout (in our case 1 minute) and so will transition the vtgate's own state to `serving false`.
  - This does indeed happen, and we see in the logs that we called `closeConnection` to transition the tablet to `serving false` here:
    - vitess/go/vt/discovery/tablet_health_check.go, lines 334 to 336 in 7fafc94
  - However, after a short delay we retry, and we create a new `StreamHealth` connection to the vttablet to try again. When the initial connection is made, the vttablet immediately sends back the "current" health state of the vttablet (a simplified sketch of this reconnect behaviour also follows this list):
    - vitess/go/vt/vttablet/tabletserver/health_streamer.go, lines 211 to 212 in a36de2c
  - The vtgate then sees this "healthy" response, transitions its view of the vttablet back to `serving true`, and starts routing traffic to it again.
- Another side-effect we spotted is that the `Replication()` call from the first bullet above holds the `sm.mu.Lock()` the whole time, which indefinitely blocks other code that relies on taking the lock, like `/debug/vars`, which cannot get the current `IsServing` state because of the lock:
  - vitess/go/vt/vttablet/tabletserver/state_manager.go, lines 678 to 683 in 7fafc94
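To make the hang concrete, here's a minimal, self-contained Go sketch of the blocking pattern described above. This is not the actual Vitess code; `stateManager`, `fetchReplicationStatus`, and `IsServing` here are illustrative stand-ins for the real state manager, the `ExecuteFetch` done via `getPoolReconnect`, and the lock-protected state read by things like `/debug/vars`:

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// stateManager stands in for vttablet's state manager: one mutex guards
// both the periodic replication check and reads of the serving state.
type stateManager struct {
	mu        sync.Mutex
	isServing bool
}

// fetchReplicationStatus stands in for the ExecuteFetch done via
// getPoolReconnect. With context.TODO() there is no deadline, so if MySQL
// is wedged (e.g. busy writing a core dump) this never returns.
func fetchReplicationStatus(ctx context.Context, mysqlReplied chan struct{}) error {
	select {
	case <-mysqlReplied: // simulated: never happens while MySQL is hung
		return nil
	case <-ctx.Done(): // context.TODO() is never cancelled
		return ctx.Err()
	}
}

// checkReplication mirrors the problematic shape: the lock is held across
// a query that has no timeout, so the serving state is never updated.
func (sm *stateManager) checkReplication(mysqlReplied chan struct{}) {
	sm.mu.Lock()
	defer sm.mu.Unlock()
	_ = fetchReplicationStatus(context.TODO(), mysqlReplied) // blocks forever
	sm.isServing = false                                     // never reached
}

// IsServing is what a /debug/vars-style reader would call; it needs the
// same lock, so it hangs behind checkReplication.
func (sm *stateManager) IsServing() bool {
	sm.mu.Lock()
	defer sm.mu.Unlock()
	return sm.isServing
}

func main() {
	sm := &stateManager{isServing: true}
	mysqlReplied := make(chan struct{}) // never closed: MySQL is unresponsive

	go sm.checkReplication(mysqlReplied)
	time.Sleep(100 * time.Millisecond)

	got := make(chan bool, 1)
	go func() { got <- sm.IsServing() }()

	select {
	case v := <-got:
		fmt.Println("serving:", v)
	case <-time.After(time.Second):
		fmt.Println("IsServing() is stuck behind the never-ending replication check")
	}
}
```

Because the query has no deadline and the lock is held across it, both the health update and anything else that needs the lock wait forever.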
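Here is a similarly simplified sketch (again, not the real health_streamer.go code; the names are made up) of why the vtgate's retry undoes the `serving false` transition: a brand new `StreamHealth` subscriber immediately receives the streamer's last cached snapshot, which is still the stale "healthy" one because the replication check above never completed:

```go
package main

import (
	"fmt"
	"sync"
)

// healthState stands in for the snapshot vttablet streams to vtgates.
type healthState struct {
	Serving        bool
	ReplicationLag int // seconds
}

// healthStreamer caches the last broadcast state and replays it to every
// new subscriber as soon as the stream is established.
type healthStreamer struct {
	mu   sync.Mutex
	last healthState
	subs []chan healthState
}

// Subscribe mimics the behaviour at the heart of the flapping: a new
// stream immediately receives the cached snapshot, before any fresh update.
func (hs *healthStreamer) Subscribe() <-chan healthState {
	hs.mu.Lock()
	defer hs.mu.Unlock()
	ch := make(chan healthState, 1)
	ch <- hs.last
	hs.subs = append(hs.subs, ch)
	return ch
}

// Broadcast is what a completed replication check would trigger; in the
// incident it never ran, so `last` stayed at the pre-crash "healthy" value.
func (hs *healthStreamer) Broadcast(s healthState) {
	hs.mu.Lock()
	defer hs.mu.Unlock()
	hs.last = s
	for _, ch := range hs.subs {
		select {
		case ch <- s:
		default: // drop if the subscriber is slow
		}
	}
}

func main() {
	// Last state broadcast before MySQL crashed: healthy.
	hs := &healthStreamer{last: healthState{Serving: true, ReplicationLag: 0}}

	// The vtgate times out, closes its stream, then reconnects a moment later...
	ch := hs.Subscribe()
	fmt.Printf("vtgate sees on reconnect: %+v\n", <-ch) // Serving:true => routes traffic again
}
```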
Possible fixes:
It seems like the connection setup and query execution to MySQL in `getPoolReconnect` need a timeout to prevent them blocking indefinitely, and the vttablet's state should be set to not serving if it cannot fetch the replication state from MySQL. I think that would resolve all the issues here?
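As a rough sketch of the shape I have in mind (hypothetical helper names, not the real mysqlctl API; a real fix would probably also want deadlines on the MySQL connection itself), the replication-status query could be bounded by a context deadline so that a wedged MySQL surfaces as an error, which the state manager could then treat as not serving:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// runWithTimeout is a hypothetical helper, not an existing Vitess API: it
// bounds a blocking call with a deadline by running it in a goroutine and
// selecting on the context, so a wedged MySQL surfaces as an error instead
// of hanging the replication tracker. Note that the goroutine leaks if the
// call never returns; hence the suggestion to also set deadlines on the
// underlying MySQL connection.
func runWithTimeout(ctx context.Context, timeout time.Duration, call func() error) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	done := make(chan error, 1)
	go func() { done <- call() }()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		// The caller should treat this as "replication status unknown"
		// and transition the tablet to not serving.
		return fmt.Errorf("replication status check timed out: %w", ctx.Err())
	}
}

func main() {
	// Simulated ExecuteFetch against a MySQL that never answers.
	hungQuery := func() error {
		select {} // blocks forever, like the core-dumping host did
	}

	err := runWithTimeout(context.Background(), 2*time.Second, hungQuery)
	if errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("would mark the tablet not serving:", err)
	}
}
```

With something like this, the tablet would have gone not serving shortly after MySQL stopped responding, rather than flapping between serving states for the full 19 minutes.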
Reproduction Steps
I don't know how to reliably reproduce this with clear steps. I'm guessing the problem would be visible if we crashed mysql in a non-graceful way such that it didn't fully terminate connections (or whatever happens while it's doing a core dump).
I think the analysis above covers it though.
Binary Version
Server version: 5.7.32-vitess Version: 14.0.1 (Git revision 631084ae79181ba816ba2d98bee07c16d8b2f7b4 branch 'master') built on Mon Nov 21 16:30:24 UTC 2022 by root@bcaa51ae028b using go1.18.4 linux/amd64
Operating System and Environment details
I don't know this off-hand, but I'm not sure how pertinent it is given the details above.
Log Fragments
W1205 10:17:52.962960 1 tablet_health_check.go:335] tablet alias:{cell:"ash1" uid:171471394} hostname:"<crashed host>" port_map:{key:"grpc" value:15991} port_map:{key:"vt" value:15101} keyspace:"x_ks" shard:"c0-" key_range:{start:"\xc0"} type:REPLICA db_name_override:"x" mysql_hostname:"<crashed host>" mysql_port:3306 db_server_version:"0.0.0" default_conn_collation:45 healthcheck stream error: Code: CANCELED
I1205 10:17:52.963050 1 tablet_health_check.go:111] HealthCheckUpdate(Serving State): tablet: ash1-171471394 (<crashed host>) serving true => false for x_ks/c0- (REPLICA) reason: vttablet: rpc error: code = Canceled desc = context canceled
I1205 10:17:53.051535 1 tablet_health_check.go:111] HealthCheckUpdate(Serving State): tablet: ash1-171471394 (<crashed host>) serving false => true for x_ks/c0- (REPLICA) reason: healthCheck update
We then see the exact same pattern repeated one minute later, and every minute after that, until the core dump completes and the host restarts.