
Bug Report: VTTablet & VTGate routing traffic to hung MySQL server #11884

@latentflip

Description

Overview of the Issue

We had an issue today where:

  1. MySQL (5.7) crashed due to a corrupt memory error on the machine
  2. This triggered a core dump, which due to the large amount of memory that needed to be dumped took around 19 minutes
  3. During that time VTGate continued to route traffic to the VTTablet, and those queries hung indefinitely.
  4. Every minute, VTGate would report that it was transitioning the VTTablet from serving true => false, but it would then immediately transition it back from false => true
  5. As soon as the core dump was complete (19 minutes later) and MySQL restarted, the VTTablet spotted that replication lag was now high and took the host offline until it caught up.
  6. Once replication lag caught up, the host was moved back to serving, and things were all 👍

Cause

I believe @dm-2 and I have tracked this down to being caused by a few things:

  1. The code that checks the replication status of the MySQL server passes a context.TODO() to getPoolReconnect:
     conn, err := getPoolReconnect(context.TODO(), mysqld.dbaPool)
     and
     func getPoolReconnect(ctx context.Context, pool *dbconnpool.ConnectionPool) (*dbconnpool.PooledDBConnection, error) {
         conn, err := pool.Get(ctx)
         if err != nil {
             return conn, err
         }
         // Run a test query to see if this connection is still good.
         if _, err := conn.ExecuteFetch("SELECT 1", 1, false); err != nil {
    • As far as I can tell there are no cancellations or timeouts on this code. I believe that means conn.ExecuteFetch will block indefinitely if the MySQL host is in a bad state after the crash, while it's writing the core dump.
    • Because this code blocks, repltracker.Status() never completes, which prevents the vttablet's health state from updating (lag, err := sm.rt.Status()) and prevents the updated state from being broadcast back to the vtgate via streamHealth.
  2. On its own, this bug shouldn't be a huge deal, as the vtgate tablet_health_check code will detect that it hasn't seen an updated serving status from the vttablet within its timeout (in our case 1 minute) and will transition the vtgate's own view of the tablet to serving false.
    • This does indeed happen, and we see in the logs that we called closeConnection to transition the tablet to serving false here:
      func (thc *tabletHealthCheck) closeConnection(ctx context.Context, err error) {
          log.Warningf("tablet %v healthcheck stream error: %v", thc.Tablet, err)
          thc.setServingState(false, err.Error())
    • However, after a short delay we will retry and create a new StreamHealth connection to the vttablet. When the connection is made, vttablet immediately sends back the "current" health state of the vttablet:
      // Send the current state immediately.
      ch <- proto.Clone(hs.state).(*querypb.StreamHealthResponse)
      This state will still say that the tablet is healthy because, as we learned in (1), the update to unhealthy is blocked on the unresponsive MySQL server.
    • The vtgate then sees this "healthy" response, transitions its view of the vttablet back to serving true, and starts routing traffic to it again (a toy model of this flip-flop is sketched just after this list).
  3. Another side effect we spotted is that the Replication() call from (1) holds sm.mu.Lock() the whole time, which indefinitely blocks any other code that needs the lock, such as /debug/vars, which cannot read the current IsServing state because of the lock (see the second sketch after this list):
    // IsServing returns true if TabletServer is in SERVING state.
    func (sm *stateManager) IsServing() bool {
        sm.mu.Lock()
        defer sm.mu.Unlock()
        return sm.isServingLocked()
    }
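
To make the flip-flop in (2) concrete, here is a minimal, self-contained toy model (not the actual Vitess code; the types and function names are illustrative): the cached health snapshot is only refreshed by a goroutine that is blocked on MySQL, so every new StreamHealth subscriber immediately receives a stale "healthy" state and the vtgate flips straight back to serving true.

    package main

    import "fmt"

    // healthState is a stand-in for querypb.StreamHealthResponse.
    type healthState struct{ Serving bool }

    // cached is the last snapshot published by the health loop; in the real
    // vttablet this is hs.state, which is only refreshed once the replication
    // check completes.
    var cached = healthState{Serving: true}

    // refresher stands in for the loop driving repltracker.Status(); here it
    // blocks forever, like ExecuteFetch("SELECT 1") against a core-dumping mysqld.
    func refresher() {
        select {}
    }

    // streamHealth mimics "Send the current state immediately" on a new
    // StreamHealth connection: it replays the cached snapshot as-is.
    func streamHealth(ch chan<- healthState) {
        ch <- cached
    }

    func main() {
        go refresher()

        // vtgate side: it just timed out and set serving=false, then retries.
        serving := false
        ch := make(chan healthState, 1)
        streamHealth(ch)
        if st := <-ch; st.Serving {
            serving = true // flips straight back to true, as in the log fragments below
        }
        fmt.Println("vtgate's view of the tablet after reconnect, serving:", serving)
    }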
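
For (3), a common way to keep IsServing() and /debug/vars responsive is to avoid holding sm.mu across the blocking replication check. This is a hedged sketch with toy types (not the real stateManager or repltracker): snapshot what's needed under the lock, release it, do the blocking call, then re-acquire the lock to publish the result.

    package main

    import (
        "errors"
        "fmt"
        "sync"
        "time"
    )

    // replTracker stands in for the real repltracker; Status simulates a call
    // that eventually fails against an unresponsive mysqld.
    type replTracker struct{}

    func (replTracker) Status() (time.Duration, error) {
        time.Sleep(50 * time.Millisecond) // imagine 19 minutes of core dump here
        return 0, errors.New("mysql unreachable")
    }

    type stateManager struct {
        mu      sync.Mutex
        rt      replTracker
        serving bool
    }

    // IsServing needs the lock, so the lock must never be held across blocking I/O.
    func (sm *stateManager) IsServing() bool {
        sm.mu.Lock()
        defer sm.mu.Unlock()
        return sm.serving
    }

    // refresh snapshots what it needs under the lock, releases it, performs the
    // blocking call, then re-locks to publish the result.
    func (sm *stateManager) refresh() {
        sm.mu.Lock()
        rt := sm.rt
        sm.mu.Unlock()

        _, err := rt.Status() // blocking call runs with the lock released

        sm.mu.Lock()
        defer sm.mu.Unlock()
        sm.serving = err == nil
    }

    func main() {
        sm := &stateManager{serving: true}
        go sm.refresh()
        time.Sleep(10 * time.Millisecond)
        // IsServing does not block even while refresh is stuck in Status().
        fmt.Println("serving:", sm.IsServing())
    }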

Possible fixes:

It seems like the connection setup and query execution to MySQL in getPoolReconnect need a timeout to prevent them from blocking indefinitely, and the vttablet's state should be set to not serving if it cannot fetch the replication state from MySQL. I think that would resolve all the issues here?
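
As a rough illustration of the shape of that fix, here is a hedged sketch using the standard database/sql package rather than Vitess's dbconnpool (whose ExecuteFetch, as quoted above, takes no context); the 5-second timeout and the function name are placeholders, not the real API:

    package main

    import (
        "context"
        "database/sql"
        "fmt"
        "time"

        _ "github.com/go-sql-driver/mysql"
    )

    // checkReplicationConn is a stand-in for the getPoolReconnect + "SELECT 1"
    // probe: every step is bounded by a deadline, so a hung mysqld makes it
    // return an error instead of blocking the health loop forever.
    func checkReplicationConn(ctx context.Context, db *sql.DB) error {
        ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
        defer cancel()

        conn, err := db.Conn(ctx) // connection acquisition respects the deadline
        if err != nil {
            return err
        }
        defer conn.Close()

        var one int
        // The probe query itself also respects the deadline.
        return conn.QueryRowContext(ctx, "SELECT 1").Scan(&one)
    }

    func main() {
        db, err := sql.Open("mysql", "user:pass@tcp(127.0.0.1:3306)/")
        if err != nil {
            panic(err)
        }
        defer db.Close()

        if err := checkReplicationConn(context.Background(), db); err != nil {
            // In vttablet terms, this is where the state would be set to
            // not serving rather than silently hanging.
            fmt.Println("replication check failed:", err)
        }
    }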

Reproduction Steps

I don't know how to reliably reproduce this with clear steps. I'm guessing the problem would be visible if MySQL were crashed in a non-graceful way such that it didn't fully terminate connections (or whatever happens while it's writing a core dump).
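
One untested idea (an assumption on my part, we did not verify it): freezing mysqld with SIGSTOP keeps its TCP connections established but leaves queries unanswered, which is roughly what a server stuck writing a core dump looks like to vttablet. Something like:

    package main

    import "syscall"

    func main() {
        const mysqldPID = 12345 // replace with the actual mysqld PID
        // Freeze mysqld: existing connections stay open, but queries hang.
        if err := syscall.Kill(mysqldPID, syscall.SIGSTOP); err != nil {
            panic(err)
        }
        // ...observe vtgate/vttablet behaviour, then resume mysqld with:
        // syscall.Kill(mysqldPID, syscall.SIGCONT)
    }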

I think the analysis above covers it though.

Binary Version

Server version: 5.7.32-vitess
Version: 14.0.1 (Git revision 631084ae79181ba816ba2d98bee07c16d8b2f7b4 branch 'master') built on Mon Nov 21 16:30:24 UTC 2022 by root@bcaa51ae028b using go1.18.4 linux/amd64

Operating System and Environment details

I don't know this off-hand, but I'm not sure how pertinent it is given the details above.

Log Fragments

W1205 10:17:52.962960       1 tablet_health_check.go:335] tablet alias:{cell:"ash1" uid:171471394} hostname:"<crashed host>" port_map:{key:"grpc" value:15991} port_map:{key:"vt" value:15101} keyspace:"x_ks" shard:"c0-" key_range:{start:"\xc0"} type:REPLICA db_name_override:"x" mysql_hostname:"<crashed host>" mysql_port:3306 db_server_version:"0.0.0" default_conn_collation:45 healthcheck stream error: Code: CANCELED
I1205 10:17:52.963050       1 tablet_health_check.go:111] HealthCheckUpdate(Serving State): tablet: ash1-171471394 (<crashed host>) serving true => false for x_ks/c0- (REPLICA) reason: vttablet: rpc error: code = Canceled desc = context canceled
I1205 10:17:53.051535       1 tablet_health_check.go:111] HealthCheckUpdate(Serving State): tablet: ash1-171471394 (<crashed host>) serving false => true for x_ks/c0- (REPLICA) reason: healthCheck update

We then see the exact same pattern repeated one minute later, and every minute after that, until the core dump completes and the host restarts.
