Summary
Recently we started a new metasrv cluster and migrated some data into it. Unluckily, the databend-query instance hangs when accessing one metasrv instance, like this:
2023-01-29T04:22:07.538495Z WARN common_meta_client::grpc_client: MetaGrpcClient slow request PrefixList to meta-service-2.meta-service.my-system.svc:9191 takes 60002 ms: PrefixList(ListKVReq { prefix: "__fd_clusters/default/default/databend_query" })
However, we found that all of the timeout logs point to a single metasrv instance.
After we rebuilt this instance, the databend-query cluster recovered.
The logs show that the other meta instances are healthy, but databend-query always accesses the bad instance.
Is there something wrong with the load balancing? Maybe we could add health checks for the metasrv endpoints on the client side.
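
To make the suggestion concrete, here is a rough sketch of what client-side endpoint health tracking might look like. It is only an illustration: `EndpointPool`, `pick`, `report_failure`, `report_success`, and the cooldown value are hypothetical names that I made up, not the existing `MetaGrpcClient` API, and I am assuming that a slow or timed-out request is a good enough signal to mark an endpoint as unhealthy.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical client-side endpoint pool (not the real MetaGrpcClient API).
/// It rotates over the configured endpoints and skips any endpoint that
/// recently failed until its cooldown has expired.
pub struct EndpointPool {
    endpoints: Vec<String>,
    /// Unhealthy endpoints, mapped to the instant they may be retried.
    quarantined: HashMap<String, Instant>,
    cooldown: Duration,
    /// Index of the endpoint to try first on the next pick.
    next: usize,
}

impl EndpointPool {
    pub fn new(endpoints: Vec<String>, cooldown: Duration) -> Self {
        EndpointPool {
            endpoints,
            quarantined: HashMap::new(),
            cooldown,
            next: 0,
        }
    }

    /// Round-robin pick that skips endpoints still in quarantine at `now`.
    /// Returns None if no endpoint is configured or all are quarantined.
    pub fn pick(&mut self, now: Instant) -> Option<String> {
        let n = self.endpoints.len();
        for i in 0..n {
            let idx = (self.next + i) % n;
            let ep = &self.endpoints[idx];
            let healthy = match self.quarantined.get(ep) {
                Some(retry_at) => now >= *retry_at,
                None => true,
            };
            if healthy {
                self.next = (idx + 1) % n;
                return Some(ep.clone());
            }
        }
        None
    }

    /// Call after a timeout or error: keep the endpoint out of rotation
    /// for one cooldown period.
    pub fn report_failure(&mut self, endpoint: &str, now: Instant) {
        self.quarantined
            .insert(endpoint.to_string(), now + self.cooldown);
    }

    /// Call after a successful request: clear any quarantine entry.
    pub fn report_success(&mut self, endpoint: &str) {
        self.quarantined.remove(endpoint);
    }
}
```

With something like this, the client would call `report_failure` when a request such as the `PrefixList` above exceeds its timeout, and the following `pick` would move on to another meta instance instead of retrying the bad one. Passing `now` in explicitly is just a choice to keep the sketch easy to test.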