Summary
Recently we started a new metasrv cluster and migrated some data into it. Unluckily, the databend-query instance hangs when accessing one metasrv instance, like this:
2023-01-29T04:22:07.538495Z WARN common_meta_client::grpc_client: MetaGrpcClient slow request PrefixList to meta-service-2.meta-service.my-system.svc:9191 takes 60002 ms: PrefixList(ListKVReq { prefix: "__fd_clusters/default/default/databend_query" })
However, we found that all of the timeout logs point to the same metasrv instance.
After we rebuilt this instance, the databend-query cluster recovered.
The logs show that the other meta instances are healthy, but databend-query keeps accessing the bad instance.
Is there something wrong with the load balancing? Maybe the client could run a health check on the metasrv endpoints and skip the ones that are not responding.
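To make the suggestion concrete, here is a minimal sketch of what I mean by a client-side health check. It is not the actual `MetaGrpcClient` code: `EndpointPool`, `pick`, `report_failure`, `report_success`, and the cooldown value are all hypothetical names I made up for illustration. The idea is just round-robin over the endpoints while skipping any endpoint that recently timed out, until a cooldown expires.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical endpoint picker (an assumption, not Databend's real client API).
pub struct EndpointPool {
    endpoints: Vec<String>,
    /// Endpoints that recently failed, mapped to the time they may be retried.
    quarantined_until: HashMap<String, Instant>,
    /// How long a failed endpoint is skipped before it is tried again.
    cooldown: Duration,
    /// Index of the next round-robin candidate.
    next: usize,
}

impl EndpointPool {
    pub fn new(endpoints: Vec<String>, cooldown: Duration) -> Self {
        EndpointPool {
            endpoints,
            quarantined_until: HashMap::new(),
            cooldown,
            next: 0,
        }
    }

    /// Round-robin over the endpoints, skipping those still in quarantine.
    /// Returns `None` if there are no endpoints or all are quarantined.
    pub fn pick(&mut self, now: Instant) -> Option<String> {
        let n = self.endpoints.len();
        for _ in 0..n {
            let candidate = self.endpoints[self.next].clone();
            self.next = (self.next + 1) % n;
            let healthy = match self.quarantined_until.get(&candidate) {
                Some(until) => now >= *until,
                None => true,
            };
            if healthy {
                return Some(candidate);
            }
        }
        None
    }

    /// Call this when a request to `endpoint` times out or fails.
    pub fn report_failure(&mut self, endpoint: &str, now: Instant) {
        self.quarantined_until
            .insert(endpoint.to_string(), now + self.cooldown);
    }

    /// Call this when a request to `endpoint` succeeds.
    pub fn report_success(&mut self, endpoint: &str) {
        self.quarantined_until.remove(endpoint);
    }
}
```

With something like this, a 60 s timeout on one instance would call `report_failure` for it, and the following requests would go to the other healthy instances instead of hanging on the same one. What to do when every endpoint is quarantined (here `pick` just returns `None`) is left open.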