Bug: Load balancing does not seem to work when accessing a bad metasrv instance

**Summary**

Recently we started a new metasrv cluster and migrated some data into it. Unluckily, the databend-query instances hung when accessing one metasrv instance, logging warnings like this:

```
  2023-01-29T04:22:07.538495Z  WARN common_meta_client::grpc_client: MetaGrpcClient slow request PrefixList to meta-service-2.meta-service.my-system.svc:9191 takes 60002 ms: PrefixList(ListKVReq { prefix: "__fd_clusters/default/default/databend_query" })
```

However, we found that all of the timeout logs refer to a single metasrv instance.

After we rebuilt this instance, the databend-query cluster recovered.

The logs show that the other meta instances are healthy, but databend-query keeps accessing the bad instance.

Is there something wrong with the load balancing? Maybe we could add a health check for the metasrv endpoints on the client side.
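
To make the suggestion concrete, here is a minimal sketch of what a client-side health check could look like. It is not the actual `MetaGrpcClient` code; `EndpointPool`, its methods, and the cooldown value are all hypothetical names chosen for illustration. The idea is that an endpoint that times out is skipped for a cooldown period, so requests rotate across the remaining healthy endpoints instead of returning to the bad one.

```rust
use std::time::{Duration, Instant};

/// Hypothetical client-side endpoint pool (an assumption for illustration,
/// not the real databend meta client).
pub struct EndpointPool {
    endpoints: Vec<String>,
    /// Per endpoint: the instant until which it is skipped (None = healthy).
    quarantined_until: Vec<Option<Instant>>,
    /// How long a failed endpoint is skipped before it is retried.
    cooldown: Duration,
    /// Index where the next round-robin scan starts.
    next: usize,
}

impl EndpointPool {
    pub fn new(endpoints: Vec<String>, cooldown: Duration) -> Self {
        let n = endpoints.len();
        EndpointPool {
            endpoints,
            quarantined_until: vec![None; n],
            cooldown,
            next: 0,
        }
    }

    /// Round-robin over the endpoints, skipping any that are still
    /// quarantined at `now`. Returns None if no endpoint is usable.
    pub fn pick(&mut self, now: Instant) -> Option<usize> {
        let n = self.endpoints.len();
        for offset in 0..n {
            let i = (self.next + offset) % n;
            let usable = match self.quarantined_until[i] {
                Some(until) => now >= until,
                None => true,
            };
            if usable {
                self.next = (i + 1) % n;
                return Some(i);
            }
        }
        None
    }

    /// Call when a request to endpoint `index` times out or fails.
    pub fn report_failure(&mut self, index: usize, now: Instant) {
        self.quarantined_until[index] = Some(now + self.cooldown);
    }

    /// Call when a request to endpoint `index` succeeds.
    pub fn report_success(&mut self, index: usize) {
        self.quarantined_until[index] = None;
    }

    /// Address of the endpoint at `index`.
    pub fn name(&self, index: usize) -> &str {
        &self.endpoints[index]
    }
}
```

With something like this, the client would call `pick` before each request, `report_failure` when a request such as the `PrefixList` above exceeds its timeout, and `report_success` otherwise. A quarantined endpoint is retried only after the cooldown expires, so a single bad instance cannot absorb every request.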

Issue #9761
