Bug: Load balance seems not work on accessing a bad metasrv instance #9761

Open
@flaneur2020

Summary

Recently we started a new metasrv cluster and migrated some data into it. Unluckily, the databend-query instances hung while accessing one metasrv instance, logging warnings like this:

  2023-01-29T04:22:07.538495Z  WARN common_meta_client::grpc_client: MetaGrpcClient slow request PrefixList to meta-service-2.meta-service.my-system.svc:9191 takes 60002 ms: PrefixList(ListKVReq { prefix: "__fd_clusters/default/default/databend_query" })

However, we found that all of the timeout logs point to a single metasrv instance.

After we rebuilt this instance, the databend-query cluster recovered.

The logs show that the other metasrv instances were healthy, but databend-query kept accessing the bad instance.

Is there something wrong with the load balancing? Maybe we could add a health check for the metasrv endpoints on the client side.
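
To make the suggestion concrete, here is a minimal sketch of what a client-side check could look like: round-robin selection that skips an endpoint for a cooldown period after a failure or timeout is reported against it. This is not the actual `MetaGrpcClient` code; `EndpointPool`, `pick`, `report_failure`, `report_success`, and the cooldown value are all hypothetical names and parameters used only for illustration.

```rust
use std::time::{Duration, Instant};

/// Hypothetical endpoint pool (an assumption, not the real databend client API).
pub struct EndpointPool {
    endpoints: Vec<String>,
    /// For each endpoint, the instant until which it is treated as unhealthy.
    quarantined_until: Vec<Option<Instant>>,
    /// Index where the next round-robin scan starts.
    next: usize,
    /// How long a failed endpoint is skipped before it is tried again.
    cooldown: Duration,
}

impl EndpointPool {
    pub fn new(endpoints: Vec<String>, cooldown: Duration) -> Self {
        let n = endpoints.len();
        Self {
            endpoints,
            quarantined_until: vec![None; n],
            next: 0,
            cooldown,
        }
    }

    /// Pick the next endpoint in round-robin order, skipping quarantined ones.
    /// If every endpoint is quarantined, fall back to plain round-robin so the
    /// client still tries something.
    pub fn pick(&mut self, now: Instant) -> Option<&str> {
        let n = self.endpoints.len();
        if n == 0 {
            return None;
        }
        let start = self.next;
        let mut chosen = start; // fallback when nothing is healthy
        for offset in 0..n {
            let i = (start + offset) % n;
            let healthy = match self.quarantined_until[i] {
                Some(until) => now >= until,
                None => true,
            };
            if healthy {
                chosen = i;
                break;
            }
        }
        self.next = (chosen + 1) % n;
        Some(self.endpoints[chosen].as_str())
    }

    /// Record a failed or timed-out request so the endpoint is skipped for a while.
    pub fn report_failure(&mut self, endpoint: &str, now: Instant) {
        if let Some(i) = self.endpoints.iter().position(|e| e.as_str() == endpoint) {
            self.quarantined_until[i] = Some(now + self.cooldown);
        }
    }

    /// Record a successful request and clear any quarantine on the endpoint.
    pub fn report_success(&mut self, endpoint: &str) {
        if let Some(i) = self.endpoints.iter().position(|e| e.as_str() == endpoint) {
            self.quarantined_until[i] = None;
        }
    }
}
```

With something like this, a request that times out against `meta-service-2` would call `report_failure`, and subsequent `pick` calls would rotate among the remaining instances until the cooldown expires. The request timeout itself would probably also need to be much shorter than the 60 s seen in the log for failover to be useful.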

Labels

A-meta (Area: databend meta service)
