Description
What is the bug?
Since we upgraded to 2.15.0, losing an ingester causes a large increase in distributor latency. Removing the ingester from the ring brings latency back to normal.
How to reproduce it?
On a distributed Mimir cluster with:
- multiple instances of ingesters
- distributors handling significant traffic
- ingester.ring.zone_awareness_enabled=true
- ingester.ring.replication_factor=3
- ingester.ring.unregister_on_shutdown=false
shut down an ingester
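
For reference, the three ring settings listed above would look roughly like this in a Mimir YAML configuration file. This is a minimal sketch that assumes the YAML keys mirror the dotted option names; it is not our full configuration, and every other setting is left at its default.

```yaml
# Sketch of the ingester ring settings from the reproduction steps.
# Only these three options come from the report; nothing else is implied.
ingester:
  ring:
    zone_awareness_enabled: true    # replicas are spread across zones
    replication_factor: 3           # each series is written to 3 ingesters
    unregister_on_shutdown: false   # a stopped ingester stays in the ring
```

With `unregister_on_shutdown: false`, a stopped ingester remains in the ring until it is removed manually, which matches the observation that removing it from the ring restores normal latency.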
What did you think would happen?
Shutting down an ingester shouldn't increase latency, as was the case in previous releases.
What was your environment?
Debian VMs with Mimir 2.15.0 deployed from the APT package. 6 distributors and 12 ingesters, spread over 3 geographical zones.
Any additional context to share?
We found no configuration difference between a cluster running 2.15 and a cluster running 2.14.3 in the ingester, distributor, or ingester_client sections.
When the problem occurs, traces show that distributor requests take much longer than usual.
Some metrics when shutting down an ingester
Trace of a distributor request to ingesters when latency is healthy
Trace of a distributor request to ingesters when latency is degraded