
Bug: querier sends requests to wrong ingesters #13823

@dimitarvdimitrov

Description


What is the bug?

Summary

During a period of elevated latency in zone-B, the ruler-queriers began reporting large numbers of strong-consistency enforcement failures, causing widespread rule evaluation failures. Only one zone exhibited real latency issues at a time, yet both zones were intermittently reported as failing, and some ingesters were flagged as failing even though no gRPC calls were made to them.

Observed ingest receive delays (p100) were well within the 1-minute strong consistency window, suggesting that the failures were not caused by genuine lag but rather by a bug in how failingInstanceID is tracked, propagated, or aggregated during strong-consistency reads.


How to reproduce it?

Symptoms

  • Strong-consistency enforcement failures on ingesters (initially zone-A, later zone-B).
  • Dropped rule evaluations, with errors including:
    • "enforce read consistency max delay: partition reader is lagging more than the allowed max delay"
  • False failures reported from zone-A during exactly 10:58:24–11:03:24, observed only by ruler-queriers.
  • No matching Internal gRPC errors in ingester logs for the failingInstanceIDs reported.
  • Traces showing failingInstanceIDs for ingesters that had no child spans, and missing entries for ingesters that did have RPC failures.
  • Recovery occurred when zone-B’s latency recovered.

Suspected Problem

Evidence strongly suggests a bug in the querier-to-ingester error attribution logic.

Signs indicating incorrect attribution:

  • Ruler-queriers report failures for ingesters they never sent RPCs to.
  • Some zone-A ingesters are reported as failing despite no Internal errors.
  • The number of failingInstanceIDs does not match the number of failed RPC spans.
  • p100 receive delays in zone-A stayed under ~16s, well below the configured 1m consistency max delay, yet failures occurred.

This suggests the failure set is being built incorrectly or corrupted, not derived from real ingester behavior.
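
The kind of construction bug this points at can be illustrated with a minimal, purely hypothetical Go sketch. This is not Mimir's actual code; the instance names and the "issue requests against one slice, record failures against another" pattern are assumptions used only to show how a failure set can end up naming ingesters that never received an RPC.

package main

import (
	"errors"
	"fmt"
)

func main() {
	// Hypothetical zone-sorted view of the ring.
	all := []string{"ingester-zone-a-8", "ingester-zone-a-118", "ingester-zone-b-8", "ingester-zone-b-118"}

	// The subset actually queried for this request (filtered and re-ordered).
	targets := []string{"ingester-zone-b-118", "ingester-zone-b-8"}

	var failing []string
	for i, inst := range targets {
		if err := doRequest(inst); err != nil {
			// BUG: the index into `targets` is applied to `all`, so the
			// failure is pinned on an ingester that never got an RPC.
			failing = append(failing, all[i])
		}
	}
	fmt.Println("reported failingInstanceIDs:", failing) // [ingester-zone-a-8] -- wrong zone entirely
}

// doRequest stands in for the querier's gRPC call; here only zone-b-118 fails.
func doRequest(id string) error {
	if id == "ingester-zone-b-118" {
		return errors.New("Internal error")
	}
	return nil
}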

What did you think would happen?

No consistency failures.

What was your environment?

Kubernetes

Any additional context to share?

Investigation Steps

Ingest delay vs consistency thresholds

  • p100 ingest read delays remained under ~16 seconds, far below the configured 'ruler.evaluation-consistency-max-delay': '1m'.
  • Under these conditions, strong-consistency checks should not have failed, indicating the failures were likely not caused by real lag (a rough sketch of the check follows this list).
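
For reference, the enforcement being triggered is roughly of this shape. The sketch below is an assumption about the general mechanism (a partition reader that tracks when it last caught up), not the actual Mimir implementation; the function and variable names are hypothetical.

package main

import (
	"errors"
	"fmt"
	"time"
)

var errLagging = errors.New("enforce read consistency max delay: partition reader is lagging more than the allowed max delay")

// enforceReadConsistency rejects the query only when the time since the
// reader last caught up with its partition exceeds maxDelay (e.g. 1m).
func enforceReadConsistency(lastCaughtUp time.Time, maxDelay time.Duration, now time.Time) error {
	if now.Sub(lastCaughtUp) > maxDelay {
		return errLagging
	}
	return nil
}

func main() {
	now := time.Now()
	// With p100 ingest delays around ~16s and a 1m max delay, this check
	// passes, which is why real lag does not explain the observed failures.
	fmt.Println(enforceReadConsistency(now.Add(-16*time.Second), time.Minute, now)) // <nil>
	fmt.Println(enforceReadConsistency(now.Add(-90*time.Second), time.Minute, now)) // the lagging error
}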

Outage timeline tied strictly to zone-B

  • Failures began only after zone-B latency increased at 10:58 and ended when zone-B recovered.
  • False zone-A failures appeared only in ruler-queriers, further indicating mis-attribution.

Ruler-querier vs ingester error mismatch

  • Ruler-queriers logged Internal errors referencing various failingInstanceIDs.
  • Ingester logs showed no matching Internal errors, indicating these failures were not actually emitted by those ingesters.

Trace inconsistency: failingInstanceIDs without child spans

(GL internal link: 025cec28cc4dcb5c0007c4adc65a2c29)

  • Many traces contain failingInstanceIDs for ingesters that had no spans at all, while other ingesters with actual RPC failures were inconsistently represented.
  • Tempo truncation hid some failures, but even visible mismatches indicate incorrect construction of the failure set.

For example, in span a849c51b0be4b77f (same trace), the message “request to instance has failed, zone cannot contribute to quorum” appears for the following ingesters:

Details

  • ingester-zone-b-178
  • ingester-zone-b-280
  • ingester-zone-b-207
  • ingester-zone-b-191
  • ingester-zone-b-55
  • ingester-zone-b-50
  • ingester-zone-b-118
  • ingester-zone-b-39
  • ingester-zone-b-273
  • ingester-zone-b-161
  • ingester-zone-b-8
  • ingester-zone-a-191
  • ingester-zone-a-8
  • ingester-zone-a-118

At the same time, the ingesters with errored gRPC spans (by k8s.pod.name) are the following (a quick diff of the two sets is sketched after this list):

Details

  • ingester-zone-b-217
  • ingester-zone-b-168
  • ingester-zone-b-239
  • ingester-zone-b-273
  • ingester-zone-b-39
  • ingester-zone-b-280
  • ingester-zone-b-124
  • ingester-zone-b-207
  • ingester-zone-b-55
  • ingester-zone-b-253
  • ingester-zone-b-47
  • ingester-zone-b-161
  • ingester-zone-b-178
  • ingester-zone-b-254
  • ingester-zone-b-50
  • ingester-zone-b-8
  • ingester-zone-b-118
  • ingester-zone-b-158
  • ingester-zone-b-191
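
To make the mismatch explicit, the two sets can be diffed directly. The snippet below is only a comparison aid, not part of Mimir; the names are copied from the two lists above, and the result is subject to the Tempo truncation noted earlier.

package main

import (
	"fmt"
	"sort"
)

// difference returns the elements of a that are not present in b, sorted.
func difference(a, b []string) []string {
	inB := make(map[string]struct{}, len(b))
	for _, s := range b {
		inB[s] = struct{}{}
	}
	var out []string
	for _, s := range a {
		if _, ok := inB[s]; !ok {
			out = append(out, s)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// failingInstanceID attributes on span a849c51b0be4b77f.
	flagged := []string{
		"ingester-zone-b-178", "ingester-zone-b-280", "ingester-zone-b-207", "ingester-zone-b-191",
		"ingester-zone-b-55", "ingester-zone-b-50", "ingester-zone-b-118", "ingester-zone-b-39",
		"ingester-zone-b-273", "ingester-zone-b-161", "ingester-zone-b-8",
		"ingester-zone-a-191", "ingester-zone-a-8", "ingester-zone-a-118",
	}
	// k8s.pod.name of ingesters with errored gRPC spans in the same trace.
	errored := []string{
		"ingester-zone-b-217", "ingester-zone-b-168", "ingester-zone-b-239", "ingester-zone-b-273",
		"ingester-zone-b-39", "ingester-zone-b-280", "ingester-zone-b-124", "ingester-zone-b-207",
		"ingester-zone-b-55", "ingester-zone-b-253", "ingester-zone-b-47", "ingester-zone-b-161",
		"ingester-zone-b-178", "ingester-zone-b-254", "ingester-zone-b-50", "ingester-zone-b-8",
		"ingester-zone-b-118", "ingester-zone-b-158", "ingester-zone-b-191",
	}

	// Flagged as failing but with no errored span: the three zone-a ingesters.
	fmt.Println("flagged, no errored span:", difference(flagged, errored))
	// Errored spans that were never flagged (modulo Tempo truncation).
	fmt.Println("errored span, not flagged:", difference(errored, flagged))
}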


Ruled out network/IP churn

  • Checked pod IP histories for ingesters and distributors around the outage.
  • No IP changes occurred, ruling out routing instability.
