
Bug: querier sends requests to wrong ingesters #13823

@dimitarvdimitrov

Description


What is the bug?

Summary

During a period of elevated latency in zone-B, the ruler-queriers began reporting large numbers of strong-consistency enforcement failures, causing widespread rule evaluation failures. Only one zone exhibited real latency issues at a time, yet both zones were intermittently reported as failing, and some ingesters were flagged as failing even though no gRPC calls were made to them.

Observed ingest receive delays (p100) were well within the 1-minute strong consistency window, suggesting that the failures were not caused by genuine lag but rather by a bug in how failingInstanceID is tracked, propagated, or aggregated during strong-consistency reads.


How to reproduce it?

Symptoms

  • Strong-consistency enforcement failures on ingesters (initially zone-A, later zone-B).
  • Dropped rule evaluations, with errors including:
    • "enforce read consistency max delay: partition reader is lagging more than the allowed max delay"
  • False failures reported from zone-A during exactly 10:58:24–11:03:24, observed only by ruler-queriers.
  • No matching Internal gRPC errors in ingester logs for the failingInstanceIDs reported.
  • Traces showing failingInstanceIDs for ingesters that had no child spans, and missing entries for ingesters that did have RPC failures.
  • Recovery occurred when zone-B’s latency recovered.

Suspected Problem

Evidence strongly suggests a bug in the querier-to-ingester error attribution logic.

Signs indicating incorrect attribution:

  • Ruler-queriers report failures for ingesters they never sent RPCs to.
  • Some zone-A ingesters are reported as failing despite no Internal errors.
  • The number of failingInstanceIDs does not match the number of failed RPC spans.
  • p100 receive delays in zone-A stayed under ~16s, well below the configured 1m consistency max delay, yet failures occurred.

This suggests the failure set is being built incorrectly or corrupted, not derived from real ingester behavior.
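
The kind of construction bug this points at can be illustrated with a minimal, purely hypothetical Go sketch. This is not Mimir's actual code; the instance names and the "issue requests against one slice, record failures against another" pattern are assumptions used only to show how a failure set can end up naming ingesters that never received an RPC.

package main

import (
	"errors"
	"fmt"
)

func main() {
	// Hypothetical zone-sorted view of the ring.
	all := []string{"ingester-zone-a-8", "ingester-zone-a-118", "ingester-zone-b-8", "ingester-zone-b-118"}

	// The subset actually queried for this request (filtered and re-ordered).
	targets := []string{"ingester-zone-b-118", "ingester-zone-b-8"}

	var failing []string
	for i, inst := range targets {
		if err := doRequest(inst); err != nil {
			// BUG: the index into `targets` is applied to `all`, so the
			// failure is pinned on an ingester that never got an RPC.
			failing = append(failing, all[i])
		}
	}
	fmt.Println("reported failingInstanceIDs:", failing) // [ingester-zone-a-8] -- wrong zone entirely
}

// doRequest stands in for the querier's gRPC call; here only zone-b-118 fails.
func doRequest(id string) error {
	if id == "ingester-zone-b-118" {
		return errors.New("Internal error")
	}
	return nil
}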

What did you think would happen?

No consistency failures.

What was your environment?

Kubernetes

Any additional context to share?

Investigation Steps

Ingest delay vs consistency thresholds

  • p100 ingest read delays remained under ~16 seconds, far below the configured 'ruler.evaluation-consistency-max-delay': '1m'.
  • Under these conditions, strong-consistency checks should not have failed, indicating the failures were likely not caused by real lag (a rough sketch of the check follows this list).
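
For reference, the enforcement being triggered is roughly of this shape. The sketch below is an assumption about the general mechanism (a partition reader that tracks when it last caught up), not the actual Mimir implementation; the function and variable names are hypothetical.

package main

import (
	"errors"
	"fmt"
	"time"
)

var errLagging = errors.New("enforce read consistency max delay: partition reader is lagging more than the allowed max delay")

// enforceReadConsistency rejects the query only when the time since the
// reader last caught up with its partition exceeds maxDelay (e.g. 1m).
func enforceReadConsistency(lastCaughtUp time.Time, maxDelay time.Duration, now time.Time) error {
	if now.Sub(lastCaughtUp) > maxDelay {
		return errLagging
	}
	return nil
}

func main() {
	now := time.Now()
	// With p100 ingest delays around ~16s and a 1m max delay, this check
	// passes, which is why real lag does not explain the observed failures.
	fmt.Println(enforceReadConsistency(now.Add(-16*time.Second), time.Minute, now)) // <nil>
	fmt.Println(enforceReadConsistency(now.Add(-90*time.Second), time.Minute, now)) // the lagging error
}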

Outage timeline tied strictly to zone-B

  • Failures began only after zone-B latency increased at 10:58 and ended when zone-B recovered.
  • False zone-A failures appeared only in ruler-queriers, further indicating mis-attribution.

Ruler-querier vs ingester error mismatch

  • Ruler-queriers logged Internal errors referencing various failingInstanceIDs.
  • Ingester logs showed no matching Internal errors, indicating these failures were not actually emitted by those ingesters.

Trace inconsistency: failingInstanceIDs without child spans

(GL internal link: 025cec28cc4dcb5c0007c4adc65a2c29)

  • Many traces contain failingInstanceIDs for ingesters that had no spans at all, while other ingesters with actual RPC failures were inconsistently represented.
  • Tempo truncation hid some failures, but even visible mismatches indicate incorrect construction of the failure set.

For example, in span a849c51b0be4b77f (same trace), the message “request to instance has failed, zone cannot contribute to quorum” appears for the following ingesters:

Details

  • ingester-zone-b-178
  • ingester-zone-b-280
  • ingester-zone-b-207
  • ingester-zone-b-191
  • ingester-zone-b-55
  • ingester-zone-b-50
  • ingester-zone-b-118
  • ingester-zone-b-39
  • ingester-zone-b-273
  • ingester-zone-b-161
  • ingester-zone-b-8
  • ingester-zone-a-191
  • ingester-zone-a-8
  • ingester-zone-a-118

At the same time, the ingesters with errored gRPC spans (by k8s.pod.name) are the following (a quick diff of the two sets is sketched after this list):

Details

  • ingester-zone-b-217
  • ingester-zone-b-168
  • ingester-zone-b-239
  • ingester-zone-b-273
  • ingester-zone-b-39
  • ingester-zone-b-280
  • ingester-zone-b-124
  • ingester-zone-b-207
  • ingester-zone-b-55
  • ingester-zone-b-253
  • ingester-zone-b-47
  • ingester-zone-b-161
  • ingester-zone-b-178
  • ingester-zone-b-254
  • ingester-zone-b-50
  • ingester-zone-b-8
  • ingester-zone-b-118
  • ingester-zone-b-158
  • ingester-zone-b-191
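
To make the mismatch explicit, the two sets can be diffed directly. The snippet below is only a comparison aid, not part of Mimir; the names are copied from the two lists above, and the result is subject to the Tempo truncation noted earlier.

package main

import (
	"fmt"
	"sort"
)

// difference returns the elements of a that are not present in b, sorted.
func difference(a, b []string) []string {
	inB := make(map[string]struct{}, len(b))
	for _, s := range b {
		inB[s] = struct{}{}
	}
	var out []string
	for _, s := range a {
		if _, ok := inB[s]; !ok {
			out = append(out, s)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// failingInstanceID attributes on span a849c51b0be4b77f.
	flagged := []string{
		"ingester-zone-b-178", "ingester-zone-b-280", "ingester-zone-b-207", "ingester-zone-b-191",
		"ingester-zone-b-55", "ingester-zone-b-50", "ingester-zone-b-118", "ingester-zone-b-39",
		"ingester-zone-b-273", "ingester-zone-b-161", "ingester-zone-b-8",
		"ingester-zone-a-191", "ingester-zone-a-8", "ingester-zone-a-118",
	}
	// k8s.pod.name of ingesters with errored gRPC spans in the same trace.
	errored := []string{
		"ingester-zone-b-217", "ingester-zone-b-168", "ingester-zone-b-239", "ingester-zone-b-273",
		"ingester-zone-b-39", "ingester-zone-b-280", "ingester-zone-b-124", "ingester-zone-b-207",
		"ingester-zone-b-55", "ingester-zone-b-253", "ingester-zone-b-47", "ingester-zone-b-161",
		"ingester-zone-b-178", "ingester-zone-b-254", "ingester-zone-b-50", "ingester-zone-b-8",
		"ingester-zone-b-118", "ingester-zone-b-158", "ingester-zone-b-191",
	}

	// Flagged as failing but with no errored span: the three zone-a ingesters.
	fmt.Println("flagged, no errored span:", difference(flagged, errored))
	// Errored spans that were never flagged (modulo Tempo truncation).
	fmt.Println("errored span, not flagged:", difference(errored, flagged))
}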


Ruled out network/IP churn

  • Checked pod IP histories for ingesters and distributors around the outage.
  • No IP changes occurred, ruling out routing instability.
