
[Docs/Bug] 'ingester.ring.replication_factor' critically affects query availability even when ingest storage (Kafka) is enabled #13801

@qiu-peng20

Description


What is the bug?

I am running Grafana Mimir with Ingest Storage (Kafka) enabled. According to the configuration comments and documentation for ingester.ring.replication_factor:

> "This configuration is not used when ingest storage is enabled."

However, I have observed that this configuration is critical for query availability (Read Path) when using the Partition Ring (Kafka).

If this value is left at its default while using Kafka (which appears to resolve to 1 in some contexts, i.e. when it is not explicitly set to 3), performing a rolling restart of the Ingester StatefulSet causes immediate query failures.

The error observed in the querier/ruler is:

```
partition <ID>: too many unhealthy instances in the ring
```

It appears that while the Write Path (Distributor -> Kafka) relies on Kafka's replication, the Read Path (Ingesters consuming from Kafka) relies on ingester.ring.replication_factor to determine how many Ingesters consume the same partition. If this is 1, the restart of a single Ingester makes that partition's data completely unavailable to queries until the Ingester rejoins.
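The availability arithmetic described above can be sketched as follows. This is an illustrative model, not Mimir code: it assumes the querier needs at least one healthy consumer per partition, and that a rolling restart takes down one Ingester at a time.

```python
# Illustrative sketch (not Mimir source): per-partition read availability
# during a rolling restart, under the assumption that the read path needs
# at least one healthy ingester consuming each partition.
def partition_readable(consumers_per_partition: int, restarting: int) -> bool:
    """consumers_per_partition models ingester.ring.replication_factor."""
    healthy = consumers_per_partition - restarting
    return healthy >= 1

# replication_factor=1: one restarting ingester takes the partition down.
assert not partition_readable(1, 1)
# replication_factor=3: a one-at-a-time rollout leaves the partition readable.
assert partition_readable(3, 1)
```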

How to reproduce it?

  1. Deploy Mimir on Kubernetes (StatefulSet) with ingest_storage enabled (using Kafka).
  2. Do not explicitly set ingester.ring.replication_factor to 3 (leave it at default, or set it to 1).
  3. Start a rolling update of the Ingester StatefulSet (e.g., kubectl rollout restart statefulset/mimir-ingester).
  4. Execute PromQL queries continuously during the rollout.
  5. Observe that while an Ingester pod is restarting (even during a graceful shutdown), queries immediately fail with 500 errors mentioning "unhealthy instances".

What did you think would happen?

Based on the documentation stating "This configuration is not used when ingest storage is enabled", I expected that:

  1. I did not need to configure ingester.ring.replication_factor when using Kafka.
  2. Mimir would handle Ingester rolling updates gracefully without query interruptions, assuming Kafka ensures data durability.

Suggestion:
The documentation should be updated to clarify that ingester.ring.replication_factor IS used on the Read Path (Partition Ring) to provide high availability during consumption. It should recommend setting this to 3 (or at least >1) even when Kafka is enabled, so that Ingester restarts can be tolerated.
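As a concrete illustration of the suggestion, a minimal config sketch might look like the following. This assumes the standard Mimir YAML layout (key names per the Mimir configuration reference); the Kafka address and topic are placeholders.

```yaml
# Sketch only: keep replication_factor > 1 even with ingest storage enabled,
# so more than one ingester consumes each partition on the read path.
ingest_storage:
  enabled: true
  kafka:
    address: "kafka:9092"        # placeholder
    topic: "mimir-ingestion"
ingester:
  ring:
    replication_factor: 3
```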

What was your environment?

Environment:

Mimir version: 3.0.0

Deployment: Kubernetes (StatefulSet)

Kafka enabled: Yes

Additional Context:

```yaml
runtimeConfig:
  overrides:
    anonymous:
      ingestion_rate: 8000000
      ingestion_burst_size: 40000000
      max_global_series_per_user: 250000000
      max_label_names_per_series: 100
      ruler_max_rules_per_rule_group: 40

ingest_storage:
  enabled: true
  kafka:
    address: "xxxx:9092"
    topic: "mimir-ingestion"
    client_id: "mimir-ingester"
    consumer_group: ""
    producer_max_record_size_bytes: 10485760
    producer_max_buffered_bytes: 1073741824
    consume_from_position_at_startup: "last-offset"
```

Any additional context to share?

No response

Labels: bug (Something isn't working)