Description
What is the bug?
I am running Grafana Mimir with Ingest Storage (Kafka) enabled. According to the configuration comments and documentation for `ingester.ring.replication_factor`:

> "This configuration is not used when ingest storage is enabled."
However, I have observed that this configuration is critical for query availability (Read Path) when using the Partition Ring (Kafka).
If this value is left at its default (which appears to be 1 in some contexts, or is simply not explicitly set to 3) while using Kafka, a rolling restart of the Ingester StatefulSet causes immediate query failures.
The error observed in the querier/ruler is:
`partition <ID>: too many unhealthy instances in the ring`
It appears that while the Write Path (Distributor -> Kafka) relies on Kafka's replication, the Read Path (Ingester consuming from Kafka) relies on `ingester.ring.replication_factor` to determine how many Ingesters consume the same partition. If this is 1, a single Ingester restart results in 100% downtime for that partition's data availability.
How to reproduce it?
- Deploy Mimir on Kubernetes (StatefulSet) with `ingest_storage` enabled (using Kafka).
- Do not explicitly set `ingester.ring.replication_factor` to 3 (leave it at default, or set it to 1).
- Start a rolling update of the Ingester StatefulSet (e.g., `kubectl rollout restart statefulset/mimir-ingester`).
- Execute PromQL queries continuously during the rollout.
- Observe that when an Ingester pod restarts (even with a graceful shutdown), queries fail immediately with 500 errors about "unhealthy instances".
What did you think would happen?
Based on the documentation stating "This configuration is not used when ingest storage is enabled", I expected that:
- I did not need to configure `ingester.ring.replication_factor` when using Kafka.
- Mimir would handle Ingester rolling updates gracefully without query interruptions, since Kafka ensures data durability.
Suggestion:
The documentation should be updated to clarify that `ingester.ring.replication_factor` IS used for the Read Path / Partition Ring to ensure high availability during consumption. It should recommend setting this to 3 (or at least >1) even when Kafka is enabled, so that Ingester restarts can be tolerated.
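As a concrete illustration of the suggested workaround, the relevant setting might look like the sketch below. The key layout is assumed from Mimir's standard YAML configuration and should be verified against the version in use:

```yaml
# Sketch (assumed key layout): explicitly set the ingester ring
# replication factor even with ingest storage (Kafka) enabled, so that
# multiple Ingesters consume each partition and a single restarting pod
# does not make that partition's recent data unqueryable.
ingester:
  ring:
    replication_factor: 3
```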
What was your environment?
Environment:
Mimir version: 3.0.0
Deployment: Kubernetes (StatefulSet)
Kafka enabled: Yes
Additional Context:
```yaml
runtimeConfig:
  overrides:
    anonymous:
      ingestion_rate: 8000000
      ingestion_burst_size: 40000000
      max_global_series_per_user: 250000000
      max_label_names_per_series: 100
      ruler_max_rules_per_rule_group: 40

ingest_storage:
  enabled: true
  kafka:
    address: "xxxx:9092"
    topic: "mimir-ingestion"
    client_id: "mimir-ingester"
    consumer_group: ""
    producer_max_record_size_bytes: 10485760
    producer_max_buffered_bytes: 1073741824
    consume_from_position_at_startup: "last-offset"
```
Any additional context to share?
No response