Bug Description
On a freshly provisioned Kafka cluster (dedicated KRaft controllers + brokers, mTLS internal listener), the cluster can end up permanently NotReady after initial bootstrap. The Strimzi cluster operator reconciles forever with:
KafkaRoller$UnforceableProblem: An error while trying to determine the active controller
Caused by: org.apache.kafka.common.errors.TimeoutException:
Timed out waiting for a node assignment. Call: describeMetadataQuorum
All 6 Kafka pods (3 broker, 3 controller) report 1/1 Running / Ready and the kafka-agent /v1/ready endpoint on brokers returns 204, but the KRaft metadata quorum is wedged. On the controller PVs, __cluster_metadata-0/quorum-state shows leaderId set but appliedOffset permanently at 0 — no metadata records ever applied.
The issue is intermittent — most fresh provisions come up cleanly. It reproduces when CoreDNS happens to be slow to publish the per-pod A records for the headless brokers service relative to Kafka pod startup. In our occurrence the headless Service was created at T+0s, controller pods started at T+57s, and all six pods logged java.net.UnknownHostException against the peer FQDNs from T+62s through T+69s. After T+69s no Kafka pod emitted another stdout line for the next 8 hours, despite the JVMs being alive and the raft log file continuing to grow on disk.
Steps to reproduce
The race is not 100% deterministic via the normal install path. The shape of the deployment that hit it is reflected in the CRs in the "Configuration files and logs" section below.
A deterministic reproduction we have not yet validated but believe should work, based on what we observed in our cluster: provision the cluster of that shape and, before the Kafka pods become Ready, block UDP/53 from the controller pods (e.g. with a NetworkPolicy or iptables rule) for ~60 s, then remove the rule. Expected outcome: controllers stay 1/1 Ready, quorum-state shows appliedOffset: 0, and the operator reconciles forever with Timed out waiting for a node assignment. Call: describeMetadataQuorum.
We're happy to validate the deterministic repro and report back with cleaner artifacts; we wanted to file the report first so others hitting the same symptom can find it.
Expected behavior
Either:
- The Kafka controller registration retries indefinitely on transient DNS failure so the cluster self-heals once CoreDNS catches up (root-cause fix lives in Kafka, not Strimzi), and/or
- The Strimzi controller-only readiness probe fails when the controller is not actually participating in the quorum, so kubelet restarts the wedged pod automatically.
Strimzi version
0.51.0
(per upstream code inspection, master as of the Kafka 4.3.0 commit 7138b28d7 is still affected — docker-images/kafka-based/kafka/scripts/kafka_readiness.sh for controller-only mode has not changed semantically since the KRaft migration; it's still netstat -lnt | grep ':9090.*LISTEN'.)
Kubernetes version
Amazon EKS 1.30
Installation method
YAML files (Kustomize). The Strimzi cluster operator deployment pins a custom Kafka image via the STRIMZI_KAFKA_IMAGES env var:
- name: STRIMZI_KAFKA_IMAGES
value: |
4.2.0=<our-registry>/transporter-kafka-4-2:0.51-kafka-4.2
The custom image at the time of the incident was a thin FROM reg.<internal>/strimzi-kafka:0.51-kafka-4.2 — no script overrides on top.
Infrastructure
Amazon EKS (eu-west-1), CoreDNS, default dnsPolicy: ClusterFirst, no service mesh. Custom (non-Strimzi-generated) cluster CA passed in via Kafka.spec.clusterCa.generateCertificateAuthority: false.
Configuration files and logs
The wedged cluster has since been recovered (rolling controllers one-by-one, leader last, then brokers one-by-one) and was subsequently torn down. Below are:
- The CRs as produced by our provisioner that day (representative — the spec did not differ between runs)
- The log snippets we captured verbatim while triaging the live cluster
Kafka CR (representative)
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
name: edbtp-kf-default
namespace: transporter-kafka-operator
annotations:
strimzi.io/node-pools: enabled
strimzi.io/kraft: enabled
spec:
kafka:
version: 4.2.0
metadataVersion: 4.2-IV1
listeners:
- name: tls
port: 9098
type: internal
tls: true
authentication:
type: tls
- name: external
port: 9198
type: loadbalancer
tls: true
authentication:
type: tls
configuration:
bootstrap:
annotations:
service.beta.kubernetes.io/aws-load-balancer-type: external
service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: instance
service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
# (other AWS LB annotations omitted)
brokers: [ { broker: 0, ... }, { broker: 1, ... }, { broker: 2, ... } ]
authorization:
type: simple
superUsers:
- CN=superuser-appliance,OU=Migration,O=EnterpriseDB,C=US
config:
offsets.topic.replication.factor: "3"
transaction.state.log.replication.factor: "3"
transaction.state.log.min.isr: "2"
default.replication.factor: "3"
min.insync.replicas: "2"
auto.create.topics.enable: "false"
logging:
type: external
valueFrom:
configMapKeyRef:
name: strimzi-cluster-operator
key: log4j.properties
metricsConfig:
type: jmxPrometheusExporter
valueFrom:
configMapKeyRef:
name: kafka-metrics
key: kafka-metrics-config.yml
template:
pod:
securityContext:
runAsUser: 10000
runAsGroup: 10000
runAsNonRoot: true
fsGroup: 10000
seccompProfile: { type: RuntimeDefault }
imagePullSecrets: [ { name: edb-cred } ]
# tolerations for control-plane and arm64 nodes
entityOperator: null
clusterCa:
generateCertificateAuthority: false # custom CA, see secrets
clientsCa:
generateCertificateAuthority: false
KafkaNodePool: controller
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
name: edbtp-kf-default-controller
namespace: transporter-kafka-operator
labels:
strimzi.io/cluster: edbtp-kf-default
spec:
replicas: 3
roles: [ controller ]
storage:
type: jbod
volumes:
- id: 0
type: persistent-claim
size: 10Gi
deleteClaim: true
kraftMetadata: shared
KafkaNodePool: broker
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
name: edbtp-kf-default-broker
namespace: transporter-kafka-operator
labels:
strimzi.io/cluster: edbtp-kf-default
spec:
replicas: 3
roles: [ broker ]
storage:
type: jbod
volumes:
- id: 0
type: persistent-claim
size: 50Gi
deleteClaim: true
kraftMetadata: shared
resources:
requests: { cpu: 300m, memory: 5Gi }
limits: { cpu: "2", memory: 15Gi }
Kafka CR status (live, while wedged)
{
"conditions": [
{
"lastTransitionTime": "2026-05-25T00:52:05.196348333Z",
"message": "An error while trying to determine the active controller",
"reason": "UnforceableProblem",
"status": "True",
"type": "NotReady"
}
],
"autoRebalance": { "state": "Idle", "lastTransitionTime": "2026-05-25T00:42:57.614065383Z" },
"kafkaNodePools": [
{ "name": "edbtp-kf-default-broker" },
{ "name": "edbtp-kf-default-controller" }
],
"observedGeneration": 1
}
Note: kafkaVersion, kafkaMetadataVersion, listeners, clusterId, operatorLastSuccessfulVersion were all absent from status while wedged. They appeared only after manual recovery.
Operator log (steady-state, every ~8 minutes, identical stack)
2026-05-25 00:52:05 ERROR AbstractOperator:284 - Reconciliation #1(watch) Kafka(transporter-kafka-operator/edbtp-kf-default): createOrUpdate failed
io.strimzi.operator.cluster.operator.resource.KafkaRoller$UnforceableProblem: An error while trying to determine the active controller
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$deferController$25(KafkaRoller.java:915)
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:840)
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.deferController(KafkaRoller.java:914)
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:435)
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$6(KafkaRoller.java:339)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: describeMetadataQuorum
Every reconciliation since Reconciliation #1 produced the same stack (we observed it up to Reconciliation #679, ~8 hours later, all identical). No other WARN/ERROR lines from the operator during the wedge.
Controller pod log — last 5 lines (after which the JVM emitted nothing for 8 hours)
(Identical pattern on all three controllers; id= differs.)
2026-05-25 00:44:06 WARN NetworkClient:1147 - [NodeToControllerChannelManager id=3 name=registration] Error connecting to node edbtp-kf-default-edbtp-kf-default-controller-3.edbtp-kf-default-kafka-brokers.transporter-kafka-operator.svc:9090 (id: 3 rack: null isFenced: false)
java.net.UnknownHostException: edbtp-kf-default-edbtp-kf-default-controller-3.edbtp-kf-default-kafka-brokers.transporter-kafka-operator.svc
at java.base/java.net.InetAddress$CachedLookup.get(Unknown Source)
at java.base/java.net.InetAddress.getAllByName0(Unknown Source)
at java.base/java.net.InetAddress.getAllByName(Unknown Source)
at org.apache.kafka.clients.DefaultHostResolver.resolve(DefaultHostResolver.java:27)
at org.apache.kafka.clients.ClientUtils.resolve(ClientUtils.java:125)
at org.apache.kafka.clients.ClusterConnectionStates$NodeConnectionState.resolveAddresses(ClusterConnectionStates.java:536)
...
2026-05-25 00:44:07 ERROR ControllerRegistrationManager:77 - [ControllerRegistrationManager id=3 incarnation=AmWCf053RvikDU9KphtiHQ] RegistrationResponseHandler: channel manager timed out before sending the request.
That last line was the final stdout line from controller-3 for the next 8 hours.
(Brokers ended the same way, with [NodeToControllerChannelManager id=N name=heartbeat] instead of name=registration.)
Raft state on disk (controller-3) — read directly from the PV after 8 h
{"clusterId":"","leaderId":3,"leaderEpoch":1,"votedId":-1,
"appliedOffset":0,
"currentVoters":[{"voterId":3},{"voterId":4},{"voterId":5}],
"data_version":0}
leaderId set, but appliedOffset is 0 and never advances — the local replicated state machine has never applied a record. Meanwhile the metadata log file __cluster_metadata-0/00000000000000000000.log is being written (raft heartbeats), so the JVM is genuinely alive — just stuck in the registration state machine.
Network/TLS verified healthy (during the wedge)
$ kubectl -n <ns> exec <operator-pod> -- openssl s_client \
-connect <controller-pod>.<brokers-svc>.<ns>.svc:9090 \
-CAfile ca.crt -cert cluster-operator.crt -key cluster-operator.key
CONNECTED — TLS handshake succeeds, full chain validates
$ kubectl -n <ns> exec <controller-pod> -- getent hosts \
<controller-pod>.<brokers-svc>.<ns>.svc.cluster.local
<pod-ip> <controller-pod>.<brokers-svc>.<ns>.svc.cluster.local
So at the time of triage the failure was not network, not TLS, and not current DNS — it's the controller registration state machine giving up after the early DNS failure and never retrying, while Strimzi's port-only probe doesn't notice.
Additional context
Recovery
Rolling controllers one-at-a-time with the leader last (identified by reading quorum-state directly off the PV) forces a new election; the new epoch comes up clean. Then rolling brokers one-at-a-time gets the broker side to re-register. Recovery time ~5 minutes once you notice. Deleting all controllers at once is unsafe.
Responding to "the probes are not great — any ideas what would work better?"
Kafka offers enough — Strimzi's own KRaft Grafana dashboards (packaging/examples/metrics/strimzi-metrics-reporter/grafana-dashboards/strimzi-kraft.json) already consume these:
kafka.server:type=raft-metrics, current-state — string: leader, follower, candidate, prospective, voted, unattached, resigned, observer
kafka.server:type=raft-metrics, current-leader, current-epoch, high-watermark, log-end-offset
kafka.server:type=broker-metadata-metrics, last-applied-record-offset (broker-side)
Concrete proposal — symmetric with the existing broker /v1/ready path:
Add GET /v1/controller-ready to the Kafka Agent. Returns 204 when kafka.server:type=raft-metrics current-state ∈ {leader, follower}, 503 otherwise, 404 if the MBean isn't yet registered (graceful fallback). Then kafka_readiness.sh calls this endpoint for controller-only mode after the existing port-listening check.
This is implemented and tested in #12768. Validation covered:
-
14 unit tests mirroring the existing testReadiness* pattern via a Supplier<String> test injection point — one case per raft state observed in Kafka 4.x, including prospective (the state I first observed during cluster validation, not in code reading).
-
kind cluster (Strimzi 0.51.0 + Kafka 4.2.0): Kafka CR reaches Ready=True, controllers return 204, brokers' /v1/ready/ is unchanged.
-
AKS cluster (Azure CNI, Kafka 4.2.0): same smoke test plus a hostAliases-induced wedge that captured this response body verbatim during the disruption:
$ kubectl exec controller-N -- curl -sS http://localhost:8080/v1/controller-ready/ -w '|HTTP %{http_code}'
{"error":"controller not ready, current raft state: prospective"}|HTTP 503
Proves end-to-end that the new endpoint reads the real JMX MBean attribute, not a constant.
One caveat I want to be transparent about
The above patch catches the broad class of "raft is not attached" failures. It detects when current-state drops out of {leader, follower}.
The wedge captured in this incident showed quorum-state with leaderId: 3, leaderEpoch: 1 persistent on disk after 8 hours — meaning raft did elect a leader. We don't have direct evidence of what current-state MBean reported during the wedge (we recovered the cluster before instrumenting it). It's possible the wedge was one layer above raft — in ControllerRegistrationManager — in which case current-state would have been leader on controller-3 and follower on 4/5, and this probe would have returned 204 throughout.
If maintainers want stricter coverage of this specific failure mode, a follow-up adding a last-applied-record-offset > 0 (or "the offset is advancing") check from broker-metadata-metrics would help. Happy to do that as a separate PR.
Even with that caveat, the PR is a strict improvement on the current probe: it replaces "port is listening" with "this controller is part of the quorum", which catches a broader class of failures (bad cert, raft partition, future regressions) regardless of whether it catches this specific ET-5871 wedge variant.
Apologies on the previous draft
Earlier revision of this issue body was AI-summarized prose without the underlying artifacts attached — sorry about that. This revision follows the bug-report template and pastes the verbatim log lines and CR snippets we captured during triage. We have not done a bare-metal reproduction yet, which we agree would help isolate this as a pure Kafka behavior question; happy to invest the time on that if it would move things along.
Bug Description
On a freshly provisioned Kafka cluster (dedicated KRaft controllers + brokers, mTLS internal listener), the cluster can end up permanently
NotReadyafter initial bootstrap. The Strimzi cluster operator reconciles forever with:All 6 Kafka pods (3 broker, 3 controller) report
1/1 Running/Readyand the kafka-agent/v1/readyendpoint on brokers returns204, but the KRaft metadata quorum is wedged. On the controller PVs,__cluster_metadata-0/quorum-stateshowsleaderIdset butappliedOffsetpermanently at0— no metadata records ever applied.The issue is intermittent — most fresh provisions come up cleanly. It reproduces when CoreDNS happens to be slow to publish the per-pod A records for the headless brokers service relative to Kafka pod startup. In our occurrence the headless
Servicewas created atT+0s, controller pods started atT+57s, and all six pods loggedjava.net.UnknownHostExceptionagainst the peer FQDNs fromT+62sthroughT+69s. AfterT+69sno Kafka pod emitted another stdout line for the next 8 hours, despite the JVMs being alive and the raft log file continuing to grow on disk.Steps to reproduce
The race is not 100% deterministic via the normal install path. The shape of the deployment that hit it is reflected in the CRs in the "Configuration files and logs" section below.
A deterministic reproduction we have not yet validated but believe should work, based on what we observed in our cluster: provision the cluster of that shape and, before the Kafka pods become Ready, block UDP/53 from the controller pods (e.g. with a NetworkPolicy or iptables rule) for ~60 s, then remove the rule. Expected outcome: controllers stay
1/1 Ready,quorum-stateshowsappliedOffset: 0, and the operator reconciles forever withTimed out waiting for a node assignment. Call: describeMetadataQuorum.We're happy to validate the deterministic repro and report back with cleaner artifacts; we wanted to file the report first so others hitting the same symptom can find it.
Expected behavior
Either:
Strimzi version
0.51.0
(per upstream code inspection, master as of the Kafka 4.3.0 commit
7138b28d7is still affected —docker-images/kafka-based/kafka/scripts/kafka_readiness.shfor controller-only mode has not changed semantically since the KRaft migration; it's stillnetstat -lnt | grep ':9090.*LISTEN'.)Kubernetes version
Amazon EKS 1.30
Installation method
YAML files (Kustomize). The Strimzi cluster operator deployment pins a custom Kafka image via the
STRIMZI_KAFKA_IMAGESenv var:The custom image at the time of the incident was a thin
FROM reg.<internal>/strimzi-kafka:0.51-kafka-4.2— no script overrides on top.Infrastructure
Amazon EKS (eu-west-1), CoreDNS, default
dnsPolicy: ClusterFirst, no service mesh. Custom (non-Strimzi-generated) cluster CA passed in viaKafka.spec.clusterCa.generateCertificateAuthority: false.Configuration files and logs
Kafka CR (representative)
KafkaNodePool: controller
KafkaNodePool: broker
Kafka CR status (live, while wedged)
{ "conditions": [ { "lastTransitionTime": "2026-05-25T00:52:05.196348333Z", "message": "An error while trying to determine the active controller", "reason": "UnforceableProblem", "status": "True", "type": "NotReady" } ], "autoRebalance": { "state": "Idle", "lastTransitionTime": "2026-05-25T00:42:57.614065383Z" }, "kafkaNodePools": [ { "name": "edbtp-kf-default-broker" }, { "name": "edbtp-kf-default-controller" } ], "observedGeneration": 1 }Note:
kafkaVersion,kafkaMetadataVersion,listeners,clusterId,operatorLastSuccessfulVersionwere all absent fromstatuswhile wedged. They appeared only after manual recovery.Operator log (steady-state, every ~8 minutes, identical stack)
Every reconciliation since
Reconciliation #1produced the same stack (we observed it up toReconciliation #679, ~8 hours later, all identical). No other WARN/ERROR lines from the operator during the wedge.Controller pod log — last 5 lines (after which the JVM emitted nothing for 8 hours)
(Identical pattern on all three controllers;
id=differs.)That last line was the final stdout line from controller-3 for the next 8 hours.
(Brokers ended the same way, with
[NodeToControllerChannelManager id=N name=heartbeat]instead ofname=registration.)Raft state on disk (controller-3) — read directly from the PV after 8 h
{"clusterId":"","leaderId":3,"leaderEpoch":1,"votedId":-1, "appliedOffset":0, "currentVoters":[{"voterId":3},{"voterId":4},{"voterId":5}], "data_version":0}leaderIdset, butappliedOffsetis 0 and never advances — the local replicated state machine has never applied a record. Meanwhile the metadata log file__cluster_metadata-0/00000000000000000000.logis being written (raft heartbeats), so the JVM is genuinely alive — just stuck in the registration state machine.Network/TLS verified healthy (during the wedge)
So at the time of triage the failure was not network, not TLS, and not current DNS — it's the controller registration state machine giving up after the early DNS failure and never retrying, while Strimzi's port-only probe doesn't notice.
Additional context
Recovery
Rolling controllers one-at-a-time with the leader last (identified by reading
quorum-statedirectly off the PV) forces a new election; the new epoch comes up clean. Then rolling brokers one-at-a-time gets the broker side to re-register. Recovery time ~5 minutes once you notice. Deleting all controllers at once is unsafe.Responding to "the probes are not great — any ideas what would work better?"
Kafka offers enough — Strimzi's own KRaft Grafana dashboards (
packaging/examples/metrics/strimzi-metrics-reporter/grafana-dashboards/strimzi-kraft.json) already consume these:kafka.server:type=raft-metrics, current-state— string:leader,follower,candidate,prospective,voted,unattached,resigned,observerkafka.server:type=raft-metrics, current-leader,current-epoch,high-watermark,log-end-offsetkafka.server:type=broker-metadata-metrics, last-applied-record-offset(broker-side)Concrete proposal — symmetric with the existing broker
/v1/readypath:Add
GET /v1/controller-readyto the Kafka Agent. Returns 204 whenkafka.server:type=raft-metrics current-state∈{leader, follower}, 503 otherwise, 404 if the MBean isn't yet registered (graceful fallback). Thenkafka_readiness.shcalls this endpoint for controller-only mode after the existing port-listening check.This is implemented and tested in #12768. Validation covered:
14 unit tests mirroring the existing
testReadiness*pattern via aSupplier<String>test injection point — one case per raft state observed in Kafka 4.x, includingprospective(the state I first observed during cluster validation, not in code reading).kind cluster (Strimzi 0.51.0 + Kafka 4.2.0): Kafka CR reaches
Ready=True, controllers return 204, brokers'/v1/ready/is unchanged.AKS cluster (Azure CNI, Kafka 4.2.0): same smoke test plus a
hostAliases-induced wedge that captured this response body verbatim during the disruption:Proves end-to-end that the new endpoint reads the real JMX MBean attribute, not a constant.
One caveat I want to be transparent about
The above patch catches the broad class of "raft is not attached" failures. It detects when
current-statedrops out of{leader, follower}.The wedge captured in this incident showed
quorum-statewithleaderId: 3, leaderEpoch: 1persistent on disk after 8 hours — meaning raft did elect a leader. We don't have direct evidence of whatcurrent-stateMBean reported during the wedge (we recovered the cluster before instrumenting it). It's possible the wedge was one layer above raft — inControllerRegistrationManager— in which casecurrent-statewould have beenleaderon controller-3 andfolloweron 4/5, and this probe would have returned 204 throughout.If maintainers want stricter coverage of this specific failure mode, a follow-up adding a
last-applied-record-offset > 0(or "the offset is advancing") check frombroker-metadata-metricswould help. Happy to do that as a separate PR.Even with that caveat, the PR is a strict improvement on the current probe: it replaces "port is listening" with "this controller is part of the quorum", which catches a broader class of failures (bad cert, raft partition, future regressions) regardless of whether it catches this specific ET-5871 wedge variant.
Apologies on the previous draft
Earlier revision of this issue body was AI-summarized prose without the underlying artifacts attached — sorry about that. This revision follows the bug-report template and pastes the verbatim log lines and CR snippets we captured during triage. We have not done a bare-metal reproduction yet, which we agree would help isolate this as a pure Kafka behavior question; happy to invest the time on that if it would move things along.