[Bug] Controller pod reports 1/1 Ready while KRaft quorum is wedged after cold-boot DNS race; cluster never recovers without manual restart #12760

@mly-zju

Bug Description

On a freshly provisioned Kafka cluster (dedicated KRaft controllers + brokers, mTLS internal listener), the cluster can end up permanently NotReady after initial bootstrap. The Strimzi cluster operator reconciles forever with:

KafkaRoller$UnforceableProblem: An error while trying to determine the active controller
Caused by: org.apache.kafka.common.errors.TimeoutException:
    Timed out waiting for a node assignment. Call: describeMetadataQuorum

All 6 Kafka pods (3 broker, 3 controller) report 1/1 Running / Ready and the kafka-agent /v1/ready endpoint on brokers returns 204, but the KRaft metadata quorum is wedged. On the controller PVs, __cluster_metadata-0/quorum-state shows leaderId set but appliedOffset permanently at 0 — no metadata records ever applied.

The issue is intermittent — most fresh provisions come up cleanly. It reproduces when CoreDNS happens to be slow to publish the per-pod A records for the headless brokers service relative to Kafka pod startup. In our occurrence the headless Service was created at T+0s, controller pods started at T+57s, and all six pods logged java.net.UnknownHostException against the peer FQDNs from T+62s through T+69s. After T+69s no Kafka pod emitted another stdout line for the next 8 hours, despite the JVMs being alive and the raft log file continuing to grow on disk.

Steps to reproduce

The race is not 100% deterministic via the normal install path. The shape of the deployment that hit it is reflected in the CRs in the "Configuration files and logs" section below.

We have not yet validated a deterministic reproduction, but based on what we observed in our cluster we believe the following should work: provision a cluster of that shape and, before the Kafka pods become Ready, block UDP/53 from the controller pods (e.g. with a NetworkPolicy or iptables rule) for ~60 s, then remove the rule. Expected outcome: controllers stay 1/1 Ready, quorum-state shows appliedOffset: 0, and the operator reconciles forever with Timed out waiting for a node assignment. Call: describeMetadataQuorum.

We're happy to validate the deterministic repro and report back with cleaner artifacts; we wanted to file the report first so others hitting the same symptom can find it.

Expected behavior

Either:

  1. The Kafka controller registration retries indefinitely on transient DNS failure so the cluster self-heals once CoreDNS catches up (root-cause fix lives in Kafka, not Strimzi), and/or
  2. The Strimzi controller-only readiness probe fails when the controller is not actually participating in the quorum, so kubelet restarts the wedged pod automatically.
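To make option 1 concrete, here is a minimal sketch of "retry indefinitely on transient DNS failure" with capped exponential backoff. It is written in Python purely for illustration and is not Kafka's code; resolve_with_retry and its parameters (resolver, initial_delay, max_delay, sleep) are hypothetical names introduced here.

```python
import socket
import time
from typing import Callable, List


def resolve_with_retry(
    host: str,
    resolver: Callable[[str], List[str]] = lambda h: socket.gethostbyname_ex(h)[2],
    initial_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Resolve host, retrying forever with capped exponential backoff.

    Hypothetical helper: it never gives up, so a late-published A record
    is picked up as soon as the DNS server starts answering.
    """
    delay = initial_delay
    while True:
        try:
            return resolver(host)
        except socket.gaierror:
            # Treated as transient (for example, the record is not published yet).
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

The resolver and sleep arguments are injection points so the loop can be exercised without real DNS; a caller would normally use the defaults.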

Strimzi version

0.51.0

(per upstream code inspection, master as of the Kafka 4.3.0 commit 7138b28d7 is still affected — docker-images/kafka-based/kafka/scripts/kafka_readiness.sh for controller-only mode has not changed semantically since the KRaft migration; it's still netstat -lnt | grep ':9090.*LISTEN'.)

Kubernetes version

Amazon EKS 1.30

Installation method

YAML files (Kustomize). The Strimzi cluster operator deployment pins a custom Kafka image via the STRIMZI_KAFKA_IMAGES env var:

- name: STRIMZI_KAFKA_IMAGES
  value: |
    4.2.0=<our-registry>/transporter-kafka-4-2:0.51-kafka-4.2

The custom image at the time of the incident was a thin FROM reg.<internal>/strimzi-kafka:0.51-kafka-4.2 — no script overrides on top.

Infrastructure

Amazon EKS (eu-west-1), CoreDNS, default dnsPolicy: ClusterFirst, no service mesh. Custom (non-Strimzi-generated) cluster CA passed in via Kafka.spec.clusterCa.generateCertificateAuthority: false.

Configuration files and logs

The wedged cluster has since been recovered (rolling controllers one-by-one, leader last, then brokers one-by-one) and was subsequently torn down. Below are:

  • The CRs as produced by our provisioner that day (representative — the spec did not differ between runs)
  • The log snippets we captured verbatim while triaging the live cluster

Kafka CR (representative)

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: edbtp-kf-default
  namespace: transporter-kafka-operator
  annotations:
    strimzi.io/node-pools: enabled
    strimzi.io/kraft: enabled
spec:
  kafka:
    version: 4.2.0
    metadataVersion: 4.2-IV1
    listeners:
      - name: tls
        port: 9098
        type: internal
        tls: true
        authentication:
          type: tls
      - name: external
        port: 9198
        type: loadbalancer
        tls: true
        authentication:
          type: tls
        configuration:
          bootstrap:
            annotations:
              service.beta.kubernetes.io/aws-load-balancer-type: external
              service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: instance
              service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
              # (other AWS LB annotations omitted)
          brokers: [ { broker: 0, ... }, { broker: 1, ... }, { broker: 2, ... } ]
    authorization:
      type: simple
      superUsers:
        - CN=superuser-appliance,OU=Migration,O=EnterpriseDB,C=US
    config:
      offsets.topic.replication.factor: "3"
      transaction.state.log.replication.factor: "3"
      transaction.state.log.min.isr: "2"
      default.replication.factor: "3"
      min.insync.replicas: "2"
      auto.create.topics.enable: "false"
    logging:
      type: external
      valueFrom:
        configMapKeyRef:
          name: strimzi-cluster-operator
          key: log4j.properties
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: kafka-metrics
          key: kafka-metrics-config.yml
    template:
      pod:
        securityContext:
          runAsUser: 10000
          runAsGroup: 10000
          runAsNonRoot: true
          fsGroup: 10000
          seccompProfile: { type: RuntimeDefault }
        imagePullSecrets: [ { name: edb-cred } ]
        # tolerations for control-plane and arm64 nodes
  entityOperator: null
  clusterCa:
    generateCertificateAuthority: false   # custom CA, see secrets
  clientsCa:
    generateCertificateAuthority: false

KafkaNodePool: controller

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: edbtp-kf-default-controller
  namespace: transporter-kafka-operator
  labels:
    strimzi.io/cluster: edbtp-kf-default
spec:
  replicas: 3
  roles: [ controller ]
  storage:
    type: jbod
    volumes:
      - id: 0
        type: persistent-claim
        size: 10Gi
        deleteClaim: true
        kraftMetadata: shared

KafkaNodePool: broker

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: edbtp-kf-default-broker
  namespace: transporter-kafka-operator
  labels:
    strimzi.io/cluster: edbtp-kf-default
spec:
  replicas: 3
  roles: [ broker ]
  storage:
    type: jbod
    volumes:
      - id: 0
        type: persistent-claim
        size: 50Gi
        deleteClaim: true
        kraftMetadata: shared
  resources:
    requests: { cpu: 300m,  memory: 5Gi  }
    limits:   { cpu: "2",   memory: 15Gi }

Kafka CR status (live, while wedged)

{
  "conditions": [
    {
      "lastTransitionTime": "2026-05-25T00:52:05.196348333Z",
      "message": "An error while trying to determine the active controller",
      "reason": "UnforceableProblem",
      "status": "True",
      "type": "NotReady"
    }
  ],
  "autoRebalance": { "state": "Idle", "lastTransitionTime": "2026-05-25T00:42:57.614065383Z" },
  "kafkaNodePools": [
    { "name": "edbtp-kf-default-broker" },
    { "name": "edbtp-kf-default-controller" }
  ],
  "observedGeneration": 1
}

Note: kafkaVersion, kafkaMetadataVersion, listeners, clusterId, operatorLastSuccessfulVersion were all absent from status while wedged. They appeared only after manual recovery.

Operator log (steady-state, every ~8 minutes, identical stack)

2026-05-25 00:52:05 ERROR AbstractOperator:284 - Reconciliation #1(watch) Kafka(transporter-kafka-operator/edbtp-kf-default): createOrUpdate failed
io.strimzi.operator.cluster.operator.resource.KafkaRoller$UnforceableProblem: An error while trying to determine the active controller
        at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$deferController$25(KafkaRoller.java:915)
        at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:840)
        at io.strimzi.operator.cluster.operator.resource.KafkaRoller.deferController(KafkaRoller.java:914)
        at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:435)
        at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$6(KafkaRoller.java:339)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
        at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: describeMetadataQuorum

Every reconciliation since Reconciliation #1 produced the same stack (we observed it up to Reconciliation #679, ~8 hours later, all identical). No other WARN/ERROR lines from the operator during the wedge.

Controller pod log — last 5 lines (after which the JVM emitted nothing for 8 hours)

(Identical pattern on all three controllers; id= differs.)

2026-05-25 00:44:06 WARN  NetworkClient:1147 - [NodeToControllerChannelManager id=3 name=registration] Error connecting to node edbtp-kf-default-edbtp-kf-default-controller-3.edbtp-kf-default-kafka-brokers.transporter-kafka-operator.svc:9090 (id: 3 rack: null isFenced: false)
java.net.UnknownHostException: edbtp-kf-default-edbtp-kf-default-controller-3.edbtp-kf-default-kafka-brokers.transporter-kafka-operator.svc
        at java.base/java.net.InetAddress$CachedLookup.get(Unknown Source)
        at java.base/java.net.InetAddress.getAllByName0(Unknown Source)
        at java.base/java.net.InetAddress.getAllByName(Unknown Source)
        at org.apache.kafka.clients.DefaultHostResolver.resolve(DefaultHostResolver.java:27)
        at org.apache.kafka.clients.ClientUtils.resolve(ClientUtils.java:125)
        at org.apache.kafka.clients.ClusterConnectionStates$NodeConnectionState.resolveAddresses(ClusterConnectionStates.java:536)
        ...
2026-05-25 00:44:07 ERROR ControllerRegistrationManager:77 - [ControllerRegistrationManager id=3 incarnation=AmWCf053RvikDU9KphtiHQ] RegistrationResponseHandler: channel manager timed out before sending the request.

That last line was the final stdout line from controller-3 for the next 8 hours.

(Brokers ended the same way, with [NodeToControllerChannelManager id=N name=heartbeat] instead of name=registration.)

Raft state on disk (controller-3) — read directly from the PV after 8 h

{"clusterId":"","leaderId":3,"leaderEpoch":1,"votedId":-1,
 "appliedOffset":0,
 "currentVoters":[{"voterId":3},{"voterId":4},{"voterId":5}],
 "data_version":0}

leaderId set, but appliedOffset is 0 and never advances — the local replicated state machine has never applied a record. Meanwhile the metadata log file __cluster_metadata-0/00000000000000000000.log is being written (raft heartbeats), so the JVM is genuinely alive — just stuck in the registration state machine.
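The on-disk check described above can be sketched as a small heuristic. This assumes, as this report does, that appliedOffset in quorum-state reflects applied metadata records; looks_wedged is a hypothetical helper, and a single snapshot taken right after startup would also match, so it is only meaningful on a cluster that has been up for a while.

```python
import json

# The quorum-state content quoted in this report (controller-3, after 8 h).
QUORUM_STATE = (
    '{"clusterId":"","leaderId":3,"leaderEpoch":1,"votedId":-1,'
    '"appliedOffset":0,'
    '"currentVoters":[{"voterId":3},{"voterId":4},{"voterId":5}],'
    '"data_version":0}'
)


def looks_wedged(raw: str) -> bool:
    """Heuristic: a leader is recorded but nothing has ever been applied."""
    state = json.loads(raw)
    leader_elected = state.get("leaderId", -1) >= 0
    nothing_applied = state.get("appliedOffset", 0) == 0
    return leader_elected and nothing_applied


if __name__ == "__main__":
    print(looks_wedged(QUORUM_STATE))
```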

Network/TLS verified healthy (during the wedge)

$ kubectl -n <ns> exec <operator-pod> -- openssl s_client \
    -connect <controller-pod>.<brokers-svc>.<ns>.svc:9090 \
    -CAfile ca.crt -cert cluster-operator.crt -key cluster-operator.key
CONNECTED — TLS handshake succeeds, full chain validates

$ kubectl -n <ns> exec <controller-pod> -- getent hosts \
    <controller-pod>.<brokers-svc>.<ns>.svc.cluster.local
<pod-ip>   <controller-pod>.<brokers-svc>.<ns>.svc.cluster.local

So at the time of triage the failure was not network, not TLS, and not current DNS — it's the controller registration state machine giving up after the early DNS failure and never retrying, while Strimzi's port-only probe doesn't notice.

Additional context

Recovery

Rolling controllers one-at-a-time with the leader last (identified by reading quorum-state directly off the PV) forces a new election; the new epoch comes up clean. Then rolling brokers one-at-a-time gets the broker side to re-register. Recovery time ~5 minutes once you notice. Deleting all controllers at once is unsafe.
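As an illustration of the ordering used during recovery, the sketch below derives a roll order from the voter IDs and leaderId read out of quorum-state. roll_order is a hypothetical helper, not part of Strimzi; it only computes the order and does not restart anything.

```python
from typing import Iterable, List


def roll_order(voter_ids: Iterable[int], leader_id: int) -> List[int]:
    """Return voter IDs with followers first and the leader last."""
    voters = sorted(set(voter_ids))
    followers = [v for v in voters if v != leader_id]
    if leader_id in voters:
        # The leader goes last so its restart triggers the final election.
        return followers + [leader_id]
    # No known leader among the voters: fall back to plain ID order.
    return followers
```

For the quorum-state shown in this report (voters 3, 4, 5 and leaderId 3) this yields 4, 5, 3.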

Responding to "the probes are not great — any ideas what would work better?"

Kafka offers enough — Strimzi's own KRaft Grafana dashboards (packaging/examples/metrics/strimzi-metrics-reporter/grafana-dashboards/strimzi-kraft.json) already consume these:

  • kafka.server:type=raft-metrics, current-state — string: leader, follower, candidate, prospective, voted, unattached, resigned, observer
  • kafka.server:type=raft-metrics, current-leader, current-epoch, high-watermark, log-end-offset
  • kafka.server:type=broker-metadata-metrics, last-applied-record-offset (broker-side)

Concrete proposal — symmetric with the existing broker /v1/ready path:

Add GET /v1/controller-ready to the Kafka Agent. It returns 204 when the current-state attribute of kafka.server:type=raft-metrics is leader or follower, 503 for any other state, and 404 if the MBean isn't yet registered (graceful fallback). kafka_readiness.sh then calls this endpoint in controller-only mode after the existing port-listening check.

This is implemented and tested in #12768. Validation covered:

  • 14 unit tests mirroring the existing testReadiness* pattern via a Supplier<String> test injection point — one case per raft state observed in Kafka 4.x, including prospective (the state I first observed during cluster validation, not in code reading).

  • kind cluster (Strimzi 0.51.0 + Kafka 4.2.0): Kafka CR reaches Ready=True, controllers return 204, brokers' /v1/ready/ is unchanged.

  • AKS cluster (Azure CNI, Kafka 4.2.0): same smoke test plus a hostAliases-induced wedge that captured this response body verbatim during the disruption:

    $ kubectl exec controller-N -- curl -sS http://localhost:8080/v1/controller-ready/ -w '|HTTP %{http_code}'
    {"error":"controller not ready, current raft state: prospective"}|HTTP 503
    

    This shows end-to-end that the new endpoint reads the real JMX MBean attribute rather than returning a constant.
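The status mapping proposed for /v1/controller-ready can be sketched as follows. This is an illustrative Python rendering of the rule stated in the proposal, not the Java code in the PR; controller_ready_status and READY_STATES are hypothetical names, and None stands in for "the MBean is not registered yet".

```python
from typing import Optional

# States in which the controller is treated as part of the quorum.
READY_STATES = frozenset({"leader", "follower"})


def controller_ready_status(current_state: Optional[str]) -> int:
    """Map the raft current-state value to an HTTP status code."""
    if current_state is None:
        # MBean not registered yet: graceful fallback.
        return 404
    return 204 if current_state in READY_STATES else 503
```

Under this rule the prospective state captured in the AKS test maps to 503, matching the response shown above.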

One caveat I want to be transparent about

The above patch catches the broad class of "raft is not attached" failures. It detects when current-state drops out of {leader, follower}.

The wedge captured in this incident showed quorum-state with leaderId: 3, leaderEpoch: 1 persistent on disk after 8 hours, meaning raft did elect a leader. We don't have direct evidence of what the current-state MBean attribute reported during the wedge (we recovered the cluster before instrumenting it). It's possible the wedge was one layer above raft, in ControllerRegistrationManager, in which case current-state would have been leader on controller-3 and follower on 4/5, and this probe would have returned 204 throughout.

If maintainers want stricter coverage of this specific failure mode, a follow-up adding a last-applied-record-offset > 0 (or "the offset is advancing") check from broker-metadata-metrics would help. Happy to do that as a separate PR.

Even with that caveat, the PR is a strict improvement on the current probe: it replaces "port is listening" with "this controller is part of the quorum", which catches a broader class of failures (bad cert, raft partition, future regressions) regardless of whether it catches this specific ET-5871 wedge variant.
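The stricter follow-up check mentioned above (offset > 0, or "the offset is advancing") could look roughly like this. It is a sketch under assumptions: metadata_progress_ok is a hypothetical name, the two arguments are successive samples of an applied-offset metric, and how and how often those samples are taken is left open.

```python
from typing import Optional


def metadata_progress_ok(
    current_offset: int, previous_offset: Optional[int] = None
) -> bool:
    """Return True if metadata has been applied and, when a previous
    sample is available, the applied offset has moved forward."""
    if current_offset <= 0:
        # Nothing has ever been applied (the symptom seen in this incident).
        return False
    if previous_offset is None:
        # Only one sample: fall back to the "offset > 0" form of the check.
        return True
    return current_offset > previous_offset
```

With a single sample this reduces to the offset > 0 variant; passing the previous sample enables the "advancing" variant, which is stricter but depends on the offset actually moving between samples on a healthy cluster.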

Apologies on the previous draft

An earlier revision of this issue body was AI-summarized prose without the underlying artifacts attached; sorry about that. This revision follows the bug-report template and pastes the verbatim log lines and CR snippets we captured during triage. We have not done a bare-metal reproduction yet, which we agree would help isolate this as a pure Kafka behavior question; we are happy to invest the time on that if it would move things along.
