[Bug] Controller pod reports 1/1 Ready while KRaft quorum is wedged after cold-boot DNS race; cluster never recovers without manual restart #12760

### Bug Description

On a freshly provisioned Kafka cluster (dedicated KRaft controllers + brokers, mTLS internal listener), the cluster can end up permanently `NotReady` after initial bootstrap. The Strimzi cluster operator reconciles forever with:

```
KafkaRoller$UnforceableProblem: An error while trying to determine the active controller
Caused by: org.apache.kafka.common.errors.TimeoutException:
    Timed out waiting for a node assignment. Call: describeMetadataQuorum
```

All 6 Kafka pods (3 broker, 3 controller) report `1/1 Running` / `Ready` and the kafka-agent `/v1/ready` endpoint on brokers returns `204`, but the KRaft metadata quorum is wedged. On the controller PVs, `__cluster_metadata-0/quorum-state` shows `leaderId` set but `appliedOffset` permanently at `0` — no metadata records ever applied.

The issue is **intermittent** — most fresh provisions come up cleanly. It reproduces when CoreDNS happens to be slow to publish the per-pod A records for the headless brokers service relative to Kafka pod startup. In our occurrence the headless `Service` was created at `T+0s`, controller pods started at `T+57s`, and all six pods logged `java.net.UnknownHostException` against the peer FQDNs from `T+62s` through `T+69s`. After `T+69s` **no Kafka pod emitted another stdout line for the next 8 hours**, despite the JVMs being alive and the raft log file continuing to grow on disk.

### Steps to reproduce

The race is not 100% deterministic via the normal install path. The shape of the deployment that hit it is reflected in the CRs in the "Configuration files and logs" section below.

We have not yet validated a deterministic reproduction, but based on what we observed in our cluster we believe the following should work: provision a cluster of that shape and, before the Kafka pods become Ready, block UDP/53 from the controller pods (e.g. with a NetworkPolicy or iptables rule) for ~60 s, then remove the rule. Expected outcome: controllers stay `1/1 Ready`, `quorum-state` shows `appliedOffset: 0`, and the operator reconciles forever with `Timed out waiting for a node assignment. Call: describeMetadataQuorum`.

We're happy to validate the deterministic repro and report back with cleaner artifacts; we wanted to file the report first so others hitting the same symptom can find it.

### Expected behavior

Either:
1. The Kafka controller registration retries indefinitely on transient DNS failure so the cluster self-heals once CoreDNS catches up (root-cause fix lives in Kafka, not Strimzi), and/or
2. The Strimzi controller-only readiness probe fails when the controller is not actually participating in the quorum, so kubelet restarts the wedged pod automatically.
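
To make option 1 concrete, here is a minimal Python sketch of "retry indefinitely on transient DNS failure" with capped exponential backoff. It is only an illustration of the behavior we would expect, not Kafka's code; the names `backoff_delays` and `resolve_forever`, and the delay values, are hypothetical.

```python
import socket
import time
from typing import Callable, Iterator


def backoff_delays(initial: float = 0.1, factor: float = 2.0,
                   cap: float = 30.0) -> Iterator[float]:
    """Yield an endless sequence of delays: initial, initial*factor, ... capped at `cap`."""
    delay = initial
    while True:
        yield delay
        delay = min(delay * factor, cap)


def resolve_forever(host: str,
                    resolve: Callable[[str], str] = socket.gethostbyname,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Keep resolving `host` until DNS answers, sleeping between attempts.

    `resolve` and `sleep` are injectable so the loop can be exercised
    without real DNS or real waiting.
    """
    for delay in backoff_delays():
        try:
            return resolve(host)
        except socket.gaierror:
            # Transient resolution failure: wait, then try again instead of giving up.
            sleep(delay)
    raise AssertionError("unreachable: backoff_delays() never ends")
```

The point of the sketch is only that a resolution failure leads to another attempt rather than a terminal state; the cap keeps the retry interval bounded once CoreDNS catches up.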

### Strimzi version

0.51.0

(per upstream code inspection, master as of the Kafka 4.3.0 commit `7138b28d7` is still affected — `docker-images/kafka-based/kafka/scripts/kafka_readiness.sh` for controller-only mode has not changed semantically since the KRaft migration; it's still `netstat -lnt | grep ':9090.*LISTEN'`.)

### Kubernetes version

Amazon EKS 1.30

### Installation method

YAML files (Kustomize). The Strimzi cluster operator deployment pins a custom Kafka image via the `STRIMZI_KAFKA_IMAGES` env var:

```yaml
- name: STRIMZI_KAFKA_IMAGES
  value: |
    4.2.0=<our-registry>/transporter-kafka-4-2:0.51-kafka-4.2
```

The custom image at the time of the incident was a thin `FROM reg.<internal>/strimzi-kafka:0.51-kafka-4.2` — no script overrides on top.

### Infrastructure

Amazon EKS (eu-west-1), CoreDNS, default `dnsPolicy: ClusterFirst`, no service mesh. Custom (non-Strimzi-generated) cluster CA passed in via `Kafka.spec.clusterCa.generateCertificateAuthority: false`.

### Configuration files and logs

> The wedged cluster has since been recovered (rolling controllers one-by-one, leader last, then brokers one-by-one) and was subsequently torn down. Below are:
> - The CRs as produced by our provisioner that day (representative — the spec did not differ between runs)
> - The log snippets we captured verbatim while triaging the live cluster

#### Kafka CR (representative)

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: edbtp-kf-default
  namespace: transporter-kafka-operator
  annotations:
    strimzi.io/node-pools: enabled
    strimzi.io/kraft: enabled
spec:
  kafka:
    version: 4.2.0
    metadataVersion: 4.2-IV1
    listeners:
      - name: tls
        port: 9098
        type: internal
        tls: true
        authentication:
          type: tls
      - name: external
        port: 9198
        type: loadbalancer
        tls: true
        authentication:
          type: tls
        configuration:
          bootstrap:
            annotations:
              service.beta.kubernetes.io/aws-load-balancer-type: external
              service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: instance
              service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
              # (other AWS LB annotations omitted)
          brokers: [ { broker: 0, ... }, { broker: 1, ... }, { broker: 2, ... } ]
    authorization:
      type: simple
      superUsers:
        - CN=superuser-appliance,OU=Migration,O=EnterpriseDB,C=US
    config:
      offsets.topic.replication.factor: "3"
      transaction.state.log.replication.factor: "3"
      transaction.state.log.min.isr: "2"
      default.replication.factor: "3"
      min.insync.replicas: "2"
      auto.create.topics.enable: "false"
    logging:
      type: external
      valueFrom:
        configMapKeyRef:
          name: strimzi-cluster-operator
          key: log4j.properties
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: kafka-metrics
          key: kafka-metrics-config.yml
    template:
      pod:
        securityContext:
          runAsUser: 10000
          runAsGroup: 10000
          runAsNonRoot: true
          fsGroup: 10000
          seccompProfile: { type: RuntimeDefault }
        imagePullSecrets: [ { name: edb-cred } ]
        # tolerations for control-plane and arm64 nodes
  entityOperator: null
  clusterCa:
    generateCertificateAuthority: false   # custom CA, see secrets
  clientsCa:
    generateCertificateAuthority: false
```

#### KafkaNodePool: controller

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: edbtp-kf-default-controller
  namespace: transporter-kafka-operator
  labels:
    strimzi.io/cluster: edbtp-kf-default
spec:
  replicas: 3
  roles: [ controller ]
  storage:
    type: jbod
    volumes:
      - id: 0
        type: persistent-claim
        size: 10Gi
        deleteClaim: true
        kraftMetadata: shared
```

#### KafkaNodePool: broker

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: edbtp-kf-default-broker
  namespace: transporter-kafka-operator
  labels:
    strimzi.io/cluster: edbtp-kf-default
spec:
  replicas: 3
  roles: [ broker ]
  storage:
    type: jbod
    volumes:
      - id: 0
        type: persistent-claim
        size: 50Gi
        deleteClaim: true
        kraftMetadata: shared
  resources:
    requests: { cpu: 300m,  memory: 5Gi  }
    limits:   { cpu: "2",   memory: 15Gi }
```

#### Kafka CR status (live, while wedged)

```json
{
  "conditions": [
    {
      "lastTransitionTime": "2026-05-25T00:52:05.196348333Z",
      "message": "An error while trying to determine the active controller",
      "reason": "UnforceableProblem",
      "status": "True",
      "type": "NotReady"
    }
  ],
  "autoRebalance": { "state": "Idle", "lastTransitionTime": "2026-05-25T00:42:57.614065383Z" },
  "kafkaNodePools": [
    { "name": "edbtp-kf-default-broker" },
    { "name": "edbtp-kf-default-controller" }
  ],
  "observedGeneration": 1
}
```

Note: `kafkaVersion`, `kafkaMetadataVersion`, `listeners`, `clusterId`, `operatorLastSuccessfulVersion` were all **absent** from `status` while wedged. They appeared only after manual recovery.

#### Operator log (steady-state, every ~8 minutes, identical stack)

```
2026-05-25 00:52:05 ERROR AbstractOperator:284 - Reconciliation #1(watch) Kafka(transporter-kafka-operator/edbtp-kf-default): createOrUpdate failed
io.strimzi.operator.cluster.operator.resource.KafkaRoller$UnforceableProblem: An error while trying to determine the active controller
        at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$deferController$25(KafkaRoller.java:915)
        at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:840)
        at io.strimzi.operator.cluster.operator.resource.KafkaRoller.deferController(KafkaRoller.java:914)
        at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:435)
        at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$6(KafkaRoller.java:339)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
        at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: describeMetadataQuorum
```

Every reconciliation since `Reconciliation #1` produced the same stack (we observed it up to `Reconciliation #679`, ~8 hours later, all identical). No other WARN/ERROR lines from the operator during the wedge.

#### Controller pod log — last 5 lines (after which the JVM emitted nothing for 8 hours)

(Identical pattern on all three controllers; `id=` differs.)

```
2026-05-25 00:44:06 WARN  NetworkClient:1147 - [NodeToControllerChannelManager id=3 name=registration] Error connecting to node edbtp-kf-default-edbtp-kf-default-controller-3.edbtp-kf-default-kafka-brokers.transporter-kafka-operator.svc:9090 (id: 3 rack: null isFenced: false)
java.net.UnknownHostException: edbtp-kf-default-edbtp-kf-default-controller-3.edbtp-kf-default-kafka-brokers.transporter-kafka-operator.svc
        at java.base/java.net.InetAddress$CachedLookup.get(Unknown Source)
        at java.base/java.net.InetAddress.getAllByName0(Unknown Source)
        at java.base/java.net.InetAddress.getAllByName(Unknown Source)
        at org.apache.kafka.clients.DefaultHostResolver.resolve(DefaultHostResolver.java:27)
        at org.apache.kafka.clients.ClientUtils.resolve(ClientUtils.java:125)
        at org.apache.kafka.clients.ClusterConnectionStates$NodeConnectionState.resolveAddresses(ClusterConnectionStates.java:536)
        ...
2026-05-25 00:44:07 ERROR ControllerRegistrationManager:77 - [ControllerRegistrationManager id=3 incarnation=AmWCf053RvikDU9KphtiHQ] RegistrationResponseHandler: channel manager timed out before sending the request.
```

That last line was the final stdout line from controller-3 for the next 8 hours.

(Brokers ended the same way, with `[NodeToControllerChannelManager id=N name=heartbeat]` instead of `name=registration`.)

#### Raft state on disk (controller-3) — read directly from the PV after 8 h

```json
{"clusterId":"","leaderId":3,"leaderEpoch":1,"votedId":-1,
 "appliedOffset":0,
 "currentVoters":[{"voterId":3},{"voterId":4},{"voterId":5}],
 "data_version":0}
```

`leaderId` set, but `appliedOffset` is 0 and never advances — the local replicated state machine has never applied a record. Meanwhile the metadata log file `__cluster_metadata-0/00000000000000000000.log` *is* being written (raft heartbeats), so the JVM is genuinely alive — just stuck in the registration state machine.

#### Network/TLS verified healthy (during the wedge)

```
$ kubectl -n <ns> exec <operator-pod> -- openssl s_client \
    -connect <controller-pod>.<brokers-svc>.<ns>.svc:9090 \
    -CAfile ca.crt -cert cluster-operator.crt -key cluster-operator.key
CONNECTED — TLS handshake succeeds, full chain validates

$ kubectl -n <ns> exec <controller-pod> -- getent hosts \
    <controller-pod>.<brokers-svc>.<ns>.svc.cluster.local
<pod-ip>   <controller-pod>.<brokers-svc>.<ns>.svc.cluster.local
```

So at the time of triage the failure was not network, not TLS, and not current DNS — it's the controller registration state machine giving up after the early DNS failure and never retrying, while Strimzi's port-only probe doesn't notice.

### Additional context

#### Recovery

Rolling controllers one-at-a-time with the leader last (identified by reading `quorum-state` directly off the PV) forces a new election; the new epoch comes up clean. Then rolling brokers one-at-a-time gets the broker side to re-register. Recovery time ~5 minutes once you notice. Deleting all controllers at once is unsafe.
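
The roll order can be derived from the `quorum-state` file itself. A minimal Python sketch, assuming the file is the single JSON object shown in the "Raft state on disk" section; `roll_order` is a hypothetical helper name, and the fallback for a missing leader is our own choice:

```python
import json


def roll_order(quorum_state_json: str) -> list[int]:
    """Return voter ids in restart order: followers first, recorded leader last."""
    state = json.loads(quorum_state_json)
    leader = state["leaderId"]
    voters = [v["voterId"] for v in state["currentVoters"]]
    followers = sorted(v for v in voters if v != leader)
    # If no voter is recorded as leader (e.g. leaderId == -1), fall back to plain id order.
    return followers + ([leader] if leader in voters else [])


sample = (
    '{"clusterId":"","leaderId":3,"leaderEpoch":1,"votedId":-1,'
    '"appliedOffset":0,'
    '"currentVoters":[{"voterId":3},{"voterId":4},{"voterId":5}],'
    '"data_version":0}'
)
print(roll_order(sample))  # [4, 5, 3]
```

Each id still has to be rolled one at a time, waiting for the pod to come back before moving to the next; the helper only fixes the order.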

#### Responding to "the probes are not great — any ideas what would work better?"

Kafka offers enough — Strimzi's own KRaft Grafana dashboards (`packaging/examples/metrics/strimzi-metrics-reporter/grafana-dashboards/strimzi-kraft.json`) already consume these:

- `kafka.server:type=raft-metrics, current-state` — string: `leader`, `follower`, `candidate`, `prospective`, `voted`, `unattached`, `resigned`, `observer`
- `kafka.server:type=raft-metrics, current-leader`, `current-epoch`, `high-watermark`, `log-end-offset`
- `kafka.server:type=broker-metadata-metrics, last-applied-record-offset` (broker-side)

Concrete proposal — symmetric with the existing broker `/v1/ready` path:

Add `GET /v1/controller-ready` to the Kafka Agent. Returns 204 when `kafka.server:type=raft-metrics current-state` ∈ `{leader, follower}`, 503 otherwise, 404 if the MBean isn't yet registered (graceful fallback). Then `kafka_readiness.sh` calls this endpoint for controller-only mode after the existing port-listening check.
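
The decision rule is small enough to state as code. This Python sketch only illustrates the mapping described above; it is not the Kafka Agent implementation, and `controller_ready_status` is a hypothetical name. `None` stands in for "MBean not yet registered".

```python
from typing import Optional

# Raft states in which the controller counts as part of the quorum.
READY_STATES = frozenset({"leader", "follower"})


def controller_ready_status(current_state: Optional[str]) -> int:
    """Map the raft `current-state` value to the proposed HTTP status code."""
    if current_state is None:
        # MBean not registered yet: graceful fallback.
        return 404
    if current_state in READY_STATES:
        return 204
    # candidate, prospective, voted, unattached, resigned, observer, ...
    return 503
```

Any state outside `{leader, follower}` maps to 503, including states a future Kafka version might add, so the probe fails closed.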

This is implemented and tested in **strimzi/strimzi-kafka-operator#12768**. Validation covered:

- **14 unit tests** mirroring the existing `testReadiness*` pattern via a `Supplier<String>` test injection point — one case per raft state observed in Kafka 4.x, including `prospective` (the state I first observed during cluster validation, not in code reading).
- **kind cluster (Strimzi 0.51.0 + Kafka 4.2.0)**: Kafka CR reaches `Ready=True`, controllers return 204, brokers' `/v1/ready/` is unchanged.
- **AKS cluster (Azure CNI, Kafka 4.2.0)**: same smoke test plus a `hostAliases`-induced wedge that captured this response body verbatim during the disruption:

   ```
   $ kubectl exec controller-N -- curl -sS http://localhost:8080/v1/controller-ready/ -w '|HTTP %{http_code}'
   {"error":"controller not ready, current raft state: prospective"}|HTTP 503
   ```

   This shows end-to-end that the new endpoint reads the real JMX MBean attribute rather than returning a constant.

#### One caveat I want to be transparent about

The above patch catches the broad class of "raft is not attached" failures. It detects when `current-state` drops out of `{leader, follower}`.

The wedge captured in this incident showed `quorum-state` with `leaderId: 3, leaderEpoch: 1` *persistent* on disk after 8 hours — meaning raft did elect a leader. We don't have direct evidence of what `current-state` MBean reported during the wedge (we recovered the cluster before instrumenting it). It's possible the wedge was one layer above raft — in `ControllerRegistrationManager` — in which case `current-state` would have been `leader` on controller-3 and `follower` on 4/5, and this probe would have returned 204 throughout.

If maintainers want stricter coverage of this specific failure mode, a follow-up adding a `last-applied-record-offset > 0` (or "the offset is advancing") check from `broker-metadata-metrics` would help. Happy to do that as a separate PR.
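
For discussion, the stricter check could look roughly like this. A Python sketch under stated assumptions: the samples are successive readings of `last-applied-record-offset`, oldest first, and `offset_is_progressing` is a hypothetical name. Whether a healthy but idle node always advances this offset between probes is something the follow-up would need to verify.

```python
from typing import Sequence


def offset_is_progressing(samples: Sequence[int]) -> bool:
    """True if the newest sample is positive and larger than the oldest one.

    Needs at least two samples; a single reading cannot show progress.
    """
    if len(samples) < 2:
        return False
    oldest, newest = samples[0], samples[-1]
    return newest > 0 and newest > oldest
```

A probe built on this would have to keep a short history between calls, which the current stateless check does not need.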

Even with that caveat, the PR is a strict improvement on the current probe: it extends "port is listening" to "port is listening and this controller is part of the quorum", which catches a broader class of failures (bad cert, raft partition, future regressions) regardless of whether it catches this specific ET-5871 wedge variant.

#### Apologies on the previous draft

An earlier revision of this issue body was AI-summarized prose without the underlying artifacts attached; sorry about that. This revision follows the bug-report template and pastes the verbatim log lines and CR snippets we captured during triage. We have not done a bare-metal reproduction yet, which we agree would help isolate this as a pure Kafka behavior question; we are happy to invest the time on that if it would move things along.

