Please keep in mind that Strimzi does not support Istio, so there is no Strimzi-approved way to handle anything related to it. While I think there are a few users running it in some way, the general advice would be to avoid it. I would be quite curious whether you can reproduce the problem without Istio. You suggest the service pointing to the controllers is the issue, but I have never seen any problems with it in standard Strimzi deployments and I'm not aware of anyone reporting it. So I wonder if we missed it: you did not share a proper log, but the error you shared seems like something ephemeral and easy for the client to recover from, so it might be easy to miss if it does not cause any real issues. Sadly, the PR that changed it from bootstrap to brokers a long time ago does not cover the details. But reading between the lines, it seems the issue was DNS resolution, which is harder to recover from than one node not responding.
---
Deploying Kafka with Istio enabled caused issues in service discovery and network resolution. Without the Istio sidecar, all broker IPs resolved correctly through Kubernetes DNS, and Cruise Control connected to all brokers normally. With the sidecar injected, DNS queries returned only controller node IPs, indicating that Istio filtered or limited DNS responses. Cruise Control also experienced misrouted traffic: requests to port 9091 were redirected to controller pods, even though those pods don't expose that port. This suggests that Istio's service routing or DNS capture logic may be incorrectly mapping the service endpoints. Disabling the sidecar resolved both issues. These results show that Istio's sidecar proxy interferes with Kafka's DNS resolution and port routing, likely requiring adjustments such as disabling DNS capture or excluding Kafka workloads from Istio's control; a sketch of what that could look like is below.
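To make those two adjustments concrete, here is a minimal, untested sketch using Strimzi's pod template. It assumes a standard `Kafka` resource named `my-cluster` (an example name) and uses standard Istio annotations; whether either option is right for a given mesh is something to validate, not a recommendation:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster                     # example name
spec:
  kafka:
    template:
      pod:
        metadata:
          annotations:
            # Option 1: keep the Kafka pods out of the mesh entirely
            sidecar.istio.io/inject: "false"
            # Option 2 (alternative): keep the sidecar but disable
            # Istio's DNS proxying for these pods:
            # proxy.istio.io/config: |
            #   proxyMetadata:
            #     ISTIO_META_DNS_CAPTURE: "false"
```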
---
We are facing a connectivity issue between Cruise Control and the Kafka brokers when using Strimzi with Istio and Cilium in our Kubernetes cluster.
**Environment**

**Symptom**
Cruise Control itself can connect to the cluster correctly using the `STRIMZI_KAFKA_BOOTSTRAP_SERVERS` env var pointing to `{cluster}-kafka-bootstrap:9091` (TLS/SASL connection on 9091). This part works fine and Cruise Control is up and running.

The problem appears in the Cruise Control Metrics Reporter running inside each broker. Instead of using the same `{cluster}-kafka-bootstrap:9091` service, the metrics reporter intentionally targets the headless service `{cluster}-kafka-brokers:9091`. In our environment, the metrics reporter produces continuous INFO/WARN logs like:
As a result, the `CruiseControlMetricsReporter` is never able to successfully produce metrics to the Strimzi Cruise Control metrics topics.
**Behaviour in Strimzi code**

Looking at `KafkaBrokerConfigurationBuilder`, the behaviour seems intentional:

`strimzi-kafka-operator/cluster-operator/src/main/java/io/strimzi/operator/cluster/model/KafkaBrokerConfigurationBuilder.java`, lines 115 to 117 in `bada1ec`.
The comment in the code explains that the metrics reporter uses the brokers headless service instead of the bootstrap service because the Admin client is not able to connect to pods behind the bootstrap service when they are not ready during startup. However, we are unsure whether this behaviour is still beneficial: if Cruise Control cannot collect metrics from the brokers reliably, it may lead to operational problems such as `NotEnoughValidWindowsException`. This exception impacts Cruise Control's ability to make balancing decisions, so maintaining reliable metric reporting is critical. Perhaps the direct broker connection is still useful for collecting some metadata, such as the number of brokers or partitions, but IMHO being unable to produce the actual metrics to the metrics topic is problematic and limits Cruise Control's functionality.

**What seems to be happening**
Because we are using KRaft, the `{cluster}-kafka-brokers` headless service points to both broker and controller pods by default. When the Cruise Control Metrics Reporter resolves `{cluster}-kafka-brokers:9091`, it sometimes targets controllers instead of brokers. Combined with Istio + Cilium, this ends up causing failed TLS/SASL connections and repeated rebootstrapping, and metrics are never successfully produced. The illustrative sketch below shows how controller IPs end up behind the service.
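Purely for illustration, this is the shape of the `Endpoints` object such a headless Service produces in a KRaft cluster; the cluster name, pod IPs, and port name are made up, but it shows why a client can land on a controller that has no listener on 9091:

```yaml
apiVersion: v1
kind: Endpoints
metadata:
  name: my-cluster-kafka-brokers     # example cluster name
subsets:
  - addresses:
      - ip: 10.0.1.11                # broker pod, listens on 9091
      - ip: 10.0.1.12                # broker pod, listens on 9091
      - ip: 10.0.2.21                # controller pod, no listener on 9091
    ports:
      - name: tcp-replication        # assumed port name
        port: 9091
```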
From our tests:

If we temporarily edit the `{cluster}-kafka-brokers` Service and add a selector/`matchLabels` so that it only selects the broker pods (for example using `strimzi.io/broker-role: "true"`; see the sketch below), the Cruise Control Metrics Reporter starts working correctly and is able to send metrics for a few minutes. However, during the next Strimzi reconciliation, the operator restores the Service definition and removes our manual label changes, so the headless service again includes both brokers and controllers and the issue returns.
So the functional workaround (a brokers-only headless service) is not stable, because reconciliation reverts it.
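For reference, a rough sketch of that temporary, operator-reverted edit; the cluster name, the extra selector labels, and the port definition are assumptions based on a standard Strimzi deployment, not an endorsed configuration:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-cluster-kafka-brokers         # example cluster name
spec:
  clusterIP: None                        # headless
  selector:
    strimzi.io/cluster: my-cluster
    strimzi.io/kind: Kafka
    strimzi.io/broker-role: "true"       # restrict endpoints to broker pods
  ports:
    - name: tcp-replication              # assumed port name for 9091
      port: 9091
      targetPort: 9091
```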
**Question / Request**
- Is it still required or recommended for the Cruise Control Metrics Reporter Admin client to use the `{cluster}-kafka-brokers` service instead of the bootstrap service? Could this be made configurable?
- In KRaft mode, would it make sense for the `{cluster}-kafka-brokers` headless Service to select only broker pods by default, excluding controllers?
- Is there an upstream-approved way to override the metrics reporter bootstrap service or customise the brokers headless service selector in a way that survives reconciliation?
Also, we've noticed that Kafka controller pods do not have port 9091 enabled, so it does not make sense (IMHO) for the Cruise Control Metrics Reporter to try connecting to controllers through the `{cluster}-kafka-brokers` headless service on this port.
{cluster}-kafka-brokerscombined withIstio + Cilium?One possible workaround we are considering is creating a Kubernetes NetworkPolicy that blocks port 9091 traffic to controller pods. This could prevent Cruise Control Metrics Reporter from attempting connections to controllers on that port, effectively forcing it to only connect to brokers.
We have other Strimzi clusters without Istio and Cilium, and they work well.
Any guidance on the “Strimzi‑approved” way to deal with this (or whether this is considered a bug vs. expected behaviour) would be very helpful.
Thank you