Description
Related problem
The operator appears to wait for the full 5m timeout before concluding that the pending pod it manually rolled did not become Ready. This adds significant latency to reconciliation in cases where we already know the pod can never become Ready, because it references a PVC that does not exist. See the logs from rolling one of these pods below:
2024-02-15 06:36:30 INFO AbstractOperator:265 - Reconciliation #123(watch) Kafka(kafka/cluster-13000): Kafka cluster-13000 will be checked for creation or modification
2024-02-15 06:36:30 WARN AbstractOperator:557 - Reconciliation #102(timer) Kafka(kafka/cluster-13000): Failed to reconcile
io.strimzi.operator.common.operator.resource.TimeoutException: Exceeded timeout of 300000ms while waiting for Pods resource cluster-13000-zookeeper-0 in namespace kafka to be ready
at io.strimzi.operator.common.VertxUtil$1.lambda$handle$1(VertxUtil.java:126) ~[io.strimzi.operator-common-0.39.0.jar:0.39.0]
at io.vertx.core.impl.future.FutureImpl$4.onFailure(FutureImpl.java:188) ~[io.vertx.vertx-core-4.5.0.jar:4.5.0]
at io.vertx.core.impl.future.FutureBase.lambda$emitFailure$1(FutureBase.java:75) ~[io.vertx.vertx-core-4.5.0.jar:4.5.0]
at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173) ~[io.netty.netty-common-4.1.100.Final.jar:4.1.100.Final]
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166) ~[io.netty.netty-common-4.1.100.Final.jar:4.1.100.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470) ~[io.netty.netty-common-4.1.100.Final.jar:4.1.100.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569) ~[io.netty.netty-transport-4.1.100.Final.jar:4.1.100.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[io.netty.netty-common-4.1.100.Final.jar:4.1.100.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[io.netty.netty-common-4.1.100.Final.jar:4.1.100.Final]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty.netty-common-4.1.100.Final.jar:4.1.100.Final]
at java.lang.Thread.run(Thread.java:840) ~[?:?]
2024-02-15 06:36:30 WARN ZookeeperLeaderFinder:248 - Reconciliation #123(watch) Kafka(kafka/cluster-13000): ZK cluster-13000-zookeeper-0.cluster-13000-zookeeper-nodes.kafka.svc.cluster.local:2181: failed to connect to zookeeper:
Suggested solution
Ideally, we could short-circuit this operation. As with initial pod startup, it would be nice if, when pods are manually rolled, the operator immediately recognized that a pod referencing a non-existent PVC is never going to become Ready, bailed out early, and allowed Strimzi to proceed to create the PVC for the new pod.
In theory, this would not even require a separate loop: the waiter that polls for the new pod to become Ready could itself check whether the referenced PVC exists on the cluster and fail fast if it does not.
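A minimal sketch of what that check could look like. This is not Strimzi's actual readiness waiter; the `PvcVolume` record and the `existingPvcs` set are simplified stand-ins for the Kubernetes model classes and for a live PVC lookup against the API server.

```java
import java.util.List;
import java.util.Set;

// Sketch of the proposed short-circuit: while waiting for a rolled pod to
// become Ready, check whether every PVC it references actually exists.
// If one is missing, the pod can never become Ready, so the waiter can
// fail fast instead of running out its full timeout.
public class PvcReadinessCheck {

    // Stand-in for a pod volume that references a PersistentVolumeClaim by name.
    record PvcVolume(String claimName) {}

    /**
     * Returns the name of the first referenced PVC that does not exist on the
     * cluster, or null if all referenced PVCs are present.
     */
    static String findMissingPvc(List<PvcVolume> volumes, Set<String> existingPvcs) {
        for (PvcVolume v : volumes) {
            if (!existingPvcs.contains(v.claimName())) {
                return v.claimName();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<PvcVolume> podVolumes = List.of(new PvcVolume("data-cluster-13000-zookeeper-0"));
        Set<String> pvcsOnCluster = Set.of(); // the referenced PVC has been deleted
        String missing = findMissingPvc(podVolumes, pvcsOnCluster);
        if (missing != null) {
            System.out.println("short-circuit: PVC " + missing
                + " does not exist, pod will never become Ready");
        }
    }
}
```

In the real waiter this check would run on each poll iteration, so the PVC could still be created concurrently without the waiter failing spuriously.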
Alternatives
No response
Additional context
No response