Bug Description
We are testing the auto-rebalance feature, and in rare cases it does nothing.
The issue is very similar to #11195.
The only difference is that the old issue is about manual rebalancing, while this one is about automatic rebalancing.
Steps to reproduce
- Create a Kafka cluster with auto-rebalance configured for adding and removing brokers:
  spec:
    cruiseControl:
      autoRebalance:
        - mode: add-brokers
        - mode: remove-brokers
- Scale up the Kafka cluster.
- Strimzi detects that add_broker should be called and sends the request to Cruise Control (CC).
- CC is still in its rolling restart phase, so in an unlucky case the old CC instance receives the request. It has no capacity information for the new broker, so it responds with a 500:
'com.linkedin.kafka.cruisecontrol.exception.KafkaCruiseControlException: java.lang.NullPointerException: Cannot invoke "com.linkedin.kafka.cruisecontrol.config.BrokerCapacityInfo.capacity()"
- After the error, Strimzi updates the KafkaRebalance resource to the NotReady state:
ERROR KafkaRebalanceAssemblyOperator:403 - Reconciliation #1287(watch) KafkaRebalance(...):Status updated to [NotReady] due to error:...
- If the KafkaRebalance is in the NotReady state and scalingNodes does not change in the meantime (that is, you do not scale further), the KafkaRebalance is simply deleted by KafkaAutoRebalancingReconciler, and the rebalance never happens.
Expected behavior
This issue is not permanent; as mentioned in the steps, it seems to be rare.
Should Strimzi retry this kind of failure? With automatic rebalancing, it is very odd that my rebalance never happens because of a timing issue. Obviously I can create the rebalance manually, but avoiding that is exactly why I use the automatic one.
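To illustrate the kind of retry meant here, below is a minimal sketch of a bounded retry with exponential backoff. It is not Strimzi code: `RetryingCall`, `TransientException`, and `callWithRetry` are hypothetical names, and it assumes the caller can classify the CC 500 response as a transient failure.

```java
import java.time.Duration;
import java.util.concurrent.Callable;

/** Hypothetical helper for illustration only; not part of Strimzi or Cruise Control. */
public final class RetryingCall {

    /** Hypothetical marker for failures that are expected to clear on their own. */
    public static class TransientException extends Exception {
        public TransientException(String message) {
            super(message);
        }
    }

    private RetryingCall() {
    }

    /**
     * Runs the action, retrying only on TransientException.
     * The delay doubles after each failed attempt; any other exception
     * is propagated immediately without a retry.
     */
    public static <T> T callWithRetry(Callable<T> action, int maxAttempts, Duration initialDelay)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Duration delay = initialDelay;
        for (int attempt = 1; ; attempt++) {
            try {
                return action.call();
            } catch (TransientException e) {
                if (attempt >= maxAttempts) {
                    // Out of attempts: surface the last transient failure.
                    throw e;
                }
                Thread.sleep(delay.toMillis());
                delay = delay.multipliedBy(2);
            }
        }
    }
}
```

With something along these lines, a request that hits the old CC instance during the rolling restart would be repeated a few times instead of leaving the rebalance in a state where it is deleted and never recreated. Whether the retry belongs in the HTTP call or in the reconciliation loop is a design decision for the maintainers.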
Strimzi version
0.45
Kubernetes version
1.28
Installation method
Helm chart
Infrastructure
k3s
Configuration files and logs
No response
Additional context
No response