PARTCNT change to zero and consuming stops after broker issue #430
Description
Environment Information
- Node Version: 22.21.1
- confluent-kafka-javascript version: 1.6.0
Steps to Reproduce
- Occasionally a Kafka broker in our Kubernetes cluster is shut down. We are not sure whether the shutdown is graceful; it could be caused by node maintenance, a crash, or something else. When it happens, we notice these errors in some of our consumers:
The errors occur in this order:
1. FAIL:Connect to ipv4#x.x.x.x:9093 failed: Connection refused (after 0ms in state CONNECT)
2. BINDING:[LibrdKafkaError: Local: Broker transport failure]
3. REQTMOUT:Timed out 0 in-flight, 0 retry-queued, 7 out-queue, 0 partially-sent requests
4. BINDING:[LibrdKafkaError: Local: Host resolution failure]
5. PARTCNT:[thrd:main]: Topic <redacted> partition count changed from 100 to 0
6. BINDING:[LibrdKafkaError: Local: Unknown partition]
7. BINDING:[LibrdKafkaError: Local: All broker connections are down]
8. BINDING:[LibrdKafkaError: Broker: Unknown topic or partition]
9. SESSTMOUT:[thrd:main]: Consumer group session timed out (in join-state steady) after 20004 ms without a successful response from the group coordinator (broker -1, last error was Success): revoking assignment and rejoining group
10. BINDING:Received rebalance event with message: 'Local: Revoke partitions' and 100 partition(s), isLost: true
After the last error, which reports that a rebalance has revoked all partitions, consuming stops (of course). But why is this happening? We have three Kafka brokers, and all partitions are replicated across all three. This usually works when draining nodes, but sometimes we see this issue: consuming stops completely and we need to restart the process manually.
Number 9 in the list states that the consumer will revoke its assignment and rejoin the group, but it does not seem to rejoin the group so that consuming can continue.
This looks like an edge case that is not handled properly. Do you have any ideas?
I expect the consumer to retry the connection to another broker (which it seems to do correctly after the SESSTMOUT), but why is it not able to continue consuming? Number 10 in the list seems to revoke all partitions from the consumer, and it is also the last error, so no more consuming is done after that. Should it not be able to continue consuming automatically? Or am I missing something here?
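Until the root cause is understood, the manual restart described above could be automated with a small stall watchdog. The following is only a sketch: `ConsumeWatchdog`, `markProgress`, and `onStall` are hypothetical names invented for illustration and are not part of confluent-kafka-javascript. It assumes the application can call `markProgress()` from wherever it handles messages, and that the topic normally receives traffic more often than `maxIdleMs` (otherwise a quiet topic would be mistaken for a stall).

```typescript
// Hypothetical helper, NOT part of confluent-kafka-javascript.
// It only tracks time since the last observed progress and reports a stall.
type StallHandler = (idleMs: number) => void;

class ConsumeWatchdog {
  private readonly maxIdleMs: number;
  private readonly onStall: StallHandler;
  private readonly now: () => number;
  private lastProgressMs: number;
  private timer: ReturnType<typeof setInterval> | undefined;

  // `now` is injectable so the logic can be tested without real timers.
  constructor(maxIdleMs: number, onStall: StallHandler, now: () => number = Date.now) {
    this.maxIdleMs = maxIdleMs;
    this.onStall = onStall;
    this.now = now;
    this.lastProgressMs = now();
    this.timer = undefined;
  }

  // Call this whenever a message has been processed.
  markProgress(): void {
    this.lastProgressMs = this.now();
  }

  // Returns true (and invokes onStall) if no progress was seen for longer than maxIdleMs.
  check(): boolean {
    const idleMs = this.now() - this.lastProgressMs;
    if (idleMs > this.maxIdleMs) {
      this.onStall(idleMs);
      return true;
    }
    return false;
  }

  // Run check() periodically.
  start(intervalMs: number): void {
    this.stop();
    this.timer = setInterval(() => this.check(), intervalMs);
  }

  stop(): void {
    if (this.timer !== undefined) {
      clearInterval(this.timer);
      this.timer = undefined;
    }
  }
}
```

In a Kubernetes deployment, one possible `onStall` handler would log the idle time and exit the process with a non-zero code so that the pod is restarted. This works around the symptom only; it does not explain why the consumer fails to rejoin the group after the SESSTMOUT.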