PARTCNT changes to zero and consuming stops after broker issue #430

@larsha

Description

Environment Information

  • Node Version: 22.21.1
  • confluent-kafka-javascript version: 1.6.0

Steps to Reproduce

  • Sometimes a Kafka broker in our Kubernetes cluster shuts down. We're not sure whether the shutdown is graceful; it could be caused by node maintenance, a crash, or something else. When that happens, some of our consumers log the errors below.

The errors occur in this order:

  1. FAIL: Connect to ipv4#x.x.x.x:9093 failed: Connection refused (after 0ms in state CONNECT)
  2. BINDING: [LibrdKafkaError: Local: Broker transport failure]
  3. REQTMOUT: Timed out 0 in-flight, 0 retry-queued, 7 out-queue, 0 partially-sent requests
  4. BINDING: [LibrdKafkaError: Local: Host resolution failure]
  5. PARTCNT: [thrd:main]: Topic <redacted> partition count changed from 100 to 0
  6. BINDING: [LibrdKafkaError: Local: Unknown partition]
  7. BINDING: [LibrdKafkaError: Local: All broker connections are down]
  8. BINDING: [LibrdKafkaError: Broker: Unknown topic or partition]
  9. SESSTMOUT: [thrd:main]: Consumer group session timed out (in join-state steady) after 20004 ms without a successful response from the group coordinator (broker -1, last error was Success): revoking assignment and rejoining group
  10. BINDING: Received rebalance event with message: 'Local: Revoke partitions' and 100 partition(s), isLost: true
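To alert on item 5 specifically, the log message could be matched with something like the sketch below. It only assumes the message wording shown in the list above; `parsePartitionCountChange` is a hypothetical helper of ours, not part of librdkafka or confluent-kafka-javascript.

```javascript
// Hypothetical helper (not a library API): extracts the old and new partition
// counts from a PARTCNT log message worded like item 5 in the list above.
const PARTCNT_RE = /partition count changed from (\d+) to (\d+)/;

function parsePartitionCountChange(message) {
  const match = PARTCNT_RE.exec(message);
  if (!match) return null; // not a PARTCNT message
  const from = Number(match[1]);
  const to = Number(match[2]);
  // A drop from a positive count to zero is the suspicious case in this issue.
  return { from, to, droppedToZero: from > 0 && to === 0 };
}
```

A `droppedToZero` result would be the point at which to raise an alert, since a topic with 100 partitions should never report 0.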

The last message (item 10) says the consumer has rebalanced and all 100 partitions have been revoked. After that, consuming stops, as you would expect with no partitions assigned. But why is this happening? We have three Kafka brokers, and all partitions are replicated across all three. Failover usually works, for example when draining nodes, but occasionally we hit this issue. When we do, consuming stops completely and we have to restart the process manually.

Item 9 in the list says the consumer is revoking its assignment and rejoining the group, but it doesn't seem to rejoin. Shouldn't it rejoin so that consuming can continue?

This looks like an edge case that isn't handled properly. Do you have any ideas?

I expect the consumer to retry the connection against another broker, which it seems to do correctly after the SESSTMOUT. So why can't it continue consuming? Item 10 revokes all partitions from the consumer, and it is also the last message logged; nothing is consumed after that. Shouldn't the consumer be able to resume automatically, or am I missing something here?
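As a stopgap for the manual restarts, a watchdog along the following lines could detect the stuck state. This is only a sketch: `StallWatchdog`, `setAssignedCount` and `isStalled` are hypothetical names of ours and not part of confluent-kafka-javascript, and it assumes the application can see how many partitions it currently holds, for example from the rebalance events shown in item 10.

```javascript
// Hypothetical stall watchdog (not a library API). It reports a stall when the
// consumer has held zero partitions for at least maxIdleMs milliseconds.
class StallWatchdog {
  constructor(maxIdleMs, now = Date.now) {
    this.maxIdleMs = maxIdleMs;
    this.now = now; // injectable clock, which makes the class easy to test
    this.assignedCount = 0;
    this.emptySince = now(); // no partitions are assigned at startup
  }

  // Call this whenever the application learns its current assignment size,
  // e.g. 0 after a 'Revoke partitions' event with isLost: true.
  setAssignedCount(count) {
    const wasEmpty = this.assignedCount === 0;
    this.assignedCount = count;
    if (count === 0 && !wasEmpty) {
      this.emptySince = this.now(); // start timing the empty period
    }
  }

  isStalled() {
    return (
      this.assignedCount === 0 &&
      this.now() - this.emptySince >= this.maxIdleMs
    );
  }
}
```

The process could poll `isStalled()` on a timer and exit with a non-zero code when it returns true, so that Kubernetes restarts the pod. That would only automate the restart we do by hand today; it would not explain why the consumer never rejoins the group.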
