Description
I am seeing the following intermittent problem.
There are several different .NET applications using confluent-kafka-dotnet (different versions).
From time to time (maybe several times a day, maybe once a week), consumers stop reading messages from Kafka.
The consumer does not throw any exception; it simply returns null when the .consume() method is called.
All the consumers in the consumer group stop reading at once.
On the broker side, the consumers leave the group and the group switches to the EMPTY state.
Applications are deployed as pods in K8S.
The only thing that helps is monitoring the lag and manually restarting the pods.
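The "monitor the lag, then restart" workaround could in principle be automated with a small stall check. The following is only a language-agnostic sketch of that logic, written in Python for brevity; it is not part of confluent-kafka-dotnet, and `StallWatchdog`, `max_idle_seconds`, `record_poll` and `is_stalled` are hypothetical names. It assumes the application can obtain the current lag from whatever monitoring is already in place.

```python
import time


class StallWatchdog:
    """Flags a consumer as stalled when lag exists but no message arrived recently."""

    def __init__(self, max_idle_seconds, clock=time.monotonic):
        # max_idle_seconds is a hypothetical threshold chosen by the operator.
        self.max_idle_seconds = max_idle_seconds
        self._clock = clock  # injectable so the logic can be tested without sleeping
        # Treat start-up as the last moment progress was made.
        self._last_message_at = clock()

    def record_poll(self, message):
        """Call after every poll; message is None when the poll returned nothing."""
        if message is not None:
            self._last_message_at = self._clock()

    def is_stalled(self, lag):
        """True if messages are waiting (lag > 0) but none arrived within the threshold."""
        idle = self._clock() - self._last_message_at
        return lag > 0 and idle > self.max_idle_seconds
```

In a pod, a check like `is_stalled` might back a liveness probe so that Kubernetes restarts the container instead of a human doing it. This only automates the workaround and does not address the underlying cause.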
What we tried that didn't help:
- Updating brokers and client libraries
- Changing session timeouts (6-30 seconds)
- Using a static member id
- Reading only one topic per consumer group
- Calling .consume() with a CancellationToken and with a TimeSpan
I tried to reproduce the problem by simulating network failures, running Kafka and the consumer locally in Docker containers. I turned off the network on the Docker side. Error messages appeared in the consumer logs, and once the network was restored, reading resumed.
I don't see this problem in other programming languages.
Checklist:
- Apache Kafka 2.6.3 without authorization on virtual machines and Strimzi Kafka in k8s with keycloak authorization
- confluent-kafka-dotnet 2.8.0 and newer
- k8s