Clients do not re-subscribe to all queues after a cluster error #1916
@PaulKVCare FYI, our team does not use issues for questions, and you haven't provided any evidence of a bug: no reproduction steps, no logs, no traffic capture. We do not guess in this community. For example, if you list only one endpoint in the connection parameters, this client has nowhere else to reconnect to: unlike with the stream client, there is no cluster node discovery in the protocol. There is a dedicated doc section on what exactly the connection recovery feature does in this client. Several people have been trying to find a way to reproduce something similar in #1871. No one has provided any specific steps, and the connection recovery feature does work as advertised most of the time, assuming that you list several endpoints when connecting. Automatic connection recovery can also be tripped up by certain topology configurations, famously by exclusive queues with client-provided names, which was mostly mitigated only recently by a RabbitMQ change. There is a natural race condition between RabbitMQ deleting exclusive queues on client connection loss and recovering clients trying to re-declare them with the same name. Without logs from all nodes we cannot tell whether that might be the case.
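To make the point about endpoints concrete, here is a minimal Python sketch of a reconnect loop. It is only a model of the idea, not RabbitMQ.Client code: the function `recover_connection`, its parameters, and the round-robin order are all assumptions made for illustration, and the real client may pick endpoints differently. The 5-second default mirrors the recovery interval quoted elsewhere in this thread.

```python
import itertools
import time
from typing import Callable, Optional, Sequence


def recover_connection(
    endpoints: Sequence[str],
    try_connect: Callable[[str], bool],
    interval_s: float = 5.0,   # assumed to match the 5 s recovery interval
    max_attempts: int = 10,    # hypothetical cap so the sketch terminates
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Cycle through the listed endpoints until one accepts a connection.

    Returns the endpoint that accepted, or None once max_attempts
    attempts have failed. There is no discovery step: only the
    endpoints passed in are ever tried.
    """
    if not endpoints:
        raise ValueError("at least one endpoint is required")
    for attempt, endpoint in enumerate(itertools.cycle(endpoints), start=1):
        if try_connect(endpoint):
            return endpoint
        if attempt >= max_attempts:
            return None
        # Wait a fixed interval before trying the next endpoint.
        sleep(interval_s)
    return None  # not reached: itertools.cycle never ends on a non-empty input
```

With three endpoints listed, a failure of the first two still leaves the third to try. With a single endpoint (for example, one proxy address), every attempt goes back to the same place, which is the limitation described above.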
Yes, actually, I would consider that to be a partition. The short answer is that this library should re-connect successfully in a scenario like the one you describe. Neither I nor other members of Team RabbitMQ have the time to provide free support for a complex scenario such as this, unless you make it very, very easy to reproduce the exact behavior you describe. This project could be used as a starting point, since a Docker-based cluster makes it easy-ish to simulate network issues. What we should be doing is spending time diagnosing your issue, not reproducing it. As @michaelklishin noted, if you make your specific issue easy to reproduce, it could potentially help other users of this library and RabbitMQ, and would be a major contribution to advancing both projects. Thanks.
The default recovery interval is 5 seconds. Connecting "immediately" makes no sense in practice. With some failure types, the client host cannot detect the failure immediately: it can take a minute or more, and there's nothing the client can do about it. At this point we must mention publisher confirms and the fact that in the
Describe the bug
We have an application that is intended for high availability. However, when a network connection failure occurs, RabbitMQ and the clients do not fully recover, and a manual restart of some clients is needed.
To be fair, I am unsure if this is a bug in RabbitMQ.Client. Most likely we are missing something.
Reproduction steps
RabbitMQ node01 and node02 (located on different servers in the same datacenter) lose the connection with each other. Both keep their connection with node03, and all clients keep their connections to the node they are connected to. Since all 3 nodes remain active, HAProxy does not drop connections.
We are unclear what the expected behavior in this case should be, since it is not really a network partition, or at least not one for which you can define a majority. Messages are no longer processed.
The cluster recovers automatically, which seems OK. But some clients do not re-subscribe to all the queues they were subscribed to before and, as a consequence, no longer handle the messages on these queues.
The failure happened during low load. Even so, over a period of slightly more than 1 minute, publishing of 375 messages from 2 clients failed. In addition, 6 of 12 clients needed a restart to be correctly registered as consumers on all their queues again.
Expected behavior
Questions
Additional context
Application setup
RabbitMQ cluster with 3 nodes. All 3 nodes are on different physical servers, 2 in DataCenter01 and 1 in DataCenter02. We are using quorum queues. Approx. 350 exchanges and 300 queues.
The client services are .NET services that use MassTransit with RabbitMQ.Client for publishing and consuming RabbitMQ messages. The clients connect to RabbitMQ through HAProxy.
Versions used
RabbitMQ 4.2.1
Erlang 26.2.5.3
HAProxy 3.0.11
MassTransit 8.5.5
RabbitMQ.Client 7.1.2
.NET 8