Describe the bug
We run an application that is intended to be highly available. However, when a network connection failure occurs, RabbitMQ and the clients do not fully recover, and a manual restart of some clients is needed.
To be fair, I am unsure whether this is a bug in RabbitMQ.Client. Most likely we are missing something.
Reproduction steps
RabbitMQ node01 and node02 (located on different servers in the same datacenter) lose their connection with each other. Both keep their connection with node03, and all clients keep their connections to the node they are connected to. Since all 3 nodes remain active, HAProxy does not drop any connections.
We are unclear what the expected behavior should be in this case, since it is not really a network partition, or at least not one for which a majority can be defined. Messages are no longer processed.
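To make the "no majority can be defined" point concrete, here is a minimal sketch of the topology described above. It is a toy model only and says nothing about how RabbitMQ, Khepri, or the quorum queue Raft implementation actually detect or handle this situation; the helper names `reachable_peers` and `sees_majority` are made up for illustration.

```python
NODES = ["node01", "node02", "node03"]
# Assumption taken from the description: only the node01 <-> node02 link is down.
BROKEN_LINKS = {frozenset({"node01", "node02"})}


def reachable_peers(node):
    """Return the nodes this node can reach directly, itself included."""
    return {node} | {
        other
        for other in NODES
        if other != node and frozenset({node, other}) not in BROKEN_LINKS
    }


def sees_majority(node):
    """True if the node directly reaches more than half of the cluster."""
    return len(reachable_peers(node)) > len(NODES) / 2


views = {node: sorted(reachable_peers(node)) for node in NODES}
for node, view in views.items():
    print(node, "sees", view, "majority:", sees_majority(node))
```

In this model every node still reaches 2 or 3 of the 3 nodes, so each one considers itself part of a majority, yet node01 and node02 disagree about which nodes are up. That is why we do not see a clear minority side that could be paused.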
The cluster recovers automatically, which seems OK. But some clients do not re-subscribe to all the queues they were subscribed to before and, as a consequence, no longer handle the messages on these queues.
The failure occurred during a period of low load. Over a period of slightly more than 1 minute, publishing failed for 375 messages from 2 clients. However, 6 of 12 clients needed a restart to be correctly registered as consumers on all their queues again.
Expected behavior
Obviously, our intended behavior is that there is no failure period of a minute and that everything continues as if nothing happened.
Questions
- What is the expected RabbitMQ behavior in case of this specific connection failure between two cluster nodes?
- Which component (we assume RabbitMQ.Client) is responsible for recovering the clients once the failure has been resolved?
Additional context
Application setup
RabbitMQ cluster with 3 nodes. All 3 nodes run on different physical servers, 2 in DataCenter01 and 1 in DataCenter02. We are using quorum queues. Approx. 350 exchanges and 300 queues.
The client services are .NET services that use MassTransit with RabbitMQ.Client for publishing and consuming RabbitMQ messages. The clients connect to RabbitMQ through HAProxy.
Versions used
RabbitMQ 4.2.1
Erlang 26.2.5.3
HAProxy 3.0.11
MassTransit 8.5.5
RabbitMQ.Client 7.1.2
.NET 8