Clients do not re-subscribe to all queues after a cluster error #1916

PaulKVCare · 2026-03-06T10:38:57Z

PaulKVCare
Mar 6, 2026

Describe the bug

We run an application that is intended for high availability. However, when a network connection failure occurs, RabbitMQ and the clients do not fully recover, and a manual restart of some clients is needed.

To be fair, I am unsure if this is a bug in RabbitMQ.Client. Most likely we are missing something.

Reproduction steps

RabbitMQ node01 and node02 (located on different servers in the same datacenter) lose the connection with each other. Both keep their connection with node03, and all clients keep their connections to the node they are connected to. Since all 3 nodes remain active, HAProxy does not drop connections.
We are unclear what the expected behavior in this case should be, since it is not really a network partition, or at least not one for which you can define a majority. Messages are no longer processed.

The cluster automatically recovers, which seems OK. But some clients do not re-subscribe to all the queues they were subscribed to before and, as a consequence, no longer handle the messages on these queues.
The failure occurred during low load. Even so, over a period of slightly more than 1 minute, publishing of 375 messages from 2 clients failed. In addition, 6 of 12 clients needed a restart to be correctly registered as consumers on all their queues again.

Expected behavior

Questions

What would the expected RabbitMQ behavior be in case of this specific connection failure between two cluster nodes?
Which component (we assume RabbitMQ.Client) is responsible for recovering the clients after the failure has been resolved?
Obviously our intended/expected behavior is that we don't have a failure period of a minute and that everything continues as if nothing happened.

Additional context

Application setup

RabbitMQ cluster with 3 nodes. All 3 nodes run on different physical servers, 2 in DataCenter01 and 1 in DataCenter02. We are using quorum queues. Approx. 350 exchanges and 300 queues.
The client services are .NET services that use MassTransit with RabbitMQ.Client for publishing and consuming RabbitMQ messages. The clients connect to RabbitMQ through HAProxy.

Versions used

RabbitMQ 4.2.1
Erlang 26.2.5.3
Haproxy 3.0.11
MassTransit 8.5.5
RabbitMQ.Client 7.1.2
.Net 8

michaelklishin · 2026-03-06T15:27:24Z

michaelklishin
Mar 6, 2026
Maintainer

@PaulKVCare FYI, our team does not use issues for questions and you haven't provided any evidence of a bug: no reproduction steps, no logs, no traffic capture. We do not guess in this community.

For example, if you list only one endpoint in the connection parameters, this client has nowhere else to reconnect to: unlike with the stream client, there is no cluster node discovery in the protocol.
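To make the endpoint point concrete, here is a language-neutral toy model in Python. It is not the RabbitMQ.Client API: the `reconnect` function, the `is_reachable` callback, and the host names are all hypothetical, and a real client may pick endpoints in a different order. It only shows why a recovery loop that was given a single endpoint has nowhere else to go when that endpoint is unavailable.

```python
from typing import Callable, Optional, Sequence


def reconnect(
    endpoints: Sequence[str],
    is_reachable: Callable[[str], bool],
    max_rounds: int = 3,
) -> Optional[str]:
    """Try each configured endpoint in order, for up to max_rounds rounds.

    Returns the first endpoint that accepts the connection, or None if
    every attempt fails. A real client would wait for its recovery
    interval between rounds; the wait is omitted to keep the model fast.
    """
    for _ in range(max_rounds):
        for endpoint in endpoints:
            if is_reachable(endpoint):
                return endpoint
    return None


# Hypothetical scenario: node01 is unavailable, the other nodes are fine.
down = {"node01"}


def reachable(host: str) -> bool:
    return host not in down


single = reconnect(["node01"], reachable)
several = reconnect(["node01", "node02", "node03"], reachable)
print(single)   # None
print(several)  # node02
```

With one endpoint the loop can only keep retrying the same address; with several, it can move on to a node that is still reachable.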

There is a dedicated doc section on what exactly the connection recovery feature does in this client.

Several people have been trying to find a way to reproduce something similar in #1871. No one has provided any specific steps and the connection recovery feature does work as advertised most of the time, assuming that you list several endpoints when connecting.

Automatic connection recovery can also be tripped up by certain topology configurations, famously by exclusive queues with client-provided names, which was mostly mitigated only recently with a RabbitMQ change. There is a natural race condition between RabbitMQ deleting exclusive queues on client connection loss and recovering clients trying to re-declare them with the same name. Without logs from all nodes we cannot tell if that might be the case.
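The race described above can be sketched with a toy model in Python. This is an illustration under stated assumptions, not RabbitMQ's implementation: `ToyBroker`, `ResourceLocked`, `recover`, and the queue and connection names are hypothetical. The only behavior modeled is that an exclusive queue belongs to one connection, and that a different connection cannot declare a queue with the same name until the broker has cleaned up after the old connection.

```python
class ResourceLocked(Exception):
    """Raised when a queue is exclusively owned by another connection."""


class ToyBroker:
    def __init__(self) -> None:
        # Maps an exclusive queue name to the id of the owning connection.
        self.exclusive_owner: dict[str, str] = {}

    def declare_exclusive(self, queue: str, connection: str) -> None:
        owner = self.exclusive_owner.get(queue)
        if owner is not None and owner != connection:
            raise ResourceLocked(f"{queue} is still owned by {owner}")
        self.exclusive_owner[queue] = connection

    def cleanup_connection(self, connection: str) -> None:
        """Delete every exclusive queue owned by a closed connection."""
        self.exclusive_owner = {
            q: c for q, c in self.exclusive_owner.items() if c != connection
        }


def recover(cleanup_first: bool) -> str:
    """Re-declare the queue on a new connection, before or after cleanup."""
    broker = ToyBroker()
    broker.declare_exclusive("client-named-queue", "conn-old")
    if cleanup_first:
        broker.cleanup_connection("conn-old")
    try:
        broker.declare_exclusive("client-named-queue", "conn-new")
        return "recovered"
    except ResourceLocked:
        return "locked"


print(recover(cleanup_first=True))   # recovered
print(recover(cleanup_first=False))  # locked
```

The outcome depends only on which event happens first, which is why logs from all nodes are needed to tell whether this race was involved.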

1 reply

michaelklishin Mar 6, 2026
Maintainer

Also, we won't be debugging MassTransit behavior. Some similar projects have or have had their own connection recovery (e.g. Spring AMQP used to), which might require you to disable the framework's recovery (or the client's, but then again, we will not troubleshoot MassTransit behavior).

lukebakken · 2026-03-06T15:32:47Z

lukebakken
Mar 6, 2026
Maintainer

RabbitMQ node01 and node02 (both located on different servers in the same datacenter) lose the connection with each other. Both keep the connection with node03.

since it is not really a network partition.

Yes, actually, I would consider that to be a partition.
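As a rough illustration of why a single broken link already splits the cluster's view of itself, here is a small Python sketch. The node names come from the scenario above; the `broken_links` set and the `visible_from` helper are hypothetical. It only computes which peers each node can reach directly and says nothing about how RabbitMQ itself reacts.

```python
nodes = ["node01", "node02", "node03"]
# The one failed link from the scenario: node01 <-> node02.
broken_links = {frozenset({"node01", "node02"})}


def visible_from(node: str) -> set[str]:
    """Nodes that `node` can reach directly, including itself."""
    return {
        other
        for other in nodes
        if other == node or frozenset({node, other}) not in broken_links
    }


majority = len(nodes) // 2 + 1
views = {node: visible_from(node) for node in nodes}
for node, view in views.items():
    status = "majority" if len(view) >= majority else "minority"
    print(node, sorted(view), status)
```

Every node still sees at least 2 of 3 members, yet node01 and node02 disagree about who is in the cluster, so there is no single majority side to point at.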

The short answer is that this library should re-connect successfully in a scenario like what you describe.

Neither I nor other members of Team RabbitMQ have the time to provide free support for a complex scenario such as this, unless you make it very, very easy to reproduce the exact behavior you describe. This project could be used as a starting point, since a docker-based cluster makes it easy-ish to simulate network issues. What we should be doing is spending time diagnosing your issue, not reproducing it.

As @michaelklishin noted, if you make your specific issue easy to reproduce it could potentially help other users of this library and RabbitMQ, and would be a major contribution to advancing both projects. Thanks.

1 reply

PaulKVCare Mar 6, 2026
Author

Thank you both for the quick response.
I was already in doubt whether I should start the question here or with MassTransit. I completely understand your point of view on a request like this. I would have liked to have a reproduction scenario for this as well. The case description was intended to get some direction for getting to that scenario. Both of you actually provided some new information that helps me.

Thank you.

michaelklishin · 2026-03-06T18:11:57Z

michaelklishin
Mar 6, 2026
Maintainer

Obviously our intended/expected behavior is that we don't have a failure period of a minute and that everything continues as if nothing happened

The default recovery interval is 5 seconds. Connecting "immediately" makes no sense in practice. With some failure types, the client host does not detect the failure immediately; it can take a minute or more, and there's nothing the client can do about it.

At this point we must mention publisher confirms and the fact that in the 7.x series there is a new async/await-based way of using them which makes their adoption much easier.
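The idea behind publisher confirms can be sketched with a toy model in Python. This is not the 7.x async/await API: `publish_with_confirms`, `flaky_broker`, and the retry limit are hypothetical. It only shows the principle that the publisher learns, per message, whether the broker accepted it, so failed publishes can be retried or reported instead of being lost silently.

```python
from typing import Callable


def publish_with_confirms(
    messages: list[str],
    broker_confirms: Callable[[int, str], bool],
    max_attempts: int = 3,
) -> tuple[list[str], list[str]]:
    """Publish each message and wait for the broker's ack or nack.

    Returns (confirmed, failed). A message is retried until the broker
    acks it or max_attempts is reached.
    """
    confirmed: list[str] = []
    failed: list[str] = []
    for message in messages:
        for attempt in range(1, max_attempts + 1):
            if broker_confirms(attempt, message):
                confirmed.append(message)
                break
        else:
            # No attempt was acked: report the message instead of dropping it.
            failed.append(message)
    return confirmed, failed


# Hypothetical broker: nacks the first attempt for "b", never accepts "c".
def flaky_broker(attempt: int, message: str) -> bool:
    if message == "b":
        return attempt >= 2
    return message != "c"


confirmed, failed = publish_with_confirms(["a", "b", "c"], flaky_broker)
print(confirmed)  # ['a', 'b']
print(failed)     # ['c']
```

Applied to the incident above, a publisher that waits for confirms would know exactly which messages were not accepted during the failure window.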

0 replies
