
Clients do not re-subscribe to all queues after a cluster error #1915

@PaulKVCare

Describe the bug

We run an application that is intended to be highly available. However, when a network connection failure occurs, RabbitMQ and the clients do not fully recover, and some clients need a manual restart.

To be fair, I am unsure whether this is a bug in RabbitMQ.Client. Most likely we are missing something.

Reproduction steps

RabbitMQ node01 and node02 (located on different servers in the same datacenter) lose their connection to each other. Both keep their connection to node03, and all clients keep their connections to the node they are connected to. Since all 3 nodes remain active, HAProxy does not drop any connections.
We are unclear what the expected behavior should be in this case, since it is not really a network partition, or at least not one for which a majority can be defined. Messages are no longer processed.

The cluster recovers automatically, which seems OK. But some clients do not re-subscribe to all the queues they were subscribed to before and, as a consequence, no longer handle the messages on these queues.
The failure occurred during a period of low load. Even so, over slightly more than 1 minute, publishing failed for 375 messages from 2 clients. Moreover, 6 of the 12 clients needed a restart to be correctly registered as consumers on all their queues again.
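
To find the queues that lost their consumers after such an incident, a check along the following lines could be used. This is only a minimal sketch in Python and not part of our setup: it assumes the RabbitMQ management plugin is enabled and reachable, and `fetch_queues`, `queues_without_consumers`, the base URL and the credentials are hypothetical names chosen for illustration.

```python
import base64
import json
import urllib.request


def fetch_queues(base_url, user, password):
    """Return the queue list from the management HTTP API (GET /api/queues).

    base_url, user and password are placeholders, e.g. "http://node01:15672".
    """
    request = urllib.request.Request(f"{base_url}/api/queues")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def queues_without_consumers(queues):
    """Return sorted (vhost, name) pairs for queues reporting zero consumers.

    A queue entry without a "consumers" field is treated as having none.
    """
    return sorted(
        (q["vhost"], q["name"]) for q in queues if q.get("consumers", 0) == 0
    )
```

Passing the result of `fetch_queues(...)` to `queues_without_consumers(...)` after the cluster has recovered would list the queues that still need a consumer, which could show which clients to restart before a backlog builds up.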

Expected behavior

Questions

  • What is the expected RabbitMQ behavior in case of this specific connection failure between two cluster nodes?
  • Which component (we assume RabbitMQ.Client) is responsible for recovering the clients once the failure has been resolved?
  • Obviously, our intended/expected behavior is that there is no failure period of a minute and that everything continues as if nothing happened.

Additional context

Application setup

RabbitMQ cluster with 3 nodes. All 3 nodes run on different physical servers, 2 in DataCenter01 and 1 in DataCenter02. We are using quorum queues. Approx. 350 exchanges and 300 queues.
The client services are .NET services that use MassTransit with RabbitMQ.Client for publishing and consuming RabbitMQ messages. The clients connect to RabbitMQ through HAProxy.

Versions used

RabbitMQ 4.2.1
Erlang 26.2.5.3
HAProxy 3.0.11
MassTransit 8.5.5
RabbitMQ.Client 7.1.2
.NET 8
