Description
I'm trying to set up load-balanced paths over two 1 GbE links on a two-node cluster.
Versions:
- RHEL 9.4
- kernel: 5.14.0-427.50.1.el9_4
- kmod: 9.2.12
- utils: 9.30.0
My setup consists of three DRBD-replicated resources called mysql, pgsql and nfs. Load balancing seems to work fine until a link goes down. When that happens, the resources drop to Connecting state and most of the time fail to reconnect over the remaining link. Sometimes some of the resources manage to reconnect, but usually not all of them.
Here are my resource files:
mysql.res.txt
pgsql.res.txt
nfs.res.txt
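For reference, a minimal sketch of the kind of two-path, load-balanced configuration I'm aiming for is below (device paths, ports, the 192.168.61.x network and the node2 addresses are placeholders for illustration, not copied from the attached files):

resource mysql {
  net {
    transport tcp;
    load-balance-paths yes;
  }

  on node1.example.net {
    node-id 1;
    volume 0 {
      device    /dev/drbd0;
      disk      /dev/vg0/mysql;   # placeholder backing device
      meta-disk internal;
    }
  }
  on node2.example.net {
    node-id 2;
    volume 0 {
      device    /dev/drbd0;
      disk      /dev/vg0/mysql;   # placeholder backing device
      meta-disk internal;
    }
  }

  connection {
    # first 1 GbE link
    path {
      host node1.example.net address 192.168.61.2:7789;
      host node2.example.net address 192.168.61.3:7789;
    }
    # second 1 GbE link (ens15f2)
    path {
      host node1.example.net address 192.168.62.2:7789;
      host node2.example.net address 192.168.62.3:7789;
    }
  }
}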
When all links are up, everything works fine:
[root@node1 ~]# drbdadm status
mysql role:Primary
disk:UpToDate open:yes
node2.example.net role:Secondary
peer-disk:UpToDate
nfs role:Primary
volume:2 disk:UpToDate open:yes
node2.example.net role:Secondary
volume:2 peer-disk:UpToDate
pgsql role:Secondary
volume:1 disk:UpToDate open:no
node2.example.net role:Primary
volume:1 peer-disk:UpToDate
If I shut down a link, the resources drop to Connecting state, which I suppose is to be expected:
[root@node1 ~]# ip link set dev ens15f2 down
[root@node1 ~]# drbdadm status
mysql role:Primary
disk:UpToDate open:yes
node2.example.net connection:Connecting
nfs role:Primary
volume:2 disk:UpToDate open:yes
node2.example.net connection:Connecting
pgsql role:Secondary
volume:1 disk:UpToDate open:no
node2.example.net connection:Connecting
However, the resources do not reliably reconnect; this time only one of them managed to reconnect, and sometimes none of them do. I would expect all of them to reconnect over the remaining link and just lose some bandwidth.
[root@node1 ~]# drbdadm status
mysql role:Primary
disk:UpToDate open:yes
node2.example.net connection:Connecting
nfs role:Primary
volume:2 disk:UpToDate open:yes
node2.example.net role:Secondary
volume:2 peer-disk:UpToDate
pgsql role:Secondary
volume:1 disk:UpToDate open:no
node2.example.net connection:Connecting
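For what it's worth, a stuck resource can be prodded manually with a plain disconnect/connect cycle, which at least restarts the connection attempt (shown here for mysql), but my expectation is that DRBD should fail over to the remaining path on its own:

[root@node1 ~]# drbdadm disconnect mysql
[root@node1 ~]# drbdadm connect mysql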
dmesg has the following logging:
[57154.551877] drbd nfs node2.example.net: meta connection shut down by peer.
[57154.552341] drbd nfs node2.example.net: conn( Connected -> NetworkFailure ) peer( Secondary -> Unknown )
[57154.552943] drbd nfs/2 drbd2 node2.example.net: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
[57154.553924] drbd nfs node2.example.net: Terminating sender thread
[57154.554402] drbd nfs node2.example.net: Starting sender thread (peer-node-id 2)
[57154.596904] drbd nfs/2 drbd2: new current UUID: A1497F592DF06783 weak: FFFFFFFFFFFFFFFD
[57154.616443] drbd nfs node2.example.net: Connection closed
[57154.616958] drbd nfs node2.example.net: helper command: /sbin/drbdadm disconnected
[57154.636357] drbd nfs node2.example.net: helper command: /sbin/drbdadm disconnected exit code 0
[57154.636798] drbd nfs node2.example.net: conn( NetworkFailure -> Unconnected ) [disconnected]
[57154.637213] drbd nfs node2.example.net: Restarting receiver thread
[57154.637623] drbd nfs node2.example.net: conn( Unconnected -> Connecting ) [connecting]
[57159.663869] drbd pgsql node2.example.net: meta connection shut down by peer.
[57159.664311] drbd pgsql node2.example.net: conn( Connected -> NetworkFailure ) peer( Primary -> Unknown )
[57159.664722] drbd pgsql/1 drbd1: disk( UpToDate -> Consistent )
[57159.665317] drbd pgsql/1 drbd1 node2.example.net: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
[57159.665927] drbd pgsql/1 drbd1: Enabling local AL-updates
[57159.666822] drbd pgsql node2.example.net: Terminating sender thread
[57159.667339] drbd pgsql node2.example.net: Starting sender thread (peer-node-id 2)
[57159.685599] drbd pgsql: Preparing cluster-wide state change 1014439734: 1->all empty
[57159.686137] drbd pgsql: Committing cluster-wide state change 1014439734 (1ms)
[57159.686514] drbd pgsql/1 drbd1: disk( Consistent -> UpToDate ) [lost-peer]
[57159.710797] drbd pgsql node2.example.net: Connection closed
[57159.711201] drbd pgsql node2.example.net: helper command: /sbin/drbdadm disconnected
[57159.730586] drbd pgsql node2.example.net: helper command: /sbin/drbdadm disconnected exit code 0
[57159.731104] drbd pgsql node2.example.net: conn( NetworkFailure -> Unconnected ) [disconnected]
[57159.731512] drbd pgsql node2.example.net: Restarting receiver thread
[57159.731967] drbd pgsql node2.example.net: conn( Unconnected -> Connecting ) [connecting]
[57159.732398] drbd pgsql node2.example.net: Configured local address not found, retrying every 10 sec, err=-99
[57161.304690] drbd mysql node2.example.net: PingAck did not arrive in time.
[57161.305188] drbd mysql node2.example.net: conn( Connected -> NetworkFailure ) peer( Secondary -> Unknown )
[57161.305668] drbd mysql/0 drbd0 node2.example.net: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
[57161.306430] drbd mysql node2.example.net: Terminating sender thread
[57161.306935] drbd mysql node2.example.net: Starting sender thread (peer-node-id 2)
[57161.333782] drbd mysql/0 drbd0: new current UUID: C9C9B8ABE47DAFD7 weak: FFFFFFFFFFFFFFFD
[57161.334279] drbd mysql node2.example.net: Connection closed
[57161.334698] drbd mysql node2.example.net: helper command: /sbin/drbdadm disconnected
[57161.336153] drbd mysql node2.example.net: helper command: /sbin/drbdadm disconnected exit code 0
[57161.336668] drbd mysql node2.example.net: conn( NetworkFailure -> Unconnected ) [disconnected]
[57161.337245] drbd mysql node2.example.net: Restarting receiver thread
[57161.337770] drbd mysql node2.example.net: conn( Unconnected -> Connecting ) [connecting]
[57161.338269] drbd mysql node2.example.net: Configured local address not found, retrying every 10 sec, err=-99
[57165.265845] drbd nfs node2.example.net: Handshake to peer 2 successful: Agreed network protocol version 122
[57165.266197] drbd nfs node2.example.net: Feature flags enabled on protocol level: 0x7f TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES RESYNC_DAGTAG
[57165.286681] drbd nfs: Preparing cluster-wide state change 3983717247: 1->2 role( Primary ) conn( Connected )
[57165.311689] drbd nfs/2 drbd2 node2.example.net: drbd_sync_handshake:
[57165.311989] drbd nfs/2 drbd2 node2.example.net: self A1497F592DF06783:8E6AF0CEB80F0099:4E70541A45A634F4:B6EB4E9DF4335DD4 bits:1 flags:120
[57165.312581] drbd nfs/2 drbd2 node2.example.net: peer 8E6AF0CEB80F0098:0000000000000000:B6EB4E9DF4335DD4:1AE271DBB86B2064 bits:0 flags:1120
[57165.313460] drbd nfs/2 drbd2 node2.example.net: uuid_compare()=source-use-bitmap by rule=bitmap-self
[57165.332761] drbd nfs: State change 3983717247: primary_nodes=2, weak_nodes=FFFFFFFFFFFFFFF9
[57165.333139] drbd nfs: Committing cluster-wide state change 3983717247 (46ms)
[57165.333539] drbd nfs node2.example.net: conn( Connecting -> Connected ) peer( Unknown -> Secondary ) [connected]
[57165.333964] drbd nfs/2 drbd2 node2.example.net: pdsk( DUnknown -> Consistent ) repl( Off -> WFBitMapS ) [connected]
[57165.341148] drbd nfs/2 drbd2 node2.example.net: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 28(1), total 28; compression: 100.0%
[57165.341931] drbd nfs/2 drbd2 node2.example.net: pdsk( Consistent -> Outdated ) [peer-state]
[57165.379881] drbd nfs/2 drbd2 node2.example.net: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 28(1), total 28; compression: 100.0%
[57165.380567] drbd nfs/2 drbd2 node2.example.net: helper command: /sbin/drbdadm before-resync-source
[57165.399999] drbd nfs/2 drbd2 node2.example.net: helper command: /sbin/drbdadm before-resync-source exit code 0
[57165.400421] drbd nfs/2 drbd2 node2.example.net: pdsk( Outdated -> Inconsistent ) repl( WFBitMapS -> SyncSource ) [receive-bitmap]
[57165.401386] drbd nfs/2 drbd2 node2.example.net: Began resync as SyncSource (will sync 4 KB [1 bits set]).
[57165.704333] drbd nfs/2 drbd2 node2.example.net: updated UUIDs A1497F592DF06783:0000000000000000:8E6AF0CEB80F0098:4E70541A45A634F4
[57165.760651] drbd nfs/2 drbd2 node2.example.net: Resync done (total 1 sec; paused 0 sec; 4 K/sec)
[57165.761224] drbd nfs/2 drbd2 node2.example.net: pdsk( Inconsistent -> UpToDate ) repl( SyncSource -> Established ) [resync-finished]
If I then restore the link, the resources connect and sync normally:
[root@node1 ~]# ip link set dev ens15f2 up
[root@node1 ~]# drbdadm status
mysql role:Primary
disk:UpToDate open:yes
node2.example.net role:Secondary
peer-disk:UpToDate
nfs role:Primary
volume:2 disk:UpToDate open:yes
node2.example.net role:Secondary
volume:2 peer-disk:UpToDate
pgsql role:Secondary
volume:1 disk:UpToDate open:no
node2.example.net role:Primary
volume:1 peer-disk:UpToDate
Am I missing some configuration here, or is this a bug? One thing I noticed is that once the link goes down, NetworkManager removes the IP address from the NIC, and dmesg then logs entries about the missing address (err=-99 is -EADDRNOTAVAIL, i.e. the configured local address no longer exists on the host):
DMESG:
Configured local address not found, retrying every 10 sec, err=-99
UP:
8: ens15f2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 6c:92:cf:0f:bc:54 brd ff:ff:ff:ff:ff:ff
altname enp108s0f2
inet 192.168.62.2/29 brd 192.168.62.7 scope global noprefixroute ens15f2
valid_lft forever preferred_lft forever
inet6 fe80::6e92:cfff:fe0f:bc54/64 scope link
valid_lft forever preferred_lft forever
DOWN:
8: ens15f2: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN group default qlen 1000
link/ether 6c:92:cf:0f:bc:54 brd ff:ff:ff:ff:ff:ff
altname enp108s0f2
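If the removed address is what keeps DRBD from rebinding, one possible workaround would be telling NetworkManager to keep the static configuration on the replication NIC even when carrier is lost. A minimal sketch, assuming a drop-in along these lines (file and section names are arbitrary, and I haven't confirmed that this changes the reconnect behaviour):

# /etc/NetworkManager/conf.d/99-drbd-replication.conf
# Keep the statically configured address on ens15f2 even when the
# link loses carrier, so DRBD's configured local address stays bound.
[device-drbd-replication]
match-device=interface-name:ens15f2
ignore-carrier=yes

followed by a configuration reload, e.g. nmcli general reload.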
When I took these captures, two out of the three resources managed to reconnect, so the outcome varies randomly:
[root@node1 ~]# drbdadm status
mysql role:Primary
disk:UpToDate open:yes
node2.example.net connection:Connecting
nfs role:Primary
volume:2 disk:UpToDate open:yes
node2.example.net role:Secondary
volume:2 peer-disk:UpToDate
pgsql role:Secondary
volume:1 disk:UpToDate open:no
node2.example.net role:Primary
volume:1 peer-disk:UpToDate