Skip to content

udevd event processing race - satellite reports volume ready before /dev/drbd/by-res/XXX link is created) #456

@lukasz-e

Description

@lukasz-e

I have proxmox + linstor + linstor-proxmox plugin environment.
linstor: 1.31.1-1
proxmox: 8.4.0
linstor-proxmox: 8.1.1-1

When vm disk is linstor volume (1 replica drbd) for which local (pve node where vm is registered ) node is diskless, attempt to move such a disk to another storage when vm is powered off fails.
proxmox says:

blockdev: cannot open /dev/drbd/by-res/pm-901dd2b4/0: No such file or directory
create full clone of drive scsi0 (ext-hdd-repl1:pm-901dd2b4_10017)
failed to stat '/dev/drbd/by-res/pm-901dd2b4/0'
Use of uninitialized value $size in division (/) at /usr/share/perl5/PVE/QemuServer.pm line 8491.

NOTICE
  Trying to create diskful resource (pm-c6380af5) on (cirrus1).
Use of uninitialized value $size in multiplication (*) at /usr/share/perl5/PVE/QemuServer.pm line 8084.
Use of uninitialized value $value in numeric gt (>) at /usr/share/perl5/PVE/Format.pm line 74.
Use of uninitialized value $value in sprintf at /usr/share/perl5/PVE/Format.pm line 79.
transferred 0.0 B of 0.0 B (0.00%)
qemu-img: output file is smaller than input file
TASK ERROR: storage migration failed: copy failed: command '/usr/bin/qemu-img convert -p -n -f raw -O raw /dev/drbd/by-res/pm-901dd2b4/0 zeroinit:/dev/drbd/by-res/pm-c6380af5/0' failed: exit code 1

From host logs:

May 29 17:26:42 cirrus1 Satellite[3348808]: 2025-05-29 17:26:42.380 [DeviceManager] INFO  LINSTOR/Satellite/3bb565 SYSTEM - DRBD regenerated resource file: /var/lib/linstor.d/pm-901dd2b4.res
May 29 17:26:42 cirrus1 kernel: drbd pm-901dd2b4: Starting worker thread (node-id 1)
May 29 17:26:42 cirrus1 kernel: drbd pm-901dd2b4 superstorage: Starting sender thread (peer-node-id 0)
May 29 17:26:42 cirrus1 kernel: drbd pm-901dd2b4 superstorage: conn( StandAlone -> Unconnected ) [connect]
May 29 17:26:42 cirrus1 kernel: drbd pm-901dd2b4 superstorage: Starting receiver thread (peer-node-id 0)
May 29 17:26:42 cirrus1 kernel: drbd pm-901dd2b4 superstorage: conn( Unconnected -> Connecting ) [connecting]
May 29 17:26:42 cirrus1 Satellite[3348808]: 2025-05-29 17:26:42.395 [DeviceManager] INFO  LINSTOR/Satellite/3bb565 SYSTEM - Resource 'pm-901dd2b4' [DRBD] adjusted.
May 29 17:26:42 cirrus1 Satellite[3348808]: 2025-05-29 17:26:42.401 [DeviceManager] INFO  LINSTOR/Satellite/ SYSTEM - End DeviceManager cycle 902
May 29 17:26:42 cirrus1 Satellite[3348808]: 2025-05-29 17:26:42.401 [DeviceManager] INFO  LINSTOR/Satellite/c218b0 SYSTEM - Begin DeviceManager cycle 903
May 29 17:26:43 cirrus1 kernel: drbd pm-901dd2b4 superstorage: Handshake to peer 0 successful: Agreed network protocol version 122
May 29 17:26:43 cirrus1 kernel: drbd pm-901dd2b4 superstorage: Feature flags enabled on protocol level: 0x7f TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES RESYNC_DAGTAG
May 29 17:26:43 cirrus1 kernel: drbd pm-901dd2b4 superstorage: Peer authenticated using 20 bytes HMAC
May 29 17:26:43 cirrus1 kernel: drbd pm-901dd2b4: Preparing remote state change 828630895: 0->1 role( Secondary ) conn( Connected )
May 29 17:26:43 cirrus1 kernel: drbd1078: detected capacity change from 0 to 272629760
May 29 17:26:43 cirrus1 kernel: drbd pm-901dd2b4/0 drbd1078: size = 130 GB (136314880 KB)
May 29 17:26:43 cirrus1 kernel: drbd pm-901dd2b4/0 drbd1078 superstorage: my exposed UUID: 0000000000000000
May 29 17:26:43 cirrus1 kernel: drbd pm-901dd2b4/0 drbd1078 superstorage: peer 1658BD60BA1EFD68:FFFFFFFFFFFFFFFF:0000000000000000:0000000000000000 bits:0 flags:20
May 29 17:26:43 cirrus1 kernel: drbd pm-901dd2b4/0 drbd1078: Setting exposed data uuid: 1658BD60BA1EFD68
May 29 17:26:43 cirrus1 kernel: drbd pm-901dd2b4 superstorage: Committing remote state change 828630895 (primary_nodes=0)
May 29 17:26:43 cirrus1 kernel: drbd pm-901dd2b4 superstorage: conn( Connecting -> Connected ) peer( Unknown -> Secondary ) [remote]
May 29 17:26:43 cirrus1 kernel: drbd pm-901dd2b4/0 drbd1078 superstorage: pdsk( DUnknown -> UpToDate ) repl( Off -> Established ) [remote]
May 29 17:26:43 cirrus1 pvedaemon[722348]: failed to stat '/dev/drbd/by-res/pm-901dd2b4/0'
May 29 17:26:43 cirrus1 pvedaemon[722348]: Use of uninitialized value $size in division (/) at /usr/share/perl5/PVE/QemuServer.pm line 8491.

linstor volume info:

root@cirrus1:~# linstor v l |grep pm-901dd2b4
| pm-901dd2b4 | cirrus1      | DfltDisklessStorPool |     0 |    1078 | /dev/drbd1078 |            | Unused |   Diskless | Established(1) |
| pm-901dd2b4 | superstorage | external_hdd_thin    |     0 |    1078 | /dev/drbd1078 |  19.52 GiB | Unused |   UpToDate | Established(1) |

It looks like udevd did not manage to create drbd device link(s) (/dev/drbd/by-res/XXX) "in time" (before proxmox tries to use it).
I think satellite should wait until udevd creates links for drbd device (using "udevadm settle" ?) before returning success status to controller in this situation.

For now I managed to work-around the problem adding "udevadm settle" system call in linstor-proxmox plugin code (after calling controller in activate_volume function),

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions