Bundle of fixes: NBD lifecycle, snapshot cleanup, vzdump consistency, taint-mode #63
Open
arki05 wants to merge 42 commits into moosefs:main
Conversation
The mfsnbdlink config option is honored by map/unmap/list operations
throughout the plugin, but was ignored by the two functions that manage
the mfsbdev daemon lifecycle:
- moosefs_bdev_is_active hardcoded /dev/mfs/nbdsock, so once any storage
started a daemon on the default socket, every other storage's active
check returned true and skipped its own daemon startup.
- moosefs_start_bdev never passed -l to mfsbdev start, so daemons always
bound to the default socket regardless of per-storage configuration.
With multiple MooseFS storages pointing at different mfssubfolders, only
one daemon would actually start (rooted at the first storage's subfolder)
and subsequent map calls for other storages failed with
"error opening MFS file ...: No such file or directory" because the map
was looking up paths in the wrong subfolder view.
Fix both call sites to honor $scfg->{mfsnbdlink} so each storage gets its
own properly-rooted mfsbdev daemon on its own socket.
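The fix lives in the plugin's Perl, but its shape can be sketched in Python. `bdev_start_cmd`/`bdev_list_cmd` and the constant name are hypothetical; only the `-l` flag and the `/dev/mfs/nbdsock` default come from the commit message above:

```python
# Illustrative sketch (the plugin itself is Perl): build per-storage
# mfsbdev commands that honor the configured socket instead of always
# using the default one.
DEFAULT_SOCK = "/dev/mfs/nbdsock"  # default named in the commit message

def bdev_start_cmd(scfg):
    """Start command for this storage's daemon, bound to its own socket."""
    sock = scfg.get("mfsnbdlink") or DEFAULT_SOCK
    # Passing -l is the fix: without it every storage's daemon bound to
    # DEFAULT_SOCK, so other storages' is-active checks wrongly passed.
    return ["mfsbdev", "start", "-l", sock]

def bdev_list_cmd(scfg):
    """The active check must query the same per-storage socket."""
    sock = scfg.get("mfsnbdlink") or DEFAULT_SOCK
    return ["mfsbdev", "list", "-l", sock]
```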
LXC containers use a loop device on top of the NBD device (loop0 -> nbd0). The existing unmap/remap flow in with_nbd_unmapped and volume_resize would orphan the loop device, causing I/O errors and data corruption in running containers.

Changes:
- Add nbd_has_loop_holder(): checks /sys/block/loop*/loop/backing_file to detect loop devices backed by an NBD device
- with_nbd_unmapped: skip the unmap when a loop holder is detected and run the operation directly (mfsmakesnapshot is a metadata-only COW op)
- volume_resize: use mfsbdev resize (in-place) instead of unmap/truncate/remap when a loop holder is detected, keeping the loop->nbd chain intact

Tested: LXC create, resize (running), snapshot create/delete/rollback (running), snapshot+resize+rollback combo — all pass without data corruption.
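The `nbd_has_loop_holder()` check described above can be sketched like this (Python for illustration; the real helper is Perl, and the kernel may append suffixes such as " (deleted)" to `backing_file` that this sketch ignores):

```python
import glob
import os

def nbd_has_loop_holder(nbd_dev, sys_root="/sys"):
    """True if some loop device is backed by the given NBD device.

    LXC stacks loopX on top of /dev/nbdN, so the loop device's
    /sys/block/loopX/loop/backing_file names the NBD node.
    """
    pattern = os.path.join(sys_root, "block", "loop*", "loop", "backing_file")
    for backing in glob.glob(pattern):
        try:
            with open(backing) as f:
                if f.read().strip() == nbd_dev:
                    return True
        except OSError:
            continue  # loop device torn down while we were scanning
    return False
```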
alloc_image: Remove the eager mfsbdev map call. The NBD mapping now happens lazily via activate_volume -> map_volume when the volume is first used. Eager mapping caused symlink collisions in /dev/mfs/ during disk moves between pools on the same MooseFS master — the source volume's link name clashed with the destination allocation.

nbd_device_holder_pid: Skip systemd-udevd alongside mfsbdev in the fd holder scan. udevd transiently opens block devices for probing and is not a real consumer — its presence blocked snapshot cleanup on running containers (vzdump 'vzdump' snapshot deletion failed).

Tested: disk move between moosefs-standard and moosefs-fast succeeds, vzdump --mode snapshot completes with clean snapshot cleanup.
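The udevd exemption in the fd-holder scan might look like this (an illustrative Python rendering of the Perl scan; `IGNORED_COMMS` is a hypothetical name):

```python
import os

# fd holders that are not real consumers: mfsbdev itself holds every
# device it maps, and systemd-udevd only opens devices transiently for
# probing -- that exemption is what this commit adds.
IGNORED_COMMS = {"mfsbdev", "systemd-udevd"}

def nbd_device_holder_pid(nbd_dev, proc_root="/proc"):
    """Pid of the first real process holding nbd_dev open, else None."""
    for pid in os.listdir(proc_root):
        if not pid.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, pid, "comm")) as f:
                if f.read().strip() in IGNORED_COMMS:
                    continue
            fd_dir = os.path.join(proc_root, pid, "fd")
            for fd in os.listdir(fd_dir):
                if os.readlink(os.path.join(fd_dir, fd)) == nbd_dev:
                    return int(pid)
        except OSError:
            continue  # process exited (or fd dir unreadable) mid-scan
    return None
```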
Five call sites used bare 'mfsbdev list' without -l, querying the default socket instead of the per-storage socket. This caused operations (filesystem_path, volume_resize, volume_export, map_volume, with_nbd_unmapped) to fail silently on multi-storage setups where each pool has its own mfsbdev daemon.

Also fix moosefs_start_bdev to mkdir the socket parent directory (e.g. /dev/mfs/) before launching mfsbdev, and export make_path from File::Path to avoid an 'undefined subroutine' error at runtime.
Change nbd_device_holder_pid from a fatal refusal to a skip-unmap. mfsmakesnapshot operates at the MooseFS metadata level, not through the NBD block device, so it's safe to run while QEMU or a loop device holds the device open. Unmapping under a live consumer would corrupt its disk view — skipping the unmap is both safer and required for PVE's 'mixed' snapshot mode where QEMU expects the storage plugin to snapshot while the VM is running.
…bdev 'mixed' mode triggers QEMU external snapshots with backing chains, which requires qcow2. For raw/host_device (NBD), QEMU rejects the 'backing' parameter in blockdev-add. Return undef instead so PVE uses 'storage' mode: briefly pause the VM, call volume_snapshot (mfsmakesnapshot COW), then resume. The pause is typically sub-second for metadata-only ops.
Major simplification of the NBD lifecycle during snapshot operations:

- Remove with_nbd_unmapped, nbd_device_holder_pid, nbd_has_loop_holder, rename_snapshot, get_mfsbdev_size — all dead or unnecessary code. mfsmakesnapshot is a metadata-only COW op, safe while NBD is in use.
- volume_resize: always use truncate + mfsbdev resize (in-place).
- volume_qemu_snapshot_method: return undef (storage mode) instead of 'mixed'. QEMU's blockdev-add rejects 'backing' on raw/host_device.
- volume_snapshot_delete: use unlink instead of mfsrmsnapshot. COW copies are regular files, not MooseFS internal snapshot references.
- Replace 3 inline mfsbdev list parsers with moosefs_bdev_list_mappings. Remove the redundant double list query in deactivate_volume.
- Fix vzdump --mode snapshot: path() and activate_volume now skip NBD for snapshot volumes (accessed via FUSE). map_volume returns undef for snapshots as defense-in-depth.

Net: -400 lines; vzdump snapshot backups now work end-to-end.
Add a sweep in activate_storage that unmaps NBD devices whose backing file no longer exists. This catches leaked mappings from:

- Race conditions (concurrent qm stop + qm destroy)
- Crashed daemons or interrupted operations
- Manual file deletions

Runs on each pvestatd activation cycle and during storage re-activation after restart. Only unmaps when the backing file is confirmed gone — never touches active volumes.
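A minimal sketch of the sweep's core rule, with the daemon interaction abstracted behind callbacks (names hypothetical; Python for illustration):

```python
import os

def sweep_leaked_mappings(mappings, unmap, file_exists=os.path.exists):
    """Unmap NBD devices whose MooseFS backing file is confirmed gone.

    `mappings` is an iterable of (device, backing_file) pairs already
    parsed from the daemon's listing; `unmap` performs the real unmap.
    Mappings whose backing file still exists are never touched.
    """
    swept = []
    for device, backing_file in mappings:
        if not file_exists(backing_file):  # confirmed gone -> leaked
            unmap(device)
            swept.append(device)
    return swept
```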
When all NBD devices are in use and map_volume falls back to the FUSE path, emit a warn so it shows up in syslog/journalctl. Previously this was a silent fallback — the user had no way to know their VM was running on FUSE instead of NBD.
Snapshot operations (create, delete, rollback) have been tested and confirmed working for both running VMs and LXCs. The Oct 2025 mfsbdev snapshotting issue has been resolved across multiple commits (moosefs#53, moosefs#58).
Integration/all fixes
Two fixes:
1. Snapshot cleanup on destroy (taint-safe):
Extract _cleanup_vmid_residuals helper that walks {vmid}/snaps/*
and removes snapshot files matching the volume being freed. PVE's
destroy_vm only calls free_image (not volume_snapshot_delete), so
file-based snapshot layouts leak on destroy.
All path components are regex-untainted for pvedaemon's -T mode —
without this, unlink/rmdir die with "Insecure dependency" which is
silently caught by the eval in destroy_vm.
2. Post-unmap NBD device health check:
After mfsbdev unmap succeeds, verify via /sys/block/nbdN/size that
the kernel device is truly released. If nonzero (stale state from
buffered I/O errors or incomplete teardown), flush buffers and
force-unmap by device path. Without this, the same /dev/nbdN gets
reused for the next VM with stale kernel state, causing mkfs
failures, EBADMSG reads, and I/O errors — flaky failures that only
manifest after many map/unmap cycles in a test suite.
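Fix 1's untaint step is Perl-specific (under `-T`, all external input is tainted until laundered through a regex capture), but the idea translates directly. A Python sketch with an assumed whitelist pattern, not the plugin's actual one:

```python
import re

# Python rendering of the Perl -T untaint step: pvedaemon runs with -T,
# and unlink/rmdir on a tainted path dies with "Insecure dependency".
# The character whitelist below is an assumption for illustration.
_SAFE_PATH = re.compile(r"^([A-Za-z0-9._/\-]+)$")

def untaint_path(path):
    """Return the path if it matches the whitelist, else raise."""
    m = _SAFE_PATH.match(path)
    if m is None or ".." in path:
        raise ValueError(f"refusing unsafe path: {path!r}")
    return m.group(1)  # the regex capture is what "launders" the value
```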
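Fix 2's health check reads the sector count the kernel still reports for the device. A sketch, assuming only that `/sys/block/nbdN/size` is the sysfs size file (in 512-byte sectors), as the text above states:

```python
import os

def nbd_kernel_size(nbd_name, sys_root="/sys"):
    """Size of the kernel device in 512-byte sectors; 0 means released."""
    try:
        with open(os.path.join(sys_root, "block", nbd_name, "size")) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return 0  # device node gone entirely also counts as released

def unmap_is_clean(nbd_name, sys_root="/sys"):
    """Post-unmap verification: a nonzero size signals stale kernel
    state, after which the commit flushes buffers and force-unmaps."""
    return nbd_kernel_size(nbd_name, sys_root) == 0
```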
The existing pvestatd drop-in (issue moosefs#60) only covers one of the three PVE daemons that can spawn mfsbdev / mfsmount. When a user restarts pvedaemon or pveproxy (common after a plugin upgrade), the default KillMode=control-group SIGKILLs every process in the cgroup — including any mfsbdev/mfsmount that activate_storage parented there.

Killing mfsbdev mid-flight has two downstream effects:
- FUSE mounts get "Transport endpoint is not connected"
- MFS-level file locks leak, because mfsbdev never sends unmap. The next map attempt then fails with "MFS file is locked", forcing the plugin onto the slower FUSE fallback path.

Shipping the same drop-in for pvedaemon and pveproxy closes both paths. postinst/prerm are generalised to loop over all three service.d dirs.
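The drop-in itself is small. A sketch of what such an override presumably looks like (the filename is illustrative; only the KillMode=mixed value is named in the PR description):

```ini
# /etc/systemd/system/pvedaemon.service.d/moosefs.conf
# (same drop-in shipped for pvestatd and pveproxy)
[Service]
# mixed: on stop/restart SIGTERM goes only to the unit's main process,
# so mfsbdev/mfsmount helpers parented into the cgroup survive a
# routine daemon restart (control-group would signal them all).
KillMode=mixed
```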
When mfsbdev is killed without a clean shutdown (SIGKILL from systemd's control-group KillMode, OOM kill, hard daemon restart, etc.), the MFS master keeps the file lock that mfsbdev held. The next mapping attempt then fails with "MFS file is locked (likely mapped elsewhere)" even though no mfsbdev is actually holding it.

The existing retry loop only handles "link exists" races. On the lock error it falls through all 5 attempts and drops to the FUSE fallback, which ruins performance until the lock eventually ages out (or never, if the master doesn't age session-bound locks).

Handle this case: on the first "is locked" error, log a warning pointing at the likely cause and retry with -i (ignore_locks). We deliberately do NOT enable -i speculatively — if the error comes from a genuine concurrent mapping, we still want it to fail loudly.
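The recovery path can be sketched as follows (Python for illustration, with the command runner injected; only the `-i`/ignore_locks flag and the "is locked" error text come from the commit message above):

```python
def map_with_lock_recovery(run, cmd, log=print):
    """On an "is locked" failure, warn and retry exactly once with -i
    (ignore_locks). -i is never enabled speculatively, so a genuine
    concurrent mapping still fails loudly on the retry."""
    result = run(cmd)
    if result.returncode == 0 or "is locked" not in (result.stderr or ""):
        return result
    log("mfsbdev map: MFS file lock looks stale (mfsbdev killed without "
        "a clean unmap?) - retrying once with -i")
    return run(cmd + ["-i"])
```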
If a MooseFS-backed volume is resized out of band — offline truncate via losetup, qemu-img resize against the FUSE path, manual recovery work like the disk-shrink flow — the underlying file picks up the new size, but any live mfsbdev mapping keeps caching the length it was mapped with. The plugin then happily returns that /dev/nbdN to callers, who read the stale geometry. Concretely: `move disk` after a 2T→64G shrink still tries to copy 2T, and `qemu-img convert` either over-copies junk or truncates the destination — both of which we hit in practice.

Add a tolerant helper, `moosefs_bdev_sync_size`, that reads the cached size out of `mfsbdev list`, compares it to the on-FUSE file size, and calls `mfsbdev resize` if they drift. It is invoked from map_volume on the existing-mapping early-return path, before the device is handed back.

The helper is deliberately best-effort: list/parse failures return success so we don't turn a diagnostic improvement into a new failure mode. Only a real `mfsbdev resize` failure surfaces, and even then map_volume still returns the device — the warn() makes the drift visible in task logs.
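Stripped of the `mfsbdev list` parsing, the helper's decision logic reduces to a few lines (hypothetical names; Python for illustration):

```python
def sync_mapped_size(mapped_size, fuse_size, resize, warn=print):
    """Best-effort analogue of moosefs_bdev_sync_size: when the size
    cached by the live mapping has drifted from the file's on-FUSE
    size, issue a resize. Failures warn but never fail the caller."""
    if mapped_size == fuse_size:
        return False  # no drift, nothing to do
    try:
        resize(fuse_size)  # e.g. run "mfsbdev resize" for this mapping
    except Exception as e:
        warn(f"mfsbdev resize failed, mapping keeps stale size: {e}")
    return True  # drift was detected (whether or not the fix stuck)
```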
map_volume's contract is to return a usable path (filesystem or /dev/nbdN)
that callers hand straight to QEMU/swtpm/qemu-img. Several early-return
branches instead return $class->SUPER::activate_volume(...), whose scalar
value is undef/empty — it only exists to die on missing files, not to
identify the volume.
The pattern was harmless while it only covered paths that never fire in
practice (non-images/non-raw vtypes, undefined volname). When the
small-disk NBD skip was added for files <= 8 MiB it reused the same
return — and that branch fires on every VM start for TPM state and other
small volumes. start_swtpm reads the return as a state URL and ends up
running:
swtpm_setup --tpmstate file:// ...
which fails with exit 1 and blocks VM start entirely. Legacy-named TPM
state files (vm-VMID-disk-N.raw rather than vm-VMID-tpmstate-N.raw) miss
the earlier /^vm-\d+-tpmstate/ shortcut, so they land on this broken
small-disk path. EFI survives because QemuServer routes EFI through a
different code path.
Fix: return $class->filesystem_path(...) from the small-disk and
non-raw/non-images branches, and return undef from the undefined-volname
branch (which has no meaningful path to hand back anyway).
The KillMode=mixed drop-ins for pvestatd/pvedaemon/pveproxy close the hole where a daemon restart would SIGKILL mfsbdev, but only after the drop-in has taken effect — which itself requires a daemon restart. Upshot: the very first `systemctl restart <pve-daemon>` after upgrading the plugin (or a routine restart on a node that installed the plugin before the drop-in was available) still kills mfsbdev under the old KillMode=control-group.

Kill mfsbdev mid-flight and the damage is real: the master-side session never gets to send unmap for its open NBD handles, those handles become "sustained" files on the master, and the chunks they reference never get reclaimed. A node that runs a few VM create/destroy cycles this way fills its MooseFS pool to 100% even though /mnt/<store>/images is empty — we tripped this tonight running the pve-storage-test-lab.

Remove the parent-cgroup entanglement entirely: wrap the mfsbdev start command in a transient `systemd-run --scope` unit pinned to system.slice. Now mfsbdev lives in its own scope under the system slice, independent of whichever PVE daemon first called activate_storage. Any `systemctl restart` against pvestatd/pvedaemon/pveproxy is now harmless to the MooseFS bridge.

The scope name is deterministic per-daemon (derived from the mfsnbdlink basename) so a subsequent call finds the running scope instead of spawning a second one; the "unit already exists" path is handled the same way as the existing "socket already in use" branch. The KillMode drop-ins stay — defense in depth, and they catch the degenerate case of some out-of-tree code spawning mfsbdev without going through this function.
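The wrapping described above presumably reduces to prepending a few `systemd-run` arguments. A sketch (the exact scope-naming scheme is illustrative; only the derive-from-`mfsnbdlink`-basename rule comes from the text):

```python
import os
import re

def scope_wrapped_cmd(mfsnbdlink, mfsbdev_cmd):
    """Wrap an mfsbdev start command in a transient scope under
    system.slice, detached from whichever PVE daemon called us.

    The unit name is deterministic per daemon (derived from the socket
    link basename), so a second call hits systemd's "unit already
    exists" error instead of spawning a duplicate daemon.
    """
    base = re.sub(r"[^\w.-]", "_", os.path.basename(mfsnbdlink))
    return [
        "systemd-run", "--scope",
        "--slice=system.slice",          # leave the PVE daemon's cgroup
        f"--unit=mfsbdev-{base}.scope",  # deterministic: reuse, don't respawn
    ] + mfsbdev_cmd
```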
Author
FYI - there are still minor issues I'm finding and fixing (stale NBD things when updating/installing etc.), but I'm working through them and will push fixes here as well. (Also some bugs with TPM devices in some cases.)
…oxy" This reverts commit a8321f1.
`systemd-run --scope` moves mfsbdev into its own cgroup (so a
restart of pvedaemon/pvestatd/pveproxy no longer kills it), but the
child still inherits every open file descriptor from whichever PVE
daemon happened to call activate_storage first.
pvestatd keeps an exclusive advisory lock on /run/pvestatd.pid.lock;
pvedaemon and pveproxy do the same for their own pid.lock files.
When mfsbdev inherits that fd and then outlives its spawning
daemon's restart, the daemon cannot re-acquire its own lock on the
next start:
pvestatd[...]: start failed - can't acquire lock
'/var/run/pvestatd.pid.lock' - Resource temporarily unavailable
The daemon ends up in a permanent "failed" state, and since pmxcfs
depends on pvestatd the whole cluster on that node unravels.
Wrap the mfsbdev start command in a shell that closes fds 3+ before
exec'ing. stdin/stdout/stderr (0/1/2) are kept — systemd-run
redirects those to the journal — so logging still works.
Five fixes, all verified by a full run of arki05/pve-storage-test-lab
against this branch:
* systemd-run scope for mfsbdev — mfsbdev now lives in its own
transient scope under system.slice, so restarting any PVE daemon
(pvestatd/pvedaemon/pveproxy) can no longer SIGKILL it. Killed
mfsbdev was the root cause of "sustained" MooseFS file handles
piling up to 100% pool fill after a handful of VM lifecycles.
* Close inherited fds before exec'ing mfsbdev — companion to the
scope change. Without this mfsbdev would inherit pvestatd's
/run/pvestatd.pid.lock fd and prevent pvestatd from re-locking
on restart, taking the whole node's cluster stack down.
* map_volume: return a real filesystem path from the small-disk
skip branch instead of SUPER::activate_volume's undef-ish return
value. Fixes VM start for configurations with ovmf + tpmstate0,
which otherwise got "swtpm_setup --tpmstate file://" with an
empty URL and bailed out before boot.
* map_volume: detect and repair NBD size drift when reusing an
existing mapping. Fixes move-disk / clone after an out-of-band
resize via the FUSE path (e.g. offline qemu-img resize or a
losetup+truncate disk-shrink workflow).
* map_volume: recover from stale MFS file locks via -i retry,
with a warn() pointing at the likely cause. Previously the
retry loop only handled "link exists" and would drop to the
FUSE fallback on "is locked", ruining performance until the
master eventually aged the lock out.
Hey - opening this as one big PR but very happy to split it up however you'd prefer (per-issue, per-fix, whatever's easier to review). Also fine with closing it and re-opening just the parts you want to take.
What's in here
A bunch of fixes that accumulated in my fork (arki05/pve-moosefs) over the last few days while I was hammering on the plugin in a test environment. They roughly cluster around the issues already filed:

- `volume_resize` was treating `$size` as KiB instead of bytes
- `filesystem_path` not honoring `wantarray`
- `KillMode=mixed` systemd drop-in, so `systemctl restart pvestatd` no longer SIGKILLs spawned mfsmount helpers
- `free_image` now cleans up `{vmid}/snaps/*` (PVE's `destroy_vm` calls `free_image` but never `volume_snapshot_delete`, so snapshot files were leaking and breaking VMID reuse). The same code path also untaints inputs via the existing `_untaint_vol` helper, because pvedaemon runs with `-T` and was silently dying on `unlink`.
- Post-unmap NBD recovery (`blockdev --rereadpt` + settle + force-unmap fallback) — the kernel doesn't always release the device cleanly, leaving stale state that breaks the next consumer of the same `/dev/nbdN`.
- `sync` before `mfsmakesnapshot` — without it, vzdump snapshot-mode backups of running LXCs fail with `EBADMSG` because the ext4 journal inside the raw image hasn't been committed yet.

How I tested
Built a nested PVE 9 / Debian 13 cluster harness with a pytest suite covering snapshot/migrate/resize/backup/move on VMs and LXCs, including chained operations (snapshot-under-fio, online resize under load, concurrent
snapshots on multiple VMs). Source is at arki05/pve-storage-test-lab if you want to reproduce or extend it.
Results on a freshly-built lab:
Against moosefs/pve-moosefs main the suite reports 38 failures. The upstream failures cascade off a single bug (cloudinit.raw not removed on destroy → next clone hits "File exists"), so 38 isn't really 38 distinct bugs. But the harness exercises a lot of paths that PVE plugins typically get wrong, and this PR cleans them up.
Where I'd like your eyes
- The post-unmap NBD recovery (`flushbufs` → `rereadpt` → settle → force-unmap) is best-effort; if you'd rather have a more conservative version (just detect + warn, don't force) I can split that out.
- The `sync` before `mfsmakesnapshot` is system-wide. If you'd rather scope it to the affected file only (e.g., `fdatasync` on the open fd, or via mfsbdev) I'm happy to redo.

Thanks for building this plugin - it's been a great base to iterate on. Happy to take whatever feedback you have, in whatever form.