Bundle of fixes: NBD lifecycle, snapshot cleanup, vzdump consistency, taint-mode #63
Open
arki05 wants to merge 42 commits into moosefs:main
Conversation
The mfsnbdlink config option is honored by map/unmap/list operations
throughout the plugin, but was ignored by the two functions that manage
the mfsbdev daemon lifecycle:
- moosefs_bdev_is_active hardcoded /dev/mfs/nbdsock, so once any storage
started a daemon on the default socket, every other storage's active
check returned true and skipped its own daemon startup.
- moosefs_start_bdev never passed -l to mfsbdev start, so daemons always
bound to the default socket regardless of per-storage configuration.
With multiple MooseFS storages pointing at different mfssubfolders, only
one daemon would actually start (rooted at the first storage's subfolder)
and subsequent map calls for other storages failed with
"error opening MFS file ...: No such file or directory" because the map
was looking up paths in the wrong subfolder view.
Fix both call sites to honor $scfg->{mfsnbdlink} so each storage gets its
own properly-rooted mfsbdev daemon on its own socket.
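The fix lives in the plugin's Perl, but its shape can be sketched in Python. `bdev_start_cmd`/`bdev_list_cmd` and the constant name are hypothetical; only the `-l` flag and the `/dev/mfs/nbdsock` default come from the commit message above:

```python
# Illustrative sketch (the plugin itself is Perl): build per-storage
# mfsbdev commands that honor the configured socket instead of always
# using the default one.
DEFAULT_SOCK = "/dev/mfs/nbdsock"  # default named in the commit message

def bdev_start_cmd(scfg):
    """Start command for this storage's daemon, bound to its own socket."""
    sock = scfg.get("mfsnbdlink") or DEFAULT_SOCK
    # Passing -l is the fix: without it every storage's daemon bound to
    # DEFAULT_SOCK, so other storages' is-active checks wrongly passed.
    return ["mfsbdev", "start", "-l", sock]

def bdev_list_cmd(scfg):
    """The active check must query the same per-storage socket."""
    sock = scfg.get("mfsnbdlink") or DEFAULT_SOCK
    return ["mfsbdev", "list", "-l", sock]
```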
LXC containers use a loop device on top of the NBD device (loop0 -> nbd0). The existing unmap/remap flow in with_nbd_unmapped and volume_resize would orphan the loop device, causing I/O errors and data corruption in running containers.

Changes:
- Add nbd_has_loop_holder(): checks /sys/block/loop*/loop/backing_file to detect loop devices backed by an NBD device
- with_nbd_unmapped: skip the unmap when a loop holder is detected and run the operation directly (mfsmakesnapshot is a metadata-only COW op)
- volume_resize: use mfsbdev resize (in-place) instead of unmap/truncate/remap when a loop holder is detected, keeping the loop->nbd chain intact

Tested: LXC create, resize (running), snapshot create/delete/rollback (running), snapshot+resize+rollback combo — all pass without data corruption.
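The `nbd_has_loop_holder()` check described above can be sketched like this (Python for illustration; the real helper is Perl, and the kernel may append suffixes such as " (deleted)" to `backing_file` that this sketch ignores):

```python
import glob
import os

def nbd_has_loop_holder(nbd_dev, sys_root="/sys"):
    """True if some loop device is backed by the given NBD device.

    LXC stacks loopX on top of /dev/nbdN, so the loop device's
    /sys/block/loopX/loop/backing_file names the NBD node.
    """
    pattern = os.path.join(sys_root, "block", "loop*", "loop", "backing_file")
    for backing in glob.glob(pattern):
        try:
            with open(backing) as f:
                if f.read().strip() == nbd_dev:
                    return True
        except OSError:
            continue  # loop device torn down while we were scanning
    return False
```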
alloc_image: Remove the eager mfsbdev map call. The NBD mapping now happens lazily via activate_volume -> map_volume when the volume is first used. Eager mapping caused symlink collisions in /dev/mfs/ during disk moves between pools on the same MooseFS master — the source volume's link name clashed with the destination allocation.

nbd_device_holder_pid: Skip systemd-udevd alongside mfsbdev in the fd holder scan. udevd transiently opens block devices for probing and is not a real consumer — its presence blocked snapshot cleanup on running containers (vzdump 'vzdump' snapshot deletion failed).

Tested: disk move between moosefs-standard and moosefs-fast succeeds, vzdump --mode snapshot completes with clean snapshot cleanup.
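The udevd exemption in the fd-holder scan might look like this (an illustrative Python rendering of the Perl scan; `IGNORED_COMMS` is a hypothetical name):

```python
import os

# fd holders that are not real consumers: mfsbdev itself holds every
# device it maps, and systemd-udevd only opens devices transiently for
# probing -- that exemption is what this commit adds.
IGNORED_COMMS = {"mfsbdev", "systemd-udevd"}

def nbd_device_holder_pid(nbd_dev, proc_root="/proc"):
    """Pid of the first real process holding nbd_dev open, else None."""
    for pid in os.listdir(proc_root):
        if not pid.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, pid, "comm")) as f:
                if f.read().strip() in IGNORED_COMMS:
                    continue
            fd_dir = os.path.join(proc_root, pid, "fd")
            for fd in os.listdir(fd_dir):
                if os.readlink(os.path.join(fd_dir, fd)) == nbd_dev:
                    return int(pid)
        except OSError:
            continue  # process exited (or fd dir unreadable) mid-scan
    return None
```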
Five call sites used bare 'mfsbdev list' without -l, querying the default socket instead of the per-storage socket. This caused operations (filesystem_path, volume_resize, volume_export, map_volume, with_nbd_unmapped) to fail silently on multi-storage setups where each pool has its own mfsbdev daemon.

Also fix moosefs_start_bdev to mkdir the socket parent directory (e.g. /dev/mfs/) before launching mfsbdev, and export make_path from File::Path to avoid an 'undefined subroutine' error at runtime.
Change nbd_device_holder_pid from a fatal refusal to a skip-unmap. mfsmakesnapshot operates at the MooseFS metadata level, not through the NBD block device, so it's safe to run while QEMU or a loop device holds the device open. Unmapping under a live consumer would corrupt its disk view — skipping the unmap is both safer and required for PVE's 'mixed' snapshot mode where QEMU expects the storage plugin to snapshot while the VM is running.
…bdev 'mixed' mode triggers QEMU external snapshots with backing chains, which requires qcow2. For raw/host_device (NBD), QEMU rejects the 'backing' parameter in blockdev-add. Return undef instead so PVE uses 'storage' mode: briefly pause the VM, call volume_snapshot (mfsmakesnapshot COW), then resume. The pause is typically sub-second for metadata-only ops.
Major simplification of the NBD lifecycle during snapshot operations:

- Remove with_nbd_unmapped, nbd_device_holder_pid, nbd_has_loop_holder, rename_snapshot, get_mfsbdev_size — all dead or unnecessary code. mfsmakesnapshot is a metadata-only COW op, safe while NBD is in use.
- volume_resize: always use truncate + mfsbdev resize (in-place).
- volume_qemu_snapshot_method: return undef (storage mode) instead of 'mixed'. QEMU's blockdev-add rejects 'backing' on raw/host_device.
- volume_snapshot_delete: use unlink instead of mfsrmsnapshot. COW copies are regular files, not MooseFS internal snapshot references.
- Replace 3 inline mfsbdev list parsers with moosefs_bdev_list_mappings. Remove the redundant double list query in deactivate_volume.
- Fix vzdump --mode snapshot: path() and activate_volume now skip NBD for snapshot volumes (accessed via FUSE). map_volume returns undef for snapshots as defense-in-depth.

Net: -400 lines; vzdump snapshot backups now work end-to-end.
Add a sweep in activate_storage that unmaps NBD devices whose backing file no longer exists. This catches leaked mappings from:

- Race conditions (concurrent qm stop + qm destroy)
- Crashed daemons or interrupted operations
- Manual file deletions

Runs on each pvestatd activation cycle and during storage re-activation after restart. Only unmaps when the backing file is confirmed gone — never touches active volumes.
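A minimal sketch of the sweep's core rule, with the daemon interaction abstracted behind callbacks (names hypothetical; Python for illustration):

```python
import os

def sweep_leaked_mappings(mappings, unmap, file_exists=os.path.exists):
    """Unmap NBD devices whose MooseFS backing file is confirmed gone.

    `mappings` is an iterable of (device, backing_file) pairs already
    parsed from the daemon's listing; `unmap` performs the real unmap.
    Mappings whose backing file still exists are never touched.
    """
    swept = []
    for device, backing_file in mappings:
        if not file_exists(backing_file):  # confirmed gone -> leaked
            unmap(device)
            swept.append(device)
    return swept
```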
When all NBD devices are in use and map_volume falls back to the FUSE path, emit a warn so it shows up in syslog/journalctl. Previously this was a silent fallback — the user had no way to know their VM was running on FUSE instead of NBD.
Snapshot operations (create, delete, rollback) have been tested and confirmed working for both running VMs and LXCs. The Oct 2025 mfsbdev snapshotting issue has been resolved across multiple commits (moosefs#53, moosefs#58).
Integration/all fixes
Two fixes:
1. Snapshot cleanup on destroy (taint-safe):
Extract _cleanup_vmid_residuals helper that walks {vmid}/snaps/*
and removes snapshot files matching the volume being freed. PVE's
destroy_vm only calls free_image (not volume_snapshot_delete), so
file-based snapshot layouts leak on destroy.
All path components are regex-untainted for pvedaemon's -T mode —
without this, unlink/rmdir die with "Insecure dependency" which is
silently caught by the eval in destroy_vm.
2. Post-unmap NBD device health check:
After mfsbdev unmap succeeds, verify via /sys/block/nbdN/size that
the kernel device is truly released. If nonzero (stale state from
buffered I/O errors or incomplete teardown), flush buffers and
force-unmap by device path. Without this, the same /dev/nbdN gets
reused for the next VM with stale kernel state, causing mkfs
failures, EBADMSG reads, and I/O errors — flaky failures that only
manifest after many map/unmap cycles in a test suite.
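Fix 1's untaint step is Perl-specific (under `-T`, all external input is tainted until laundered through a regex capture), but the idea translates directly. A Python sketch with an assumed whitelist pattern, not the plugin's actual one:

```python
import re

# Python rendering of the Perl -T untaint step: pvedaemon runs with -T,
# and unlink/rmdir on a tainted path dies with "Insecure dependency".
# The character whitelist below is an assumption for illustration.
_SAFE_PATH = re.compile(r"^([A-Za-z0-9._/\-]+)$")

def untaint_path(path):
    """Return the path if it matches the whitelist, else raise."""
    m = _SAFE_PATH.match(path)
    if m is None or ".." in path:
        raise ValueError(f"refusing unsafe path: {path!r}")
    return m.group(1)  # the regex capture is what "launders" the value
```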
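Fix 2's health check reads the sector count the kernel still reports for the device. A sketch, assuming only that `/sys/block/nbdN/size` is the sysfs size file (in 512-byte sectors), as the text above states:

```python
import os

def nbd_kernel_size(nbd_name, sys_root="/sys"):
    """Size of the kernel device in 512-byte sectors; 0 means released."""
    try:
        with open(os.path.join(sys_root, "block", nbd_name, "size")) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return 0  # device node gone entirely also counts as released

def unmap_is_clean(nbd_name, sys_root="/sys"):
    """Post-unmap verification: a nonzero size signals stale kernel
    state, after which the commit flushes buffers and force-unmaps."""
    return nbd_kernel_size(nbd_name, sys_root) == 0
```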
The existing pvestatd drop-in (issue moosefs#60) only covers one of the three PVE daemons that can spawn mfsbdev / mfsmount. When a user restarts pvedaemon or pveproxy (common after a plugin upgrade), the default KillMode=control-group SIGKILLs every process in the cgroup — including any mfsbdev/mfsmount that activate_storage parented there.

Killing mfsbdev mid-flight has two downstream effects:
- FUSE mounts get "Transport endpoint is not connected"
- MFS-level file locks leak, because mfsbdev never sends unmap. The next map attempt then fails with "MFS file is locked", forcing the plugin onto the slower FUSE fallback path.

Shipping the same drop-in for pvedaemon and pveproxy closes both paths. postinst/prerm are generalised to loop over all three service.d dirs.
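The drop-in itself is small. A sketch of what such an override presumably looks like (the filename is illustrative; only the KillMode=mixed value is named in the PR description):

```ini
# /etc/systemd/system/pvedaemon.service.d/moosefs.conf
# (same drop-in shipped for pvestatd and pveproxy)
[Service]
# mixed: on stop/restart SIGTERM goes only to the unit's main process,
# so mfsbdev/mfsmount helpers parented into the cgroup survive a
# routine daemon restart (control-group would signal them all).
KillMode=mixed
```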
When mfsbdev is killed without a clean shutdown (SIGKILL from systemd's control-group KillMode, OOM kill, hard daemon restart, etc.), the MFS master keeps the file lock that mfsbdev held. The next mapping attempt then fails with "MFS file is locked (likely mapped elsewhere)" even though no mfsbdev is actually holding it.

The existing retry loop only handles "link exists" races. On the lock error it falls through all 5 attempts and drops to the FUSE fallback, which ruins performance until the lock eventually ages out (or never, if the master doesn't age session-bound locks).

Handle this case: on the first "is locked" error, log a warning pointing at the likely cause and retry with -i (ignore_locks). We deliberately do NOT enable -i speculatively — if the error comes from a genuine concurrent mapping, we still want it to fail loudly.
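The recovery path can be sketched as follows (Python for illustration, with the command runner injected; only the `-i`/ignore_locks flag and the "is locked" error text come from the commit message above):

```python
def map_with_lock_recovery(run, cmd, log=print):
    """On an "is locked" failure, warn and retry exactly once with -i
    (ignore_locks). -i is never enabled speculatively, so a genuine
    concurrent mapping still fails loudly on the retry."""
    result = run(cmd)
    if result.returncode == 0 or "is locked" not in (result.stderr or ""):
        return result
    log("mfsbdev map: MFS file lock looks stale (mfsbdev killed without "
        "a clean unmap?) - retrying once with -i")
    return run(cmd + ["-i"])
```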
If a MooseFS-backed volume is resized out of band — offline truncate via losetup, qemu-img resize against the FUSE path, manual recovery work like the disk-shrink flow — the underlying file picks up the new size, but any live mfsbdev mapping keeps caching the length it was mapped with. The plugin then happily returns that /dev/nbdN to callers, who read the stale geometry. Concretely: `move disk` after a 2T→64G shrink still tries to copy 2T, and `qemu-img convert` either over-copies junk or truncates the destination — both of which we hit in practice.

Add a tolerant helper, `moosefs_bdev_sync_size`, that reads the cached size out of `mfsbdev list`, compares it to the on-FUSE file size, and calls `mfsbdev resize` if they drift. It is invoked from map_volume on the existing-mapping early-return path, before the device is handed back.

The helper is deliberately best-effort: list/parse failures return success so we don't turn a diagnostic improvement into a new failure mode. Only a real `mfsbdev resize` failure surfaces, and even then map_volume still returns the device — the warn() makes the drift visible in task logs.
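Stripped of the `mfsbdev list` parsing, the helper's decision logic reduces to a few lines (hypothetical names; Python for illustration):

```python
def sync_mapped_size(mapped_size, fuse_size, resize, warn=print):
    """Best-effort analogue of moosefs_bdev_sync_size: when the size
    cached by the live mapping has drifted from the file's on-FUSE
    size, issue a resize. Failures warn but never fail the caller."""
    if mapped_size == fuse_size:
        return False  # no drift, nothing to do
    try:
        resize(fuse_size)  # e.g. run "mfsbdev resize" for this mapping
    except Exception as e:
        warn(f"mfsbdev resize failed, mapping keeps stale size: {e}")
    return True  # drift was detected (whether or not the fix stuck)
```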
map_volume's contract is to return a usable path (filesystem or /dev/nbdN)
that callers hand straight to QEMU/swtpm/qemu-img. Several early-return
branches instead return $class->SUPER::activate_volume(...), whose scalar
value is undef/empty — it only exists to die on missing files, not to
identify the volume.
The pattern was harmless while it only covered paths that never fire in
practice (non-images/non-raw vtypes, undefined volname). When the
small-disk NBD skip was added for files <= 8 MiB it reused the same
return — and that branch fires on every VM start for TPM state and other
small volumes. start_swtpm reads the return as a state URL and ends up
running:
swtpm_setup --tpmstate file:// ...
which fails with exit 1 and blocks VM start entirely. Legacy-named TPM
state files (vm-VMID-disk-N.raw rather than vm-VMID-tpmstate-N.raw) miss
the earlier /^vm-\d+-tpmstate/ shortcut, so they land on this broken
small-disk path. EFI survives because QemuServer routes EFI through a
different code path.
Fix: return $class->filesystem_path(...) from the small-disk and
non-raw/non-images branches, and return undef from the undefined-volname
branch (which has no meaningful path to hand back anyway).
The KillMode=mixed drop-ins for pvestatd/pvedaemon/pveproxy close the hole where a daemon restart would SIGKILL mfsbdev, but only after the drop-in has taken effect — which itself requires a daemon restart. Upshot: the very first `systemctl restart <pve-daemon>` after upgrading the plugin (or a routine restart on a node that installed the plugin before the drop-in was available) still kills mfsbdev under the old KillMode=control-group.

Kill mfsbdev mid-flight and the damage is real: the master-side session never gets to send unmap for its open NBD handles, those handles become "sustained" files on the master, and the chunks they reference never get reclaimed. A node that runs a few VM create/destroy cycles this way fills its MooseFS pool to 100% even though /mnt/<store>/images is empty — we tripped this tonight running the pve-storage-test-lab.

Remove the parent-cgroup entanglement entirely: wrap the mfsbdev start command in a transient `systemd-run --scope` unit pinned to system.slice. Now mfsbdev lives in its own scope under the system slice, independent of whichever PVE daemon first called activate_storage. Any `systemctl restart` against pvestatd/pvedaemon/pveproxy is now harmless to the MooseFS bridge.

The scope name is deterministic per-daemon (derived from the mfsnbdlink basename) so a subsequent call finds the running scope instead of spawning a second one; the "unit already exists" path is handled the same way as the existing "socket already in use" branch. The KillMode drop-ins stay — defense in depth, and they catch the degenerate case of some out-of-tree code spawning mfsbdev without going through this function.
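The wrapping described above presumably reduces to prepending a few `systemd-run` arguments. A sketch (the exact scope-naming scheme is illustrative; only the derive-from-`mfsnbdlink`-basename rule comes from the text):

```python
import os
import re

def scope_wrapped_cmd(mfsnbdlink, mfsbdev_cmd):
    """Wrap an mfsbdev start command in a transient scope under
    system.slice, detached from whichever PVE daemon called us.

    The unit name is deterministic per daemon (derived from the socket
    link basename), so a second call hits systemd's "unit already
    exists" error instead of spawning a duplicate daemon.
    """
    base = re.sub(r"[^\w.-]", "_", os.path.basename(mfsnbdlink))
    return [
        "systemd-run", "--scope",
        "--slice=system.slice",          # leave the PVE daemon's cgroup
        f"--unit=mfsbdev-{base}.scope",  # deterministic: reuse, don't respawn
    ] + mfsbdev_cmd
```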
Author
FYI - there are still minor issues I'm finding and fixing (stale NBD things when updating/installing etc.), but I'm working through them and will push fixes here as well. (Also some bugs with TPM devices in some cases.)
…oxy" This reverts commit a8321f1.
`systemd-run --scope` moves mfsbdev into its own cgroup (so a
restart of pvedaemon/pvestatd/pveproxy no longer kills it), but the
child still inherits every open file descriptor from whichever PVE
daemon happened to call activate_storage first.
pvestatd keeps an exclusive advisory lock on /run/pvestatd.pid.lock;
pvedaemon and pveproxy do the same for their own pid.lock files.
When mfsbdev inherits that fd and then outlives its spawning
daemon's restart, the daemon cannot re-acquire its own lock on the
next start:
pvestatd[...]: start failed - can't acquire lock
'/var/run/pvestatd.pid.lock' - Resource temporarily unavailable
The daemon ends up in a permanent "failed" state, and since pmxcfs
depends on pvestatd the whole cluster on that node unravels.
Wrap the mfsbdev start command in a shell that closes fds 3+ before
exec'ing. stdin/stdout/stderr (0/1/2) are kept — systemd-run
redirects those to the journal — so logging still works.
Five fixes, all verified by a full run of arki05/pve-storage-test-lab
against this branch:
* systemd-run scope for mfsbdev — mfsbdev now lives in its own
transient scope under system.slice, so restarting any PVE daemon
(pvestatd/pvedaemon/pveproxy) can no longer SIGKILL it. Killed
mfsbdev was the root cause of "sustained" MooseFS file handles
piling up to 100% pool fill after a handful of VM lifecycles.
* Close inherited fds before exec'ing mfsbdev — companion to the
scope change. Without this mfsbdev would inherit pvestatd's
/run/pvestatd.pid.lock fd and prevent pvestatd from re-locking
on restart, taking the whole node's cluster stack down.
* map_volume: return a real filesystem path from the small-disk
skip branch instead of SUPER::activate_volume's undef-ish return
value. Fixes VM start for configurations with ovmf + tpmstate0,
which otherwise got "swtpm_setup --tpmstate file://" with an
empty URL and bailed out before boot.
* map_volume: detect and repair NBD size drift when reusing an
existing mapping. Fixes move-disk / clone after an out-of-band
resize via the FUSE path (e.g. offline qemu-img resize or a
losetup+truncate disk-shrink workflow).
* map_volume: recover from stale MFS file locks via -i retry,
with a warn() pointing at the likely cause. Previously the
retry loop only handled "link exists" and would drop to the
FUSE fallback on "is locked", ruining performance until the
master eventually aged the lock out.
Hey - opening this as one big PR but very happy to split it up however you'd prefer (per-issue, per-fix, whatever's easier to review). Also fine with closing it and re-opening just the parts you want to take.
What's in here
A bunch of fixes that accumulated in my fork (arki05/pve-moosefs) over the last few days while I was hammering on the plugin in a test environment. They roughly cluster around the issues already filed:

- `volume_resize` was treating `$size` as KiB instead of bytes
- `filesystem_path` not honoring `wantarray`
- `KillMode=mixed` systemd drop-in, so `systemctl restart pvestatd` no longer SIGKILLs spawned mfsmount helpers
- `free_image` now cleans up `{vmid}/snaps/*` (PVE's `destroy_vm` calls `free_image` but never `volume_snapshot_delete`, so snapshot files were leaking and breaking VMID reuse). The same code path also untaints inputs via the existing `_untaint_vol` helper, because pvedaemon runs with `-T` and was silently dying on `unlink`.
- Post-unmap NBD recovery (`blockdev --rereadpt` + settle + force-unmap fallback) — the kernel doesn't always release the device cleanly, leaving stale state that breaks the next consumer of the same `/dev/nbdN`.
- `sync` before `mfsmakesnapshot` — without it, vzdump snapshot-mode backups of running LXCs fail with `EBADMSG` because the ext4 journal inside the raw image hasn't been committed yet.

How I tested
Built a nested PVE 9 / Debian 13 cluster harness with a pytest suite covering snapshot/migrate/resize/backup/move on VMs and LXCs, including chained operations (snapshot-under-fio, online resize under load, concurrent
snapshots on multiple VMs). Source is at arki05/pve-storage-test-lab if you want to reproduce or extend it.
Results on a freshly-built lab:
Against moosefs/pve-moosefs main the suite reports 38 failures. The upstream failures cascade off a single bug (cloudinit.raw not removed on destroy → next clone hits "File exists"), so 38 isn't really 38 distinct bugs. But the harness exercises a lot of paths that PVE plugins typically get wrong, and this PR cleans them up.
Where I'd like your eyes
- The post-unmap NBD recovery (`flushbufs` → `rereadpt` → settle → force-unmap) is best-effort; if you'd rather have a more conservative version (just detect + warn, don't force) I can split that out.
- The `sync` before `mfsmakesnapshot` is system-wide. If you'd rather scope it to the affected file only (e.g., `fdatasync` on the open fd, or via mfsbdev) I'm happy to redo.

Thanks for building this plugin - it's been a great base to iterate on. Happy to take whatever feedback you have, in whatever form.