
Bundle of fixes: NBD lifecycle, snapshot cleanup, vzdump consistency, taint-mode #63

Open
arki05 wants to merge 42 commits into moosefs:main from arki05:main

Conversation


arki05 commented Apr 17, 2026

Hey - opening this as one big PR but very happy to split it up however you'd prefer (per-issue, per-fix, whatever's easier to review). Also fine with closing it and re-opening just the parts you want to take.

What's in here

A bunch of fixes that accumulated in my fork (arki05/pve-moosefs) over the last few days while I was hammering on the plugin in a test environment. They roughly cluster around the issues already filed.

How I tested

Built a nested PVE 9 / Debian 13 cluster harness with a pytest suite covering snapshot/migrate/resize/backup/move on VMs and LXCs, including chained operations (snapshot-under-fio, online resize under load, concurrent snapshots on multiple VMs). Source is at arki05/pve-storage-test-lab if you want to reproduce or extend it.

Results on a freshly-built lab:

Plugin                              Passed  Failed  Skipped
upstream moosefs/pve-moosefs main       13      38        0
this PR                                 51       0        0

The upstream failures cascade off a single bug (cloudinit.raw not removed on destroy → next clone hits "File exists"), so 38 isn't really 38 distinct bugs. But the harness exercises a lot of paths that PVE plugins typically get wrong, and this PR cleans them up.

Where I'd like your eyes

  • The post-unmap escalation (flushbufs → rereadpt → settle → force-unmap) is best-effort; if you'd rather have a more conservative version (just detect + warn, don't force) I can split that out.
  • The sync before mfsmakesnapshot is system-wide. If you'd rather scope it to the affected file only (e.g., fdatasync on the open fd, or via mfsbdev) I'm happy to redo.

Thanks for building this plugin - it's been a great base to iterate on. Happy to take whatever feedback you have, in whatever form.

arki05 added 30 commits April 11, 2026 02:09
The mfsnbdlink config option is honored by map/unmap/list operations
throughout the plugin, but was ignored by the two functions that manage
the mfsbdev daemon lifecycle:

- moosefs_bdev_is_active hardcoded /dev/mfs/nbdsock, so once any storage
  started a daemon on the default socket, every other storage's active
  check returned true and skipped its own daemon startup.
- moosefs_start_bdev never passed -l to mfsbdev start, so daemons always
  bound to the default socket regardless of per-storage configuration.

With multiple MooseFS storages pointing at different mfssubfolders, only
one daemon would actually start (rooted at the first storage's subfolder)
and subsequent map calls for other storages failed with
"error opening MFS file ...: No such file or directory" because the map
was looking up paths in the wrong subfolder view.

Fix both call sites to honor $scfg->{mfsnbdlink} so each storage gets its
own properly-rooted mfsbdev daemon on its own socket.
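
Roughly, the shape of the fix (a simplified sketch, not the actual diff; the helper names come from the commit above, the bodies and the remaining start options are illustrative only):

    use PVE::Tools;

    sub moosefs_bdev_socket {
        my ($scfg) = @_;
        # Per-storage socket, falling back to the plugin default.
        return $scfg->{mfsnbdlink} // '/dev/mfs/nbdsock';
    }

    sub moosefs_bdev_is_active {
        my ($scfg) = @_;
        # Only report "active" if a daemon answers on *this* storage's
        # socket, not on the hardcoded default.  (Liveness check simplified.)
        my $sock = moosefs_bdev_socket($scfg);
        return -S $sock;
    }

    sub moosefs_start_bdev {
        my ($scfg) = @_;
        my $sock = moosefs_bdev_socket($scfg);
        # Pass -l so the daemon binds to the per-storage socket; the other
        # start options the plugin already assembles are elided here.
        PVE::Tools::run_command(['mfsbdev', 'start', '-l', $sock]);
    }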
LXC containers use a loop device on top of the NBD device (loop0 -> nbd0).
The existing unmap/remap flow in with_nbd_unmapped and volume_resize would
orphan the loop device, causing I/O errors and data corruption in running
containers.

Changes:
- Add nbd_has_loop_holder(): checks /sys/block/loop*/loop/backing_file
  to detect loop devices backed by an NBD device
- with_nbd_unmapped: skip the unmap when a loop holder is detected and
  run the operation directly (mfsmakesnapshot is a metadata-only COW op)
- volume_resize: use mfsbdev resize (in-place) instead of unmap/truncate/remap
  when a loop holder is detected, keeping the loop->nbd chain intact

Tested: LXC create, resize (running), snapshot create/delete/rollback (running),
snapshot+resize+rollback combo — all pass without data corruption.
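
For reference, a minimal sketch of the loop-holder check described above (the name nbd_has_loop_holder is from the commit; the body here is illustrative only):

    use PVE::Tools;

    sub nbd_has_loop_holder {
        my ($nbd_dev) = @_;    # e.g. '/dev/nbd0'
        for my $backing (glob '/sys/block/loop*/loop/backing_file') {
            my $file = PVE::Tools::file_read_firstline($backing) // '';
            # A loop device backed by this NBD node means a running LXC is
            # layered on top of it; unmapping underneath it would corrupt it.
            return 1 if $file =~ m{^\Q$nbd_dev\E};
        }
        return 0;
    }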
alloc_image: Remove the eager mfsbdev map call. The NBD mapping now
happens lazily via activate_volume -> map_volume when the volume is
first used. Eager mapping caused symlink collisions in /dev/mfs/
during disk moves between pools on the same MooseFS master — the
source volume's link name clashed with the destination allocation.

nbd_device_holder_pid: Skip systemd-udevd alongside mfsbdev in the
fd holder scan. udevd transiently opens block devices for probing
and is not a real consumer — its presence blocked snapshot cleanup
on running containers (deletion of the vzdump-created 'vzdump' snapshot failed).

Tested: disk move between moosefs-standard and moosefs-fast succeeds,
vzdump --mode snapshot completes with clean snapshot cleanup.
Five call sites used bare 'mfsbdev list' without -l, querying the
default socket instead of the per-storage socket. This caused
operations (filesystem_path, volume_resize, volume_export, map_volume,
with_nbd_unmapped) to fail silently on multi-storage setups where
each pool has its own mfsbdev daemon.

Also fix moosefs_start_bdev to mkdir the socket parent directory
(e.g. /dev/mfs/) before launching mfsbdev, and export make_path
from File::Path to avoid 'undefined subroutine' at runtime.
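
The mkdir fix is small; something along these lines (sketch, using the standard File::Path/File::Basename helpers):

    use File::Path qw(make_path);
    use File::Basename qw(dirname);

    # Ensure the socket's parent directory (e.g. /dev/mfs/) exists before
    # launching mfsbdev, otherwise the daemon cannot bind its socket.
    my $sock = $scfg->{mfsnbdlink} // '/dev/mfs/nbdsock';
    make_path(dirname($sock)) if !-d dirname($sock);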
Change nbd_device_holder_pid from a fatal refusal to a skip-unmap.
mfsmakesnapshot operates at the MooseFS metadata level, not through
the NBD block device, so it's safe to run while QEMU or a loop device
holds the device open. Unmapping under a live consumer would corrupt
its disk view — skipping the unmap is both safer and required for
PVE's 'mixed' snapshot mode where QEMU expects the storage plugin
to snapshot while the VM is running.
…bdev

'mixed' mode triggers QEMU external snapshots with backing chains, which
requires qcow2. For raw/host_device (NBD), QEMU rejects the 'backing'
parameter in blockdev-add. Return undef instead so PVE uses 'storage'
mode: briefly pause the VM, call volume_snapshot (mfsmakesnapshot COW),
then resume. The pause is typically sub-second for metadata-only ops.
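
In plugin terms this is just (sketch; the exact argument list of the method is abbreviated):

    sub volume_qemu_snapshot_method {
        my ($class, @args) = @_;
        # raw/host_device over NBD cannot carry a qcow2 backing chain, so
        # tell PVE to use 'storage' mode: brief pause, then
        # volume_snapshot (mfsmakesnapshot COW), then resume.
        return undef;
    }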
Major simplification of the NBD lifecycle during snapshot operations:

- Remove with_nbd_unmapped, nbd_device_holder_pid, nbd_has_loop_holder,
  rename_snapshot, get_mfsbdev_size — all dead or unnecessary code.
  mfsmakesnapshot is a metadata-only COW op safe while NBD is in use.

- volume_resize: always use truncate + mfsbdev resize (in-place).

- volume_qemu_snapshot_method: return undef (storage mode) instead of
  'mixed'. QEMU's blockdev-add rejects 'backing' on raw/host_device.

- volume_snapshot_delete: use unlink instead of mfsrmsnapshot. COW
  copies are regular files, not MooseFS internal snapshot references.

- Replace 3 inline mfsbdev list parsers with moosefs_bdev_list_mappings.
  Remove redundant double list query in deactivate_volume.

- Fix vzdump --mode snapshot: path() and activate_volume now skip NBD
  for snapshot volumes (accessed via FUSE). map_volume returns undef
  for snapshots as defense-in-depth.

Net: -400 lines, vzdump snapshot backups now work end-to-end.
Add a sweep in activate_storage that unmaps NBD devices whose backing
file no longer exists. This catches leaked mappings from:
- Race conditions (concurrent qm stop + qm destroy)
- Crashed daemons or interrupted operations
- Manual file deletions

Runs on each pvestatd activation cycle and during storage re-activation
after restart. Only unmaps when the backing file is confirmed gone —
never touches active volumes.
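
A sketch of what the sweep looks like (the entry layout returned by moosefs_bdev_list_mappings and the exact mfsbdev unmap arguments are assumptions for illustration):

    sub moosefs_sweep_stale_mappings {
        my ($class, $scfg) = @_;
        # Assumed layout: [{ file => ..., device => ... }, ...]
        my $mappings = eval { moosefs_bdev_list_mappings($scfg) } // [];
        for my $m (@$mappings) {
            # Only touch mappings whose backing file is confirmed gone;
            # anything still present on the FUSE mount is left alone.
            next if -e $m->{file};
            warn "unmapping stale NBD device $m->{device}: backing file $m->{file} is gone\n";
            eval { PVE::Tools::run_command(['mfsbdev', 'unmap', '-d', $m->{device}]) };
            warn $@ if $@;
        }
    }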
arki05 added 4 commits April 12, 2026 22:43
When all NBD devices are in use and map_volume falls back to the FUSE
path, emit a warn so it shows up in syslog/journalctl. Previously
this was a silent fallback — the user had no way to know their VM
was running on FUSE instead of NBD.
Snapshot operations (create, delete, rollback) have been tested and
confirmed working for both running VMs and LXCs. The Oct 2025 mfsbdev
snapshotting issue has been resolved across multiple commits (moosefs#53, moosefs#58).
Two fixes:

1. Snapshot cleanup on destroy (taint-safe):
   Extract _cleanup_vmid_residuals helper that walks {vmid}/snaps/*
   and removes snapshot files matching the volume being freed. PVE's
   destroy_vm only calls free_image (not volume_snapshot_delete), so
   file-based snapshot layouts leak on destroy.
   All path components are regex-untainted for pvedaemon's -T mode —
   without this, unlink/rmdir die with "Insecure dependency" which is
   silently caught by the eval in destroy_vm.

2. Post-unmap NBD device health check:
   After mfsbdev unmap succeeds, verify via /sys/block/nbdN/size that
   the kernel device is truly released. If nonzero (stale state from
   buffered I/O errors or incomplete teardown), flush buffers and
   force-unmap by device path. Without this, the same /dev/nbdN gets
   reused for the next VM with stale kernel state, causing mkfs
   failures, EBADMSG reads, and I/O errors — flaky failures that only
   manifest after many map/unmap cycles in a test suite.
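
The health check in fix 2 boils down to something like this (sketch; the escalation mirrors the flushbufs/rereadpt/settle/force-unmap sequence described in the PR text, and the unmap-by-device flag is an assumption):

    use PVE::Tools;

    sub nbd_verify_released {
        my ($nbd_dev) = @_;    # e.g. '/dev/nbd3'
        my ($name) = $nbd_dev =~ m{/(nbd\d+)$} or return;
        # After a clean unmap the kernel reports a zero-sized device.
        my $size = PVE::Tools::file_read_firstline("/sys/block/$name/size") // 0;
        return if $size == 0;

        warn "$nbd_dev still reports size $size after unmap, escalating cleanup\n";
        # Best-effort escalation: flush buffers, re-read the partition table,
        # let udev settle, then force-unmap by device path.
        eval { PVE::Tools::run_command(['blockdev', '--flushbufs', $nbd_dev]) };
        eval { PVE::Tools::run_command(['blockdev', '--rereadpt', $nbd_dev]) };
        eval { PVE::Tools::run_command(['udevadm', 'settle']) };
        eval { PVE::Tools::run_command(['mfsbdev', 'unmap', '-d', $nbd_dev]) };
    }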
pkonopelko requested a review from Zorlin April 17, 2026 13:04
arki05 added 5 commits April 18, 2026 03:38
The existing pvestatd drop-in (issue moosefs#60) only covers one of three PVE
daemons that can spawn mfsbdev / mfsmount. When a user restarts pvedaemon
or pveproxy (common after plugin upgrade), the default KillMode=control-group
SIGKILLs every process in the cgroup — including any mfsbdev/mfsmount that
activate_storage parented there.

Killing mfsbdev mid-flight has two downstream effects:
  - FUSE mounts get "Transport endpoint is not connected"
  - MFS-level file locks leak, because mfsbdev never sends unmap. The next
    map attempt then fails with "MFS file is locked", forcing the plugin
    onto the slower FUSE fallback path.

Shipping the same drop-in for pvedaemon and pveproxy closes both paths.
postinst/prerm are generalised to loop over all three service.d dirs.
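
For context, the drop-in in question is a small unit override per daemon (the filename here is illustrative; KillMode=mixed is the setting referenced later in this series):

    # /etc/systemd/system/pvedaemon.service.d/10-moosefs.conf
    # (the same drop-in is shipped for pvestatd and pveproxy)
    [Service]
    KillMode=mixed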
When mfsbdev is killed without a clean shutdown (SIGKILL from systemd's
control-group KillMode, OOM kill, hard daemon restart, etc.) the MFS
master keeps the file lock that mfsbdev held. The next mapping attempt
then fails with "MFS file is locked (likely mapped elsewhere)" even
though no mfsbdev is actually holding it.

The existing retry loop only handles "link exists" races. On the lock
error it falls through all 5 attempts and drops to FUSE fallback, which
ruins performance until the lock eventually ages out (or never, if the
master doesn't age session-bound locks).

Handle this case: on the first "is locked" error, log a warning pointing
at the likely cause and retry with -i (ignore_locks). We deliberately do
NOT enable -i speculatively — if the error comes from a genuine
concurrent mapping, we still want it to fail loudly.
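
The retry branch is roughly (sketch; the loop scaffolding and @map_args are placeholders, -i is the ignore_locks flag named above):

    my $tried_ignore_locks = 0;
    for my $attempt (1 .. 5) {
        eval {
            my @cmd = ('mfsbdev', 'map', @map_args);     # @map_args: placeholder for the real arguments
            push @cmd, '-i' if $tried_ignore_locks;      # -i (ignore_locks) only after a stale-lock hit
            PVE::Tools::run_command(\@cmd);
        };
        my $err = $@;
        last if !$err;                                   # mapped successfully

        if (!$tried_ignore_locks && $err =~ /is locked/i) {
            warn "MFS file lock looks stale (mfsbdev killed without a clean unmap?), retrying with -i\n";
            $tried_ignore_locks = 1;
            next;
        }
        die $err if $attempt == 5;                       # other errors keep the existing behaviour
    }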
If a MooseFS-backed volume is resized out of band — offline truncate via
losetup, qemu-img resize against the FUSE path, manual recovery work like
the disk-shrink flow — the underlying file picks up the new size but any
live mfsbdev mapping keeps caching the length it was mapped with. The
plugin then happily returns that /dev/nbdN to callers, who read the stale
geometry. Concretely: `move disk` after a 2T→64G shrink still tries to
copy 2T, and `qemu-img convert` either over-copies junk or truncates the
destination — both of which we hit in practice.

Add a tolerant helper, `moosefs_bdev_sync_size`, that reads the cached
size out of `mfsbdev list`, compares it to the on-FUSE file size, and
calls `mfsbdev resize` if they drift. Invoked from map_volume on the
existing-mapping early-return path, before the device is handed back.

The helper is deliberately best-effort: list/parse failures return
success so we don't turn a diagnostic improvement into a new failure
mode. Only a real `mfsbdev resize` failure surfaces, and even then
map_volume still returns the device — the warn() makes the drift
visible in task logs.
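
Sketched out, the helper looks something like this (the mapping-entry fields and the exact mfsbdev resize arguments are assumptions; only the overall best-effort shape is the point):

    sub moosefs_bdev_sync_size {
        my ($class, $scfg, $fuse_path, $device) = @_;
        my ($entry) = eval {
            grep { $_->{device} eq $device } @{ moosefs_bdev_list_mappings($scfg) };
        };
        return 1 if !$entry;                       # list/parse failure: stay best-effort

        my $file_size = -s $fuse_path;
        return 1 if !defined($file_size);          # file missing, nothing to compare against
        return 1 if $entry->{size} == $file_size;  # no drift

        warn "NBD mapping $device caches size $entry->{size}, file is $file_size bytes, resizing mapping\n";
        eval {
            # Exact arguments follow the plugin's existing resize invocation.
            PVE::Tools::run_command(['mfsbdev', 'resize', '-d', $device, '-s', $file_size]);
        };
        warn $@ if $@;                             # make the drift visible, but keep returning the device
        return 1;
    }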
map_volume's contract is to return a usable path (filesystem or /dev/nbdN)
that callers hand straight to QEMU/swtpm/qemu-img. Several early-return
branches instead return \$class->SUPER::activate_volume(...), whose scalar
value is undef/empty — it only exists to die on missing files, not to
identify the volume.

The pattern was harmless while it only covered paths that never fire in
practice (non-images/non-raw vtypes, undefined volname). When the
small-disk NBD skip was added for files <= 8 MiB it reused the same
return — and that branch fires on every VM start for TPM state and other
small volumes. start_swtpm reads the return as a state URL and ends up
running:

    swtpm_setup --tpmstate file:// ...

which fails with exit 1 and blocks VM start entirely. Legacy-named TPM
state files (vm-VMID-disk-N.raw rather than vm-VMID-tpmstate-N.raw) miss
the earlier /^vm-\d+-tpmstate/ shortcut, so they land on this broken
small-disk path. EFI survives because QemuServer routes EFI through a
different code path.

Fix: return \$class->filesystem_path(...) from the small-disk and
non-raw/non-images branches, and return undef from the undefined-volname
branch (which has no meaningful path to hand back anyway).
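
In code, the small-disk branch now looks roughly like (sketch; $size_bytes and the surrounding map_volume context are simplified, filesystem_path is the stock plugin method):

    # Small volumes (<= 8 MiB, e.g. TPM state) skip NBD entirely and must
    # still hand back a usable filename for QEMU/swtpm/qemu-img.
    if ($size_bytes <= 8 * 1024 * 1024) {
        return $class->filesystem_path($scfg, $volname, $snapname);
    }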
The KillMode=mixed drop-ins for pvestatd/pvedaemon/pveproxy close the
hole where a daemon restart would SIGKILL mfsbdev, but only after the
drop-in has taken effect — which itself requires a daemon restart.
Upshot: the very first `systemctl restart <pve-daemon>` after upgrading
the plugin (or a routine restart on a node that installed the plugin
before the drop-in was available) still kills mfsbdev under the old
KillMode=control-group.

Kill mfsbdev mid-flight and the damage is real: the master-side session
never gets to send unmap for its open NBD handles, those handles become
"sustained" files on the master, and the chunks they reference never
get reclaimed. A node that runs a few VM create/destroy cycles this way
fills its MooseFS pool to 100% even though /mnt/<store>/images is
empty — we tripped this tonight running the pve-storage-test-lab.

Remove the parent-cgroup entanglement entirely: wrap the mfsbdev start
command in a transient `systemd-run --scope` unit pinned to
system.slice. Now mfsbdev lives in its own scope under the system
slice, independent of whichever PVE daemon first called
activate_storage. Any `systemctl restart` against pvestatd / pvedaemon
/ pveproxy is now harmless to the MooseFS bridge.

The scope name is deterministic per-daemon (derived from the
mfsnbdlink basename) so a subsequent call finds the running scope
instead of spawning a second one; the "unit already exists" path is
handled the same way as the existing "socket already in use" branch.

The KillMode drop-ins stay — defense in depth, and they catch the
degenerate case of some out-of-tree code spawning mfsbdev without
going through this function.
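
The start path then looks roughly like this (sketch; unit-name derivation and error handling are simplified, and the remaining mfsbdev start options are elided):

    my $sock = $scfg->{mfsnbdlink} // '/dev/mfs/nbdsock';
    my ($base) = $sock =~ m{([^/]+)$};
    my $unit = "mfsbdev-$base.scope";    # deterministic per-daemon scope name

    my @cmd = (
        'systemd-run', '--scope',
        "--unit=$unit",
        '--slice=system.slice',          # detached from whichever PVE daemon called activate_storage
        'mfsbdev', 'start', '-l', $sock,
    );
    eval { PVE::Tools::run_command(\@cmd) };
    if (my $err = $@) {
        # "unit already exists" means a previous call already started the
        # daemon for this socket; treat it like the existing
        # "socket already in use" branch.
        die $err if $err !~ /already exists/i;
    }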

arki05 commented Apr 18, 2026

FYI - there are still minor issues I'm finding and fixing (stale NBD things when updating/installing, etc.), but I'm working through them and will push fixes here as well. (Also some bugs with TPM devices in some cases.)

arki05 added 3 commits April 18, 2026 16:56
`systemd-run --scope` moves mfsbdev into its own cgroup (so a
restart of pvedaemon/pvestatd/pveproxy no longer kills it), but the
child still inherits every open file descriptor from whichever PVE
daemon happened to call activate_storage first.

pvestatd keeps an exclusive advisory lock on /run/pvestatd.pid.lock;
pvedaemon and pveproxy do the same for their own pid.lock files.
When mfsbdev inherits that fd and then outlives its spawning
daemon's restart, the daemon cannot re-acquire its own lock on the
next start:

    pvestatd[...]: start failed - can't acquire lock
    '/var/run/pvestatd.pid.lock' - Resource temporarily unavailable

The daemon ends up in a permanent "failed" state, and since pmxcfs
depends on pvestatd the whole cluster on that node unravels.

Wrap the mfsbdev start command in a shell that closes fds 3+ before
exec'ing. stdin/stdout/stderr (0/1/2) are kept — systemd-run
redirects those to the journal — so logging still works.
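
The wrapper amounts to a tiny /bin/sh stub in front of the mfsbdev command (sketch; @mfsbdev_cmd is a placeholder for the command the plugin already builds):

    # Close every inherited descriptor above stderr, then exec the real
    # command, so mfsbdev cannot keep e.g. /run/pvestatd.pid.lock open.
    my $close_fds_sh =
        'for fd in $(ls /proc/self/fd); do'
      . ' [ "$fd" -gt 2 ] && eval "exec $fd>&-";'
      . ' done 2>/dev/null; exec "$@"';

    my @wrapped = ('/bin/sh', '-c', $close_fds_sh, 'sh', @mfsbdev_cmd);
    PVE::Tools::run_command(\@wrapped);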
Five fixes, all verified by a full run of arki05/pve-storage-test-lab
against this branch:

  * systemd-run scope for mfsbdev — mfsbdev now lives in its own
    transient scope under system.slice, so restarting any PVE daemon
    (pvestatd/pvedaemon/pveproxy) can no longer SIGKILL it. Killed
    mfsbdev was the root cause of "sustained" MooseFS file handles
    piling up to 100% pool fill after a handful of VM lifecycles.

  * Close inherited fds before exec'ing mfsbdev — companion to the
    scope change. Without this mfsbdev would inherit pvestatd's
    /run/pvestatd.pid.lock fd and prevent pvestatd from re-locking
    on restart, taking the whole node's cluster stack down.

  * map_volume: return a real filesystem path from the small-disk
    skip branch instead of SUPER::activate_volume's undef-ish return
    value. Fixes VM start for configurations with ovmf + tpmstate0,
    which otherwise got "swtpm_setup --tpmstate file://" with an
    empty URL and bailed out before boot.

  * map_volume: detect and repair NBD size drift when reusing an
    existing mapping. Fixes move-disk / clone after an out-of-band
    resize via the FUSE path (e.g. offline qemu-img resize or a
    losetup+truncate disk-shrink workflow).

  * map_volume: recover from stale MFS file locks via -i retry,
    with a warn() pointing at the likely cause. Previously the
    retry loop only handled "link exists" and would drop to the
    FUSE fallback on "is locked", ruining performance until the
    master eventually aged the lock out.