Persist z_seq across znode eviction#18573

Open
ixhamza wants to merge 2 commits into openzfs:master from truenas:persist_znode_across_eviction

@ixhamza ixhamza commented May 21, 2026

Member

Motivation and Context

Commit 312bdab advertises STATX_ATTR_CHANGE_MONOTONIC to knfsd and builds the NFSv4 change_cookie from (ctime.tv_sec << 32) | zp->z_seq. zp->z_seq is reset to a magic constant in zfs_znode_alloc(), so any event that drops the znode from cache (memory pressure, remount, reboot) brings the file back with the same ctime.tv_sec upper bits but a smaller z_seq in the lower bits, regressing the cookie within the same second.
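
To make the regression concrete, here is a minimal user-space sketch of the packing described above. The helper names `make_cookie` and `cookie_regresses` are hypothetical and not taken from the OpenZFS source, and the reset value passed by the caller is a stand-in, since the actual constant used by zfs_znode_alloc() is not given here.

```c
#include <stdint.h>

/*
 * Hypothetical model of the pre-change cookie: ctime seconds in the
 * upper 32 bits, the in-core sequence number in the lower 32 bits.
 */
static uint64_t
make_cookie(uint64_t ctime_sec, uint32_t z_seq)
{
	return ((ctime_sec << 32) | z_seq);
}

/*
 * Returns 1 if the cookie computed after the znode is reloaded (z_seq
 * reset to seq_reset) is smaller than the cookie handed out before the
 * eviction, with ctime unchanged; returns 0 otherwise.
 */
static int
cookie_regresses(uint64_t ctime_sec, uint32_t seq_before, uint32_t seq_reset)
{
	return (make_cookie(ctime_sec, seq_reset) <
	    make_cookie(ctime_sec, seq_before));
}
```

With the same ctime second, a z_seq of 1000 before eviction and a reset value of 7 after reload produces a smaller cookie, which is the backward step a client observes.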

NFSv4 clients that trust the monotonicity contract treat this as metadata they cannot rely on. VMware ESXi over NFSv4.1 reliably reproduces it with the error "The file specified is not a virtual disk", causing a VM stored on the affected ZFS dataset to fail to power on.

Description

Persist zp->z_seq via a new SA attribute SA_ZPL_SEQ so it survives znode eviction. A new pflag bit ZFS_HAS_SEQ marks the file as carrying SA_ZPL_SEQ in its layout, mirroring the existing ZPL_PROJID/ZFS_PROJID pattern. The bit gates may_grow at SA tx-hold sites, choosing B_TRUE on the first add per file and B_FALSE thereafter, so steady-state operations pay no extra reservation.

A ZFS_PERSIST_SEQ() macro captures z_seq and sets the bit into the caller's bulk in one step, persisting both atomically alongside the file's other SA attributes. Every site that bumps z_seq uses it. zfs_znode_alloc() restores z_seq from SA_ZPL_SEQ when the bit is set.
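
The flow described above can be sketched in user space roughly as follows. This is a toy model under stated assumptions, not the patch itself: `sa_model`, `znode_model`, `seq_may_grow`, `seq_persist`, `znode_load` and `SEQ_RESET_PLACEHOLDER` are all hypothetical names, and the real code works through the SA bulk-update interface rather than plain struct fields.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder for the constant z_seq is reset to; the real value is assumed. */
#define	SEQ_RESET_PLACEHOLDER	7ULL

/* Hypothetical stand-in for the per-file state that lives on disk. */
struct sa_model {
	bool		has_seq;	/* models the "file carries SEQ" marker */
	uint64_t	seq;		/* models the stored sequence value */
};

/* Hypothetical stand-in for the in-core znode. */
struct znode_model {
	uint64_t	z_seq;
};

/* Only the first add per file needs to grow the attribute layout. */
static bool
seq_may_grow(const struct sa_model *sa)
{
	return (!sa->has_seq);
}

/* Capture z_seq and set the marker together, as one logical update. */
static void
seq_persist(struct sa_model *sa, const struct znode_model *zp)
{
	sa->seq = zp->z_seq;
	sa->has_seq = true;
}

/* Restore z_seq from the stored value if present, else fall back to reset. */
static void
znode_load(struct znode_model *zp, const struct sa_model *sa)
{
	zp->z_seq = sa->has_seq ? sa->seq : SEQ_RESET_PLACEHOLDER;
}
```

In this model, a file that has never been modified by a patched binary loads with the reset value and reports that its layout may grow; after one bump and persist, a fresh load returns the bumped value and no further growth is requested.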

No on-disk format change requiring a feature flag is needed. Older binaries preserve the new attribute and bit opaquely. The first modify by a patched binary lazily migrates each file.

How Has This Been Tested?

  • Before: ESXi VM fails to power on over NFSv4.1.
  • After: VM powers on successfully.
  • CI Testing

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

Comment thread include/sys/zfs_znode.h Outdated
@ixhamza ixhamza force-pushed the persist_znode_across_eviction branch from eeb4661 to 56245e3 Compare June 1, 2026 21:14
@behlendorf behlendorf self-requested a review June 1, 2026 22:03
@ixhamza ixhamza force-pushed the persist_znode_across_eviction branch from 56245e3 to 3b204af Compare June 2, 2026 21:28
Comment thread include/sys/zfs_znode.h
Comment thread module/zfs/zfs_vnops.c
Comment thread module/zfs/zfs_vnops.c Outdated
@ixhamza ixhamza force-pushed the persist_znode_across_eviction branch 2 times, most recently from 95dc629 to c7dbebd Compare June 5, 2026 11:47

ixhamza commented Jun 5, 2026

Member Author

Updated per @amotin's private feedback. Rebased onto master.

Comment thread module/os/freebsd/zfs/zfs_znode_os.c Outdated
Comment thread module/os/linux/zfs/zfs_vnops_os.c
@ixhamza ixhamza force-pushed the persist_znode_across_eviction branch 3 times, most recently from a695cd9 to 28c1e17 Compare June 8, 2026 19:07
@ixhamza ixhamza requested a review from amotin June 10, 2026 13:58

@robn robn left a comment

Member

Sort of a low-form of drive-by review from me (ride-by review?).

@ixhamza ran this by me early and I said "looks plausible, and SAs have the nice property of being forward and backward compatible if assembled sensibly". I haven't looked since, but have reread through it now, and I still think that, and I trust from the review comments and the diffs that it's been thought about in enough detail that I don't have to think it all through from first principles (but I will if someone tells me to).

One thought though: have we actually tested that things still work right (fsvo) when importing a pool that has run on this change back into an older ZFS that hasn't? And/or, Linux<->FreeBSD? I mean, that's weird, and coming from cold I don't expect it'd make much difference for NFS clients so long as the number doesn't go backwards. Just so long as we're not inadvertently causing a pool not to import or anything. I think it's fine, but it doesn't hurt to ask.

(Aside: the SA API always feels so difficult...)

ixhamza commented Jun 11, 2026

Member Author

Thanks @robn for taking a look.

> One thought though: have we actually tested that things still work right (fsvo) when importing a pool that has run on this change back into an older ZFS that hasn't? And/or, Linux<->FreeBSD? I mean, that's weird, and coming from cold I don't expect it'd make much difference for NFS clients so long as the number doesn't go backwards. Just so long as we're not inadvertently causing a pool not to import or anything. I think it's fine, but it doesn't hurt to ask.

Linux<->FreeBSD imports cleanly both ways. Older <-> Newer ZFS imports fine too since SA_ZPL_SEQ is just an opaque SA attribute.

@ixhamza ixhamza force-pushed the persist_znode_across_eviction branch from 28c1e17 to 8f193da Compare June 15, 2026 12:16

ixhamza commented Jun 15, 2026

Member Author

Pushed a follow-up commit covering a few additional modification sites where z_seq wasn't being bumped, surfaced while testing more thoroughly. Since the NFSv4 change_cookie now relies entirely on z_seq, I wanted to make sure every modification path advances it. Rebased onto current master.

@ixhamza ixhamza requested review from amotin and robn June 15, 2026 12:19

@robn robn left a comment

Member

Bunch of comments, but all of the kind I would point at in a pair review and say "hmm, what's that about?" and you'll likely tell me and I'll learn something new (man that'd be SO MUCH FUN).

Assuming you know what you're doing and approving on that basis. This is a proper tour de force, nice work!

Comment thread module/os/linux/zfs/zpl_file.c
Comment thread module/os/linux/zfs/zpl_file.c Outdated
Comment on lines +389 to +391
mutex_enter(&zp->z_lock);
zp->z_seq++;
mutex_exit(&zp->z_lock);

Member

My first thought was "how contended is this lock?". Is this called on the first fault for the mapping, or the first fault for each page in the mapping? The first is probably "eh", the second maybe is more of a concern?

The other thought is: are all changes to z_seq everywhere protected by z_lock? That's very difficult to follow, unfortunately.

Both of those could perhaps be obviated by making z_seq an atomic.

(to be clear: I have nothing to suggest these are problems, and don't understand all of this well enough to find that out without a lot of reading, so if you tell me it's fine, then it is!)

Member Author

It's indeed per page, but page_mkwrite only fires when a clean page goes dirty. Atomic made more sense, so I dropped the lock in page_mkwrite and zfs_create and switched z_seq operations to atomic.
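
As an illustration of the lock-free approach, the sketch below uses C11 `<stdatomic.h>` as a user-space stand-in. It is not the code from the patch: the kernel uses its own atomic primitives, and `seq_counter`, `seq_bump` and `seq_read` are hypothetical names.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the part of the znode that holds z_seq. */
struct seq_counter {
	_Atomic uint64_t	z_seq;
};

/*
 * Advance the counter without taking a lock and return the new value.
 * atomic_fetch_add() returns the previous value, hence the "+ 1".
 */
static uint64_t
seq_bump(struct seq_counter *sc)
{
	return (atomic_fetch_add(&sc->z_seq, 1) + 1);
}

/* Read the current value, for example when building the change cookie. */
static uint64_t
seq_read(struct seq_counter *sc)
{
	return (atomic_load(&sc->z_seq));
}
```

Because each bump is a single atomic read-modify-write, concurrent callers such as a per-page fault path cannot lose an increment, and no caller has to reason about which lock covers the counter.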

* rather than a vmalloc'ed region.
*/
/*
* Bump z_seq when a clean page first transitions to dirty via an mmap store.

Member

If this only bumps on read->write transitions, what happens on subsequent writes to the same map if a GETATTR arrives between them? Or is it that those writes are uncommitted, and so not visible?

Member Author

mtime only bumps when the page first goes dirty, not on later writes, and z_seq bumps in the same page_mkwrite. So the two stay in sync: the cookie matches mtime, which is what NFS already relies on.

Comment on lines -744 to +777
- error = -zfs_freesp(zp, offset + len, 0, 0, FALSE);
+ /*
+  * extend file: log=TRUE drives z_seq bump,
+  * mtime/ctime advance, and TX_TRUNCATE ZIL
+  * record; matches zfs_space().
+  */
+ error = -zfs_freesp(zp, offset + len, 0, 0, TRUE);

Member

Hmm, this feels like an actual bugfix beyond bumping z_seq, yes? Which is fine if so, but then is there anything more we need to do here? Tests, close anything in the bug tracker, etc? (I didn't look right now, but I know quirks around hole-punching come up from time to time, and I wonder...)

Member Author

Yeah, it's indeed a real bugfix. Can be reproduced with the following script:

    zpool create tank -O mountpoint=/mnt/tank sda
    cd /mnt/tank
    truncate -s 4096 f
    sync
    # mtime ctime (seconds)
    before=$(stat -c '%Y %Z' f)
    sleep 2
    # extend past EOF, no keep-size
    fallocate -l 1048576 f
    after=$(stat -c '%Y %Z' f)
    echo "before: $before"
    echo "after:  $after"

Without the change, before and after stay the same; with it, mtime/ctime advance. I moved the fix to its own commit and added a test, fallocate_extend_timestamps.

Comment thread module/os/linux/zfs/zpl_inode.c Outdated
Comment thread module/os/linux/zfs/zpl_xattr.c Outdated
ip->i_mode = ITOZ(ip)->z_mode = mode;
zpl_inode_set_ctime_to_ts(ip,
current_time(ip));
ITOZ(ip)->z_seq++;

Member

(It was these cases and others that made me wonder about the locking in zpl_page_mkwrite(), but like I say, didn't study them all).

Member Author

All the z_seq bumps are atomic now, including these xattr/ACL ones.

Comment thread module/os/linux/zfs/zfs_znode_os.c
ixhamza added 2 commits June 24, 2026 23:37
Commit 312bdab advertises STATX_ATTR_CHANGE_MONOTONIC and builds
the NFSv4 change_cookie from (ctime.tv_sec << 32) | zp->z_seq.
zp->z_seq is reset to a magic constant in zfs_znode_alloc(), so any
event that drops the znode from cache (memory pressure, remount,
reboot) regresses the lower bits of the cookie, a backward step
within the same second.

NFSv4 clients that trust this contract treat a regressed cookie as
evidence that the file's metadata cannot be relied on. VMware ESXi
over NFSv4.1 surfaces this as "The file specified is not a virtual
disk", and a VM stored on the affected NFS-exported ZFS dataset
fails to power on.

Widen z_seq to 64 bits and present it directly as the change_cookie,
dropping the ctime packing, so the cookie is a single monotonic
counter that no longer depends on the clock. FreeBSD's va_filerev
consumer also takes the wider value.

Persist z_seq via a new SA attribute SA_ZPL_SEQ. An in-core marker
zp->z_has_seq records whether the file already carries SA_ZPL_SEQ in
its layout; it is derived at load time and never stored on disk, so
no global pflag bit is consumed. ZFS_SEQ_MAY_GROW() keys off the
marker to grow the SA layout only on the first add per file;
ZFS_PERSIST_SEQ() then sets the marker and adds SEQ to the caller's
bulk alongside the file's other SA attributes. zfs_znode_alloc()
restores z_seq from SA_ZPL_SEQ when present and sets the marker;
zfs_rezget() recomputes the marker in place on rollback/recv without
disturbing the in-core z_seq, keeping the cookie monotonic.

A file written before this change carries no SA_ZPL_SEQ; on Linux it
is seeded with (ctime.tv_sec + 1) << 32 so the counter starts above
any pre-change cookie and stays monotonic across the upgrade. A
missing attribute is simply treated as not-yet-migrated, not an
error. FreeBSD never folded ctime into va_filerev, so it needs no
seed.

No feature flag or on-disk format change is needed: the new SA
attribute is keyed by name, so an implementation that does not know
it preserves it opaquely, and the first modify lazily migrates each
file. Covers both the Linux and FreeBSD ZPL.

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>

Growing a file with fallocate updated its size but left mtime/ctime
unchanged and didn't log the change. A fallocate that changes the file
size should update mtime/ctime, and the change should be logged so it
survives a crash.

Pass log=TRUE to zfs_freesp() on the extend path so it updates the
timestamps and logs the size change, matching zfs_space(). Punch-hole
and zero-range already use this path and are unaffected.

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
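
The seeding rule in the first commit message, (ctime.tv_sec + 1) << 32, can be checked with a few lines of arithmetic. The sketch below assumes the pre-change cookie used only the lower 32 bits for z_seq, as the original packing implies; `old_cookie` and `seed_seq` are hypothetical helper names.

```c
#include <stdint.h>

/* Pre-change cookie: ctime seconds in the high half, z_seq in the low half. */
static uint64_t
old_cookie(uint64_t ctime_sec, uint32_t z_seq)
{
	return ((ctime_sec << 32) | z_seq);
}

/*
 * Seed for a file that has no stored sequence yet. Adding one to the
 * seconds field before shifting puts the seed above every value the
 * old packing could have produced for this ctime second, whatever the
 * low 32 bits were.
 */
static uint64_t
seed_seq(uint64_t ctime_sec)
{
	return ((ctime_sec + 1) << 32);
}
```

Even in the worst case, where the old low half was 0xFFFFFFFF, the seed is exactly one larger than the old cookie, so the counter never starts below a value a client may already hold for that ctime.
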
@ixhamza ixhamza force-pushed the persist_znode_across_eviction branch from 8f193da to 7d01e1b Compare June 24, 2026 19:09
boolean_t fuid_dirtied = B_FALSE;
boolean_t handle_eadir = B_FALSE;
- sa_bulk_attr_t bulk[7], xattr_bulk[7];
+ sa_bulk_attr_t bulk[9], xattr_bulk[7];

Contributor

It looks to me like xattr_bulk is actually slightly oversized here.

Suggested change
- sa_bulk_attr_t bulk[9], xattr_bulk[7];
+ sa_bulk_attr_t bulk[9], xattr_bulk[6];

These sizes look right, but there are so many conditionals that it's hard to see. To make sure there are no future overflows, how about adding a few asserts, either after out2: or, better yet, before the relevant sa_bulk_update()?

        ASSERT3S(count, <=, 9);
        ASSERT3S(xattr_count, <=, 6);
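
A user-space sketch of the same idea, using plain `assert()` in place of the ZFS ASSERT3S macro. The struct, the attribute ids and `build_bulk` are hypothetical; the point is only that the count is checked against the array capacity after the conditional adds and before the array is consumed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one bulk attribute slot. */
struct bulk_attr {
	int	attr_id;
};

/*
 * Fill a bulk array whose final length depends on runtime conditions,
 * then verify the count against the capacity the caller declared. The
 * caller must size the array for the worst case (three entries here).
 */
static size_t
build_bulk(struct bulk_attr *bulk, size_t cap, int add_seq, int add_extra)
{
	size_t count = 0;

	bulk[count++].attr_id = 1;		/* always present */
	if (add_seq)
		bulk[count++].attr_id = 2;	/* conditional entry */
	if (add_extra)
		bulk[count++].attr_id = 3;	/* conditional entry */

	/* Check placed just before the array would be handed to the update. */
	assert(count <= cap);
	return (count);
}
```

If a later change adds another conditional entry without enlarging the array, the assertion fires in debug builds at the point of use instead of the overflow going unnoticed.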

Labels

Status: Code Review Needed Ready for review and testing

4 participants