Persist z_seq across znode eviction by ixhamza · Pull Request #18573 · openzfs/zfs

ixhamza · 2026-05-21T21:15:55Z

Motivation and Context

Commit 312bdab advertises STATX_ATTR_CHANGE_MONOTONIC to knfsd and builds the NFSv4 change_cookie from (ctime.tv_sec << 32) | zp->z_seq. zp->z_seq is reset to a magic constant in zfs_znode_alloc(), so any event that drops the znode from cache (memory pressure, remount, reboot) brings the file back with the same ctime.tv_sec upper bits but a smaller z_seq in the lower bits, regressing the cookie within the same second.
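The regression described above can be reproduced with a little arithmetic. The following is a minimal sketch, not the code from commit 312bdab: the `sketch_` names and the reset value `SKETCH_SEQ_RESET` are hypothetical, and it assumes z_seq occupies the lower 32 bits of the cookie as the packing expression implies.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the constant that z_seq is reset to on alloc. */
#define	SKETCH_SEQ_RESET	0x1000u

/* Cookie layout from the description: ctime seconds high, z_seq low. */
uint64_t
sketch_change_cookie(uint64_t ctime_sec, uint32_t z_seq)
{
	return ((ctime_sec << 32) | z_seq);
}

/*
 * Returns 1 if evicting and reloading the znode moves the cookie backwards.
 * `bumps` is the number of modifications made within the same ctime second
 * before the eviction.
 */
int
sketch_cookie_regresses(uint64_t ctime_sec, uint32_t bumps)
{
	uint32_t seq = SKETCH_SEQ_RESET + bumps;
	uint64_t before = sketch_change_cookie(ctime_sec, seq);
	/* Eviction and reload: z_seq is reset, the stored ctime is unchanged. */
	uint64_t after = sketch_change_cookie(ctime_sec, SKETCH_SEQ_RESET);

	return (after < before);
}
```

Any non-zero number of bumps before the eviction yields a smaller cookie afterwards, which is the backward step an NFSv4 client can observe.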

NFSv4 clients that trust the monotonicity contract treat a regressed cookie as evidence that the file's metadata cannot be relied on. VMware ESXi over NFSv4.1 reliably reproduces it with the error "The file specified is not a virtual disk", causing a VM stored on the affected ZFS dataset to fail to power on.

Description

Persist zp->z_seq via a new SA attribute SA_ZPL_SEQ so it survives znode eviction. A new pflag bit ZFS_HAS_SEQ marks the file as carrying SA_ZPL_SEQ in its layout, mirroring the existing ZPL_PROJID/ZFS_PROJID pattern. The bit gates may_grow at SA tx-hold sites, choosing B_TRUE on the first add per file and B_FALSE thereafter, so steady-state operations pay no extra reservation.

A ZFS_PERSIST_SEQ() macro captures z_seq and sets the bit into the caller's bulk in one step, persisting both atomically alongside the file's other SA attributes. Every site that bumps z_seq uses it. zfs_znode_alloc() restores z_seq from SA_ZPL_SEQ when the bit is set.
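The flag-gated flow described above can be modelled in a few lines. This is a simplified sketch under stated assumptions, not the patch itself: the structs, the `sketch_` functions, and the bit value `SKETCH_HAS_SEQ` are hypothetical stand-ins for the znode, the SA bulk array, and ZFS_HAS_SEQ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit position; the real ZFS_HAS_SEQ value is not shown here. */
#define	SKETCH_HAS_SEQ	(1ULL << 0)

typedef struct sketch_znode {
	uint64_t z_pflags;	/* persisted flag word */
	uint64_t z_seq;		/* in-core change counter */
} sketch_znode_t;

/* Stand-in for the values a caller queues in its SA bulk array. */
typedef struct sketch_bulk {
	uint64_t b_pflags;
	uint64_t b_seq;
	int b_count;
} sketch_bulk_t;

/* The layout may grow only until the file already carries the attribute. */
bool
sketch_seq_may_grow(const sketch_znode_t *zp)
{
	return ((zp->z_pflags & SKETCH_HAS_SEQ) == 0);
}

/* Capture z_seq and set the bit, queuing both for the same bulk update. */
void
sketch_persist_seq(sketch_znode_t *zp, sketch_bulk_t *bulk)
{
	zp->z_pflags |= SKETCH_HAS_SEQ;
	bulk->b_pflags = zp->z_pflags;
	bulk->b_seq = zp->z_seq;
	bulk->b_count += 2;
}

/* Reload after eviction: restore z_seq only when the bit says it exists. */
void
sketch_znode_load(sketch_znode_t *zp, const sketch_bulk_t *ondisk,
    uint64_t reset_value)
{
	zp->z_pflags = ondisk->b_pflags;
	zp->z_seq = (zp->z_pflags & SKETCH_HAS_SEQ) ?
	    ondisk->b_seq : reset_value;
}
```

In this model the first persist per file is the only one for which `sketch_seq_may_grow()` returns true, which mirrors the B_TRUE-then-B_FALSE behaviour the description attributes to may_grow.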

No on-disk format change requiring a feature flag is needed. Older binaries preserve the new attribute and bit opaquely. The first modify by a patched binary lazily migrates each file.

How Has This Been Tested?

Before: ESXi VM fails to power on over NFSv4.1.
After: VM powers on successfully.
CI Testing

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Quality assurance (non-breaking change which makes the code more robust against bugs)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

ixhamza · 2026-06-05T11:52:19Z

Updated per @amotin's private feedback. Rebased onto master.

robn

Sort of a low-form of drive-by review from me (ride-by review?).

@ixhamza ran this by me early and I said "looks plausible, and SAs have the nice property of being forward and backward compatible if assembled sensibly". I haven't looked since, but have reread through it now, and I still think that, and I trust from the review comments and the diffs that it's been thought about in enough detail that I don't have to think it all through from first principles (but I will if someone tells me to).

One thought though: have we actually tested that things still work right (fsvo) when importing a pool that has run on this change back into an older ZFS that hasn't? And/or Linux<->FreeBSD? I mean, that's weird, and coming from cold I don't expect it'd make much difference for NFS clients so long as the number doesn't go backwards. Just so long as we're not inadvertently causing a pool not to import or anything. I think it's fine, but it doesn't hurt to ask.

(Aside: the SA API always feels so difficult...)

ixhamza · 2026-06-11T09:55:23Z

Thanks @robn for taking a look.

> One thought though: have we actually tested that things still work right (fsvo) when importing a pool that has run on this change back into an older ZFS that hasn't? And/or Linux<->FreeBSD? I mean, that's weird, and coming from cold I don't expect it'd make much difference for NFS clients so long as the number doesn't go backwards. Just so long as we're not inadvertently causing a pool not to import or anything. I think it's fine, but it doesn't hurt to ask.

Linux<->FreeBSD imports cleanly both ways. Older <-> Newer ZFS imports fine too since SA_ZPL_SEQ is just an opaque SA attribute.

ixhamza · 2026-06-15T12:17:27Z

Pushed a follow-up commit covering a few additional modification sites where z_seq wasn't being bumped, which surfaced during more thorough testing. Since the NFSv4 change_cookie now relies entirely on z_seq, I wanted to make sure every modification path advances it. Rebased onto current master.

robn

Bunch of comments, but all of the kind I would point at in a pair review and say "hmm, what's that about?" and you'll likely tell me and I'll learn something new (man that'd be SO MUCH FUN).

Assuming you know what you're doing and approving on that basis. This is a proper tour de force, nice work!

robn · 2026-06-24T01:32:12Z

+	mutex_enter(&zp->z_lock);
+	zp->z_seq++;
+	mutex_exit(&zp->z_lock);


My first thought was "how contended is this lock?". Is this called on the first fault for the mapping, or the first fault for each page in the mapping? The first is probably "eh", the second maybe is more of a concern?

The other thought is: are all changes to z_seq everywhere protected by z_lock? Very difficult to follow, unfortunately.

Both those could perhaps be obviated by making z_seq an atomic.

(To be clear: I have nothing to suggest these are problems, and don't understand all of this well enough to find that out without a lot of reading, so if you tell me it's fine, then it is!)

It's indeed per page, but page_mkwrite only fires when a clean page goes dirty. Atomic made more sense, so I dropped the lock in page_mkwrite and zfs_create and switched z_seq operations to atomic.
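As an illustration of the lock-free bump discussed in this thread, here is a minimal sketch that uses C11 `<stdatomic.h>` as a stand-in for the kernel's own atomic primitives. The struct and the `sketch_` helpers are hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical reduced znode holding only the counter. */
typedef struct sketch_znode {
	_Atomic uint64_t z_seq;
} sketch_znode_t;

/* Bump the counter without taking a lock; returns the new value. */
uint64_t
sketch_seq_bump(sketch_znode_t *zp)
{
	/* atomic_fetch_add returns the value held before the addition. */
	return (atomic_fetch_add(&zp->z_seq, 1) + 1);
}

/* Read the counter, e.g. when building the change cookie. */
uint64_t
sketch_seq_read(sketch_znode_t *zp)
{
	return (atomic_load(&zp->z_seq));
}
```

Because each increment is a single read-modify-write, concurrent bumps from different paths cannot lose an update, which is the property the mutex was providing for this one field.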

robn · 2026-06-24T01:33:41Z

 * rather than a vmalloc'ed region.
 */
+/*
+ * Bump z_seq when a clean page first transitions to dirty via an mmap store.


If this only bumps on read->write transitions, what happens on subsequent writes to the same mapping if a GETATTR arrives between them? Or is it that those writes are uncommitted, and so not visible?

mtime only bumps when the page first goes dirty, not on later writes, and z_seq bumps on the same page_mkwrite. So the two stay in sync: the cookie tracks mtime, which is what NFS already relies on.
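The clean-to-dirty behaviour described in this reply can be modelled as a toy state machine. This is a sketch for illustration only: the types and `sketch_` functions are hypothetical and do not reflect the kernel's page or inode structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct sketch_page {
	bool p_dirty;
} sketch_page_t;

typedef struct sketch_file {
	uint64_t z_seq;		/* change counter */
	uint64_t mtime_bumps;	/* number of mtime updates, for comparison */
} sketch_file_t;

/* A store through the mapping: only the first store to a clean page bumps. */
void
sketch_mmap_store(sketch_file_t *f, sketch_page_t *p)
{
	if (!p->p_dirty) {
		p->p_dirty = true;
		f->z_seq++;
		f->mtime_bumps++;
	}
}

/* Writeback cleans the page, so the next store counts as a new transition. */
void
sketch_writeback(sketch_page_t *p)
{
	p->p_dirty = false;
}
```

In this model the counter and the mtime update count always move together, matching the claim that the cookie never changes independently of mtime on the mmap path.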

robn · 2026-06-24T01:41:05Z

-			error = -zfs_freesp(zp, offset + len, 0, 0, FALSE);
+			/*
+			 * extend file: log=TRUE drives z_seq bump,
+			 * mtime/ctime advance, and TX_TRUNCATE ZIL
+			 * record; matches zfs_space().
+			 */
+			error = -zfs_freesp(zp, offset + len, 0, 0, TRUE);


Hmm, this feels like an actual bugfix beyond bumping z_seq, yes? Which is fine if so, but then is there anything more we need to do here? Tests, close anything in the bug tracker, etc? (I didn't look right now, but I know quirks around hole-punching come up from time to time, and I wonder...)

Yeah, it's indeed a real bugfix. Can be reproduced with the following script:

zpool create tank -O mountpoint=/mnt/tank sda
cd /mnt/tank
truncate -s 4096 f
sync
# mtime ctime (seconds)
before=$(stat -c '%Y %Z' f)
sleep 2
# extend past EOF, no keep-size
fallocate -l 1048576 f
after=$(stat -c '%Y %Z' f)
echo "before: $before"
echo "after: $after"

Without the change, before and after stay the same; with it, mtime/ctime advance. Moved it to its own commit and added a test, fallocate_extend_timestamps.

robn · 2026-06-24T01:43:13Z

 					ip->i_mode = ITOZ(ip)->z_mode = mode;
 					zpl_inode_set_ctime_to_ts(ip,
 					    current_time(ip));
+					ITOZ(ip)->z_seq++;


(It was these cases and others that made me wonder about the locking in zpl_page_mkwrite(), but like I say, didn't study them all).

All the z_seq bumps are atomic now, including these xattr/ACL ones.

Commit 312bdab advertises STATX_ATTR_CHANGE_MONOTONIC and builds the NFSv4 change_cookie from (ctime.tv_sec << 32) | zp->z_seq. zp->z_seq is reset to a magic constant in zfs_znode_alloc(), so any event that drops the znode from cache (memory pressure, remount, reboot) regresses the lower bits of the cookie, a backward step within the same second. NFSv4 clients that trust this contract treat a regressed cookie as evidence that the file's metadata cannot be relied on. VMware ESXi over NFSv4.1 surfaces this as "The file specified is not a virtual disk", and a VM stored on the affected NFS-exported ZFS dataset fails to power on.

Widen z_seq to 64 bit and present it directly as the change_cookie, dropping the ctime packing, so the cookie is a single monotonic counter that no longer depends on the clock. FreeBSD's va_filerev consumer also takes the wider value.

Persist z_seq via a new SA attribute SA_ZPL_SEQ. An in-core marker zp->z_has_seq records whether the file already carries SA_ZPL_SEQ in its layout; it is derived at load time and never stored on disk, so no global pflag bit is consumed. ZFS_SEQ_MAY_GROW() keys off the marker to grow the SA layout only on the first add per file; ZFS_PERSIST_SEQ() then sets the marker and adds SEQ to the caller's bulk alongside the file's other SA attributes. zfs_znode_alloc() restores z_seq from SA_ZPL_SEQ when present and sets the marker; zfs_rezget() recomputes the marker in place on rollback/recv without disturbing the in-core z_seq, keeping the cookie monotonic.

A file written before this change carries no SA_ZPL_SEQ; on Linux it is seeded with (ctime.tv_sec + 1) << 32 so the counter starts above any pre-change cookie and stays monotonic across the upgrade. A missing attribute is simply treated as not-yet-migrated, not an error. FreeBSD never folded ctime into va_filerev, so it needs no seed.

No feature flag or on-disk format change is needed: the new SA attribute is keyed by name, so an implementation that does not know it preserves it opaquely, and the first modify lazily migrates each file. Covers both the Linux and FreeBSD ZPL.

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
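The seeding rule for files that predate the change can be checked with a short calculation. This is a sketch under one assumption that the commit message implies but does not state outright: the pre-change z_seq fit in the lower 32 bits of the old cookie. The `sketch_` function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Pre-change cookie: ctime seconds high, a 32-bit z_seq low (assumed). */
uint64_t
sketch_old_cookie(uint64_t ctime_sec, uint32_t z_seq)
{
	return ((ctime_sec << 32) | z_seq);
}

/* Seed used for a file that has no SA_ZPL_SEQ yet, per the commit message. */
uint64_t
sketch_seed_seq(uint64_t ctime_sec)
{
	return ((ctime_sec + 1) << 32);
}
```

Even with the old lower half saturated at UINT32_MAX, the seed for the same ctime second is strictly larger, so the first post-upgrade cookie cannot fall below one a client may already hold.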

Growing a file with fallocate updated its size but left mtime/ctime unchanged and didn't log the change. A fallocate that changes the file size should update mtime/ctime, and the change should be logged so it survives a crash.

Pass log=TRUE to zfs_freesp() on the extend path so it updates the timestamps and logs the size change, matching zfs_space(). Punch-hole and zero-range already use this path and are unaffected.

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>

behlendorf · 2026-06-24T19:54:33Z

 	boolean_t	fuid_dirtied = B_FALSE;
 	boolean_t	handle_eadir = B_FALSE;
-	sa_bulk_attr_t	bulk[7], xattr_bulk[7];
+	sa_bulk_attr_t	bulk[9], xattr_bulk[7];


It looks to me like xattr_bulk is actually slightly oversized here.

Suggested change

-	sa_bulk_attr_t	bulk[9], xattr_bulk[7];
+	sa_bulk_attr_t	bulk[9], xattr_bulk[6];

These sizes look right, but there are so many conditionals it's hard to see. To make sure there are no future overflows how about adding a few asserts either after out2:, or better yet before the relevant sa_bulk_update().

ASSERT3S(count, <=, 9);
ASSERT3S(xattr_count, <=, 6);
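The idea behind the suggested asserts, checking the running entry count against the array size just before the array is consumed, can be sketched in standalone C. The attribute ids, the `SKETCH_WANT_*` conditions, and `sketch_fill_bulk()` are hypothetical, and the standard `assert()` stands in for ASSERT3S.

```c
#include <assert.h>

#define	SKETCH_BULK_MAX	9

typedef struct sketch_attr {
	int a_id;	/* hypothetical attribute identifier */
} sketch_attr_t;

/* Hypothetical conditions that each contribute one bulk entry. */
#define	SKETCH_WANT_MODE	0x1
#define	SKETCH_WANT_SIZE	0x2
#define	SKETCH_WANT_SEQ		0x4

/* Fill `bulk` (capacity SKETCH_BULK_MAX) and return the entry count. */
int
sketch_fill_bulk(sketch_attr_t *bulk, int wants)
{
	int count = 0;

	bulk[count++].a_id = 100;	/* unconditional entry */
	if (wants & SKETCH_WANT_MODE)
		bulk[count++].a_id = 101;
	if (wants & SKETCH_WANT_SIZE)
		bulk[count++].a_id = 102;
	if (wants & SKETCH_WANT_SEQ)
		bulk[count++].a_id = 103;

	/* Checked right before the bulk update would consume the array. */
	assert(count <= SKETCH_BULK_MAX);
	return (count);
}
```

With many conditional additions the maximum count is hard to verify by reading, so a check at the point of use documents the bound and catches a future branch that adds an entry without resizing the array.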

robn self-requested a review May 22, 2026 00:11

ixhamza force-pushed the persist_znode_across_eviction branch from 107a3df to eeb4661 Compare May 22, 2026 08:31

behlendorf added the Status: Code Review Needed Ready for review and testing label May 22, 2026

This was referenced May 26, 2026

NAS-141170 / 25.10.4 / Revert "Add handling for STATX_CHANGE_COOKIE (#343)" truenas/zfs#392

Merged

NAS-141170 / 27.0.0-BETA.1 / Revert "Add handling for STATX_CHANGE_COOKIE (#343)" truenas/zfs#393

Merged

bugclerk mentioned this pull request May 26, 2026

NAS-141170 / 26.0.0-RC.1 / Revert "Add handling for STATX_CHANGE_COOKIE (#343)" (by ixhamza) truenas/zfs#394

Merged

amotin reviewed May 28, 2026

Comment thread include/sys/zfs_znode.h Outdated

ixhamza force-pushed the persist_znode_across_eviction branch from eeb4661 to 56245e3 Compare June 1, 2026 21:14

behlendorf self-requested a review June 1, 2026 22:03

ixhamza force-pushed the persist_znode_across_eviction branch from 56245e3 to 3b204af Compare June 2, 2026 21:28

behlendorf reviewed Jun 2, 2026

Comment thread include/sys/zfs_znode.h

Comment thread module/zfs/zfs_vnops.c

Comment thread module/zfs/zfs_vnops.c Outdated

ixhamza force-pushed the persist_znode_across_eviction branch 2 times, most recently from 95dc629 to c7dbebd Compare June 5, 2026 11:47

amotin reviewed Jun 5, 2026

Comment thread module/os/freebsd/zfs/zfs_znode_os.c Outdated

Comment thread module/os/linux/zfs/zfs_vnops_os.c

ixhamza force-pushed the persist_znode_across_eviction branch 3 times, most recently from a695cd9 to 28c1e17 Compare June 8, 2026 19:07

ixhamza requested a review from amotin June 10, 2026 13:58

amotin approved these changes Jun 10, 2026

robn approved these changes Jun 11, 2026

ixhamza force-pushed the persist_znode_across_eviction branch from 28c1e17 to 8f193da Compare June 15, 2026 12:16

ixhamza requested review from amotin and robn June 15, 2026 12:19

robn approved these changes Jun 24, 2026

ixhamza added 2 commits June 24, 2026 23:37

ixhamza force-pushed the persist_znode_across_eviction branch from 8f193da to 7d01e1b Compare June 24, 2026 19:09

amotin approved these changes Jun 24, 2026

behlendorf approved these changes Jun 24, 2026
