Persist z_seq across znode eviction#18573
Conversation
107a3df to
eeb4661
Compare
eeb4661 to
56245e3
Compare
56245e3 to
3b204af
Compare
95dc629 to
c7dbebd
Compare
|
Updated per @amotin's private feedback. Rebased onto master. |
a695cd9 to
28c1e17
Compare
robn
left a comment
There was a problem hiding this comment.
Sort of a low-form of drive-by review from me (ride-by review?).
@ixhamza ran this by me early and I said "looks plausible, and SAs have the nice property of being forward and backward compatible if assembled sensibly". I haven't looked since, but have reread through it now, and I still think that, and I trust from the review comments and the diffs that its been thought about in enough detail that I don't have to think it all through from first principles. (but I will if someone tells me to).
One thought though, have we actually tested that things still work right (fsvo) when importing a pool that has run on this change back into an older ZFS that hasn't? And/or, Linux<->FreeBSD? I mean, that's weird, and coming from cold I don't expect it'd make much different for NFS clients so long as the number doesn't go backwards. Just so long as we're not inadvertently causing a pool not to import or anything. I think its fine, but doesn't hurt to ask.
(Aside: SA API always feel so difficult...)
|
Thanks @robn for taking a look.
Linux<->FreeBSD imports cleanly both ways. Older <-> Newer ZFS imports fine too since SA_ZPL_SEQ is just an opaque SA attribute. |
28c1e17 to
8f193da
Compare
|
Pushed a follow up commit covering a few additional modification sites where |
robn
left a comment
There was a problem hiding this comment.
Bunch of comments, but all of the kind I would point at in a pair review and say "hmm, what's that about?" and you'll likely tell me and I'll learn something new (man that'd be SO MUCH FUN).
Assuming you know what you're doing and approving on that basis. This is a proper tour de force, nice work!
| mutex_enter(&zp->z_lock); | ||
| zp->z_seq++; | ||
| mutex_exit(&zp->z_lock); |
There was a problem hiding this comment.
My first thought was "how contended is this lock?". Is this called on the first fault for the mapping, or the first fault for each page in the mapping? The first is probably "eh", the second maybe is more of a concern?
The other thought is, is all change to z_seq everywhere protected by z_lock. Very difficult to follow unfortunately.
Both those could perhaps be obviated by making z_seq an atomic.
(to be clear: I have nothing to suggest these are problems, and don't understand all of this well enough to find that out without a lot of reading, so if you tell me its fine, then it is!)
There was a problem hiding this comment.
It's indeed per page, but page_mkwrite only fires when a clean page goes dirty. Atomic made more sense, so I dropped the lock in page_mkwrite and zfs_create and switched z_seq operations to atomic.
| * rather than a vmalloc'ed region. | ||
| */ | ||
| /* | ||
| * Bump z_seq when a clean page first transitions to dirty via an mmap store. |
There was a problem hiding this comment.
If this only bumps on transitions read->write, what happens on subsequent writes to the same map if GETATTR arrives between those? Or is that those are uncommitted, so not visible?
There was a problem hiding this comment.
mtime only bumps when the page first goes dirty, not on later writes, and z_seq bumps on the same page_mkwrite. So the two stay in sync, the cookie matches mtime which is what NFS already relies on.
| error = -zfs_freesp(zp, offset + len, 0, 0, FALSE); | ||
| /* | ||
| * extend file: log=TRUE drives z_seq bump, | ||
| * mtime/ctime advance, and TX_TRUNCATE ZIL | ||
| * record; matches zfs_space(). | ||
| */ | ||
| error = -zfs_freesp(zp, offset + len, 0, 0, TRUE); |
There was a problem hiding this comment.
Hmm, this feels like an actual bugfix beyond bumping z_seq, yes? Which is fine if so, but then is there anything more we need to do here? Tests, close anything in the bug tracker, etc? (I didn't look right now, but I know quirks around hole-punching come up from time to time, and I wonder...)
There was a problem hiding this comment.
Yeah, it's indeed a real bugfix. Can be reproduced with the following script:
zpool create tank -O mountpoint=/mnt/tank sda
cd /mnt/tank
truncate -s 4096 f
sync
# mtime ctime (seconds)
before=$(stat -c '%Y %Z' f)
sleep 2
# extend past EOF, no keep-size
fallocate -l 1048576 f
after=$(stat -c '%Y %Z' f)
echo "before: $before"
echo "after: $after"Without the change before and after stay the same, with it the mtime/ctime advance. Moved it to its own commit and added a test, fallocate_extend_timestamps.
| ip->i_mode = ITOZ(ip)->z_mode = mode; | ||
| zpl_inode_set_ctime_to_ts(ip, | ||
| current_time(ip)); | ||
| ITOZ(ip)->z_seq++; |
There was a problem hiding this comment.
(It was these cases and others that made me wonder about the locking in zpl_page_mkwrite(), but like I say, didn't study them all).
There was a problem hiding this comment.
All the z_seq bumps are atomic now, including these xattr/ACL ones.
Commit 312bdab advertises STATX_ATTR_CHANGE_MONOTONIC and builds the NFSv4 change_cookie from (ctime.tv_sec << 32) | zp->z_seq. zp->z_seq is reset to a magic constant in zfs_znode_alloc(), so any event that drops the znode from cache (memory pressure, remount, reboot) regresses the lower bits of the cookie, a backward step within the same second. NFSv4 clients that trust this contract treat a regressed cookie as evidence that the file's metadata cannot be relied on. VMware ESXi over NFSv4.1 surfaces this as "The file specified is not a virtual disk", and a VM stored on the affected NFS-exported ZFS dataset fails to power on. Widen z_seq to 64 bit and present it directly as the change_cookie, dropping the ctime packing, so the cookie is a single monotonic counter that no longer depends on the clock. FreeBSD's va_filerev consumer also takes the wider value. Persist z_seq via a new SA attribute SA_ZPL_SEQ. An in-core marker zp->z_has_seq records whether the file already carries SA_ZPL_SEQ in its layout; it is derived at load time and never stored on disk, so no global pflag bit is consumed. ZFS_SEQ_MAY_GROW() keys off the marker to grow the SA layout only on the first add per file; ZFS_PERSIST_SEQ() then sets the marker and adds SEQ to the caller's bulk alongside the file's other SA attributes. zfs_znode_alloc() restores z_seq from SA_ZPL_SEQ when present and sets the marker; zfs_rezget() recomputes the marker in place on rollback/recv without disturbing the in-core z_seq, keeping the cookie monotonic. A file written before this change carries no SA_ZPL_SEQ; on Linux it is seeded with (ctime.tv_sec + 1) << 32 so the counter starts above any pre-change cookie and stays monotonic across the upgrade. A missing attribute is simply treated as not-yet-migrated, not an error. FreeBSD never folded ctime into va_filerev, so it needs no seed. No feature flag or on-disk format change is needed: the new SA attribute is keyed by name, so an implementation that does not know it preserves it opaquely, and the first modify lazily migrates each file. Covers both the Linux and FreeBSD ZPL. Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Growing a file with fallocate updated its size but left mtime/ctime unchanged and didn't log the change. A fallocate that changes the file size should update mtime/ctime, and the change should be logged so it survives a crash. Pass log=TRUE to zfs_freesp() on the extend path so it updates the timestamps and logs the size change, matching zfs_space(). Punch-hole and zero-range already use this path and are unaffected. Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
8f193da to
7d01e1b
Compare
| boolean_t fuid_dirtied = B_FALSE; | ||
| boolean_t handle_eadir = B_FALSE; | ||
| sa_bulk_attr_t bulk[7], xattr_bulk[7]; | ||
| sa_bulk_attr_t bulk[9], xattr_bulk[7]; |
There was a problem hiding this comment.
It looks to me like xattr_bulk is actually slightly oversized here.
| sa_bulk_attr_t bulk[9], xattr_bulk[7]; | |
| sa_bulk_attr_t bulk[9], xattr_bulk[6]; |
These sizes look right, but there are so many conditionals it's hard to see. To make sure there are no future overflows how about adding a few asserts either after out2:, or better yet before the relevant sa_bulk_update().
ASSERT3S(count, <=, 9);
ASSERT3S(xattr_count, <=, 6);
Motivation and Context
Commit 312bdab advertises
STATX_ATTR_CHANGE_MONOTONICto knfsd and builds the NFSv4change_cookiefrom(ctime.tv_sec << 32) | zp->z_seq.zp->z_seqis reset to a magic constant inzfs_znode_alloc(), so any event that drops the znode from cache (memory pressure, remount, reboot) brings the file back with the samectime.tv_secupper bits but a smallerz_seqin the lower bits, regressing the cookie within the same second.NFSv4 clients that trust the monotonicity contract treat this as metadata they cannot rely on. VMware ESXi over NFSv4.1 reliably reproduces it with The file specified is not a virtual disk, causing a VM stored on the affected ZFS dataset to fail to power on.
Description
Persist
zp->z_seqvia a new SA attributeSA_ZPL_SEQso it survives znode eviction. A new pflag bitZFS_HAS_SEQmarks the file as carryingSA_ZPL_SEQin its layout, mirroring the existingZPL_PROJID/ZFS_PROJIDpattern. The bit gatesmay_growat SA tx-hold sites, choosingB_TRUEon the first add per file andB_FALSEthereafter, so steady-state operations pay no extra reservation.A
ZFS_PERSIST_SEQ()macro capturesz_seqand sets the bit into the caller's bulk in one step, persisting both atomically alongside the file's other SA attributes. Every site that bumpsz_sequses it.zfs_znode_alloc()restoresz_seqfromSA_ZPL_SEQwhen the bit is set.No on-disk format change requiring a feature flag is needed. Older binaries preserve the new attribute and bit opaquely. The first modify by a patched binary lazily migrates each file.
How Has This Been Tested?
Types of changes
Checklist:
Signed-off-by.