System information
| Type | Version/Name |
| --- | --- |
| Distribution Name | Debian |
| Distribution Version | 13 (trixie) |
| Kernel Version | 6.12.94+deb13-amd64 |
| Architecture | x86_64 |
| OpenZFS Version | 2.3.2-2 |
Describe the problem you're observing
When the Linux fh_to_dentry() export operation is called for an open-unlinked file on a ZFS filesystem, an ESTALE error is returned. This behavior differs from other Linux filesystems (e.g. ext4, XFS), which grant access to inodes that are unlinked but still referenced. This deviation can lead to unexpected behavior for callers of open_by_handle_at(), or during NFS exports of ZFS filesystems (as in #11163 or #6197).
This is due to the following check in zfs_vget():
    if (zp->z_unlinked || zp_gen != fid_gen) {
        dprintf("znode gen (%llu) != fid gen (%llu)\n", zp_gen,
            fid_gen);
        zrele(zp);
        zfs_exit(zfsvfs, FTAG);
        return (SET_ERROR(ENOENT));
    }
which has been around since the initial commit in git. It follows a call to zfs_zget() where some related checks have been added and refined over the years. In the current code, once we hit the cited check in zfs_vget(), we can be sure that we hold a reference on the inode, and iput_final() has not yet been invoked. It's not clear to me what the additional check for z_unlinked in zfs_vget() is trying to protect against, and why it shouldn't just hand out the znode instead. Initial tests with a trimmed condition if (zp_gen != fid_gen) seem to work fine, and bring ZFS fh_to_dentry behavior in line with other Linux filesystems.
Similar checks exist in the FreeBSD-specific code, but I'm not sure about the expected behavior on that platform.
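The difference between the current condition and the trimmed one can be modeled in user space. The following is only a sketch: `struct model_znode`, `model_vget_current()` and `model_vget_trimmed()` are hypothetical names made up for illustration, not OpenZFS code, and the model ignores locking and reference counting entirely.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two znode fields the check looks at. */
struct model_znode {
	bool     z_unlinked;  /* file was unlinked, but may still be open */
	uint64_t z_gen;       /* generation number of the object */
};

/* Condition as cited from zfs_vget(): any unlinked znode is rejected. */
int
model_vget_current(const struct model_znode *zp, uint64_t fid_gen)
{
	if (zp->z_unlinked || zp->z_gen != fid_gen)
		return (ENOENT);
	return (0);
}

/* Trimmed condition: only a generation mismatch is rejected. */
int
model_vget_trimmed(const struct model_znode *zp, uint64_t fid_gen)
{
	if (zp->z_gen != fid_gen)
		return (ENOENT);
	return (0);
}
```

For a znode with `z_unlinked` set and a matching generation, the first function returns ENOENT while the second returns 0. A generation mismatch is rejected by both, so stale handles to a reused object number would still be refused under the trimmed condition.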
Describe how to reproduce the problem
The following reproducer illustrates the difference. It must be run with root privileges, because open_by_handle_at() requires CAP_DAC_READ_SEARCH. It creates a file local_testfile.txt in the current working directory and unlinks it while keeping it open. Before the file is unlinked, a handle is obtained; that handle is later used to open the already unlinked file. On (at least) ext4, XFS, btrfs, and tmpfs, the test succeeds. On ZFS, it fails with ESTALE.
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>
#include <sys/stat.h>

#ifndef MAX_HANDLE_SZ
#define MAX_HANDLE_SZ 128
#define AT_FDCWD -100
#endif

int main(void) {
    const char *filename = "local_testfile.txt";
    int mount_fd, file_fd, handle_fd;
    struct file_handle *fhp = NULL;
    int mnt_id;
    int ret = 2;

    fhp = malloc(sizeof(struct file_handle) + MAX_HANDLE_SZ);
    if (!fhp) {
        perror("[-] malloc failed");
        goto out_fhp;
    }
    fhp->handle_bytes = MAX_HANDLE_SZ;

    mount_fd = open(".", O_RDONLY | O_DIRECTORY);
    if (mount_fd < 0) {
        perror("[-] open (mount_fd) failed");
        goto out_fhp;
    }

    file_fd = open(filename, O_CREAT | O_RDWR | O_TRUNC, 0644);
    if (file_fd < 0) {
        perror("[-] create test file failed");
        goto out_mount;
    }
    printf("[*] file '%s' created and opened (fd=%d).\n", filename, file_fd);

    if (name_to_handle_at(AT_FDCWD, filename, fhp, &mnt_id, 0) < 0) {
        perror("[-] name_to_handle_at failed");
        goto out_file;
    }
    printf("[*] successfully obtained filehandle from file system.\n");

    if (unlink(filename) < 0) {
        perror("[-] unlink failed");
        goto out_file;
    }
    printf("[*] file unlinked (dentry deleted, fd reference %d still active).\n", file_fd);

    printf("[*] calling open_by_handle_at()...\n");
    handle_fd = open_by_handle_at(mount_fd, fhp, O_RDONLY);
    if (handle_fd < 0) {
        printf("[!] FAIL: open_by_handle_at() failed: errno %d (%s)\n",
               errno, strerror(errno));
        ret = 1;
    } else {
        printf("[+] OK: open_by_handle_at() succeeded! (new_fd=%d)\n", handle_fd);
        close(handle_fd);
        ret = 0;
    }

out_file:
    close(file_fd);
out_mount:
    close(mount_fd);
out_fhp:
    free(fhp);
    return ret;
}
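To confirm that a file really is in the open-unlinked state at the point where open_by_handle_at() would be called, the link count reported by fstat() on the still-open descriptor can be checked. The helper below is a minimal sketch and needs no special privileges; `nlink_after_unlink()` is a hypothetical name chosen for illustration. On local Linux filesystems the returned count is expected to be 0, but network or FUSE filesystems may behave differently.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create `path`, unlink it while keeping the descriptor open, and
 * return the link count reported by fstat(), or -1 on error.
 */
long
nlink_after_unlink(const char *path)
{
	struct stat st;
	long nlink = -1;
	int fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0600);

	if (fd < 0)
		return (-1);
	/* The inode stays alive as long as fd is open. */
	if (unlink(path) == 0 && fstat(fd, &st) == 0)
		nlink = (long)st.st_nlink;
	close(fd);
	return (nlink);
}
```

A link count of 0 together with a still-valid descriptor is the state in which the other filesystems listed above accept the file handle and ZFS rejects it.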
Include any warning/errors/backtraces from the system logs
The following bpftrace script can be used to track fh_to_dentry calls for inodes on ZFS, along with their corresponding z_unlinked state.
#!/usr/bin/env bpftrace
#include <linux/fs.h>

/* Mark FH lookup entry */
kprobe:zfs:zpl_fh_to_dentry
{
    @in_fh[tid] = 1;
}

/* Store zpp on entry into zfs_zget */
kprobe:zfs:zfs_zget
/ @in_fh[tid] /
{
    @zpp_store[tid] = arg2;
}

/* Obtain Znode pointer on success */
kretprobe:zfs:zfs_zget
/ @in_fh[tid] && @zpp_store[tid] /
{
    $zpp = @zpp_store[tid];
    if (retval == 0 && $zpp != 0) {
        @active_zp[tid] = *(int64 *)$zpp;
    }
}

/* Show inode number and unlinked state */
kprobe:zfs:sa_lookup
/ @in_fh[tid] && @active_zp[tid] /
{
    $zp = @active_zp[tid];
    // Access the Linux inode embedded in the OpenZFS znode structure
    $inode_num = ((struct znode *)$zp)->z_inode.i_ino;
    $z_unlinked = ((struct znode *)$zp)->z_unlinked;
    printf("[NFS-ZFS-FH-TRACKER] Inode %lu, z_unlinked %d\n", $inode_num, $z_unlinked);
}

/* Cleanup temporary maps on exit */
kretprobe:zfs:zpl_fh_to_dentry
{
    if (@in_fh[tid]) { delete(@in_fh[tid]); }
    if (@zpp_store[tid]) { delete(@zpp_store[tid]); }
    if (@active_zp[tid]) { delete(@active_zp[tid]); }
}