true async io in linux #18684

Open
tiehexue wants to merge 7 commits into openzfs:master from tiehexue:true_async

@tiehexue

Contributor

Motivation and Context

This is for #18660.

Description

zio_nowait is already asynchronous, so we just wrap it at the dmu, vfs, and zpl layers. Reads and writes each have a completion callback; writes chain two callbacks, one for data and one for metadata. The metadata callback is issued from the system taskq.
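
The callback chain described above can be illustrated with a minimal Python sketch. This is not the PR's code: `io_pool`, `taskq`, and `async_write` are hypothetical stand-ins (two thread pools play the roles of the I/O pipeline and the system taskq), and the only point is the ordering, where the data callback fires first and hands the metadata step to a separate queue.

```python
from concurrent.futures import Future, ThreadPoolExecutor

# Hypothetical stand-ins, not the PR's real code: io_pool plays the role of
# the I/O pipeline, taskq plays the role of the system taskq.
io_pool = ThreadPoolExecutor(max_workers=4)
taskq = ThreadPoolExecutor(max_workers=1)


def async_write(data: bytes, events: list) -> Future:
    """Submit a write; the returned future resolves after both stages."""
    done: Future = Future()

    def metadata_stage(nbytes: int) -> None:
        # Stage 2: metadata update, run from the task queue.
        events.append("metadata")
        done.set_result(nbytes)

    def data_done(fut: Future) -> None:
        # Stage 1 callback: the data is written, so hand the metadata work
        # to the task queue instead of doing it in the completion context.
        events.append("data")
        taskq.submit(metadata_stage, fut.result())

    # len() stands in for the write itself and returns the byte count.
    io_pool.submit(len, data).add_done_callback(data_done)
    return done  # the submitter does not block here


events: list = []
fut = async_write(b"x" * 4096, events)
print(fut.result(timeout=5), events)
```

The submitter gets control back immediately and only waits when it asks for the result, which is the behaviour libaio and io_uring callers rely on.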

To make this PR production-ready, I need help with code review, design review, and testing on real hardware.

It works fine on Ubuntu 26, and I tried to ensure it works for the Linux distributions in the CI. Below is a simple test in a local VM; as iodepth increases, we see better IOPS.

wy@u26:~$ fio --filename=/t/fs1/testfile --rw=randread --bs=128k --size=2G \
  --direct=1 --ioengine=libaio --iodepth=64 --runtime=10 --name=buf-read
buf-read: (g=0): rw=randread, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=64
fio-3.41
Starting 1 process

buf-read: (groupid=0, jobs=1): err= 0: pid=143880: Wed Jun 17 16:21:10 2026
  read: IOPS=21.8k, BW=2731MiB/s (2863MB/s)(2048MiB/750msec)
    slat (nsec): min=1875, max=3263.9k, avg=6356.67, stdev=42458.30
    clat (usec): min=597, max=16225, avg=2864.22, stdev=852.05
     lat (usec): min=600, max=19489, avg=2870.57, stdev=861.45
    clat percentiles (usec):
     |  1.00th=[ 1139],  5.00th=[ 1500], 10.00th=[ 1778], 20.00th=[ 2147],
     | 30.00th=[ 2409], 40.00th=[ 2671], 50.00th=[ 2900], 60.00th=[ 3064],
     | 70.00th=[ 3261], 80.00th=[ 3556], 90.00th=[ 3884], 95.00th=[ 4146],
     | 99.00th=[ 4883], 99.50th=[ 5080], 99.90th=[ 6325], 99.95th=[ 8356],
     | 99.99th=[15270]
   bw (  MiB/s): min= 2710, max= 2710, per=99.25%, avg=2710.08, stdev= 0.00, samples=1
   iops        : min=21680, max=21680, avg=21680.00, stdev= 0.00, samples=1
  lat (usec)   : 750=0.03%, 1000=0.38%
  lat (msec)   : 2=15.39%, 4=76.77%, 10=7.39%, 20=0.05%
  cpu          : usr=2.54%, sys=19.89%, ctx=1314, majf=0, minf=525
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.2%, >=64=99.6%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=16384,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=2731MiB/s (2863MB/s), 2731MiB/s-2731MiB/s (2863MB/s-2863MB/s), io=2048MiB (2147MB), run=750-750msec
wy@u26:~$ fio --filename=/t/fs1/testfile --rw=randread --bs=128k --size=2G \
  --direct=0 --ioengine=libaio --iodepth=64 --runtime=10 --name=buf-read
buf-read: (g=0): rw=randread, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=64
fio-3.41
Starting 1 process
Jobs: 1 (f=1): [r(1)][85.7%][r=415MiB/s][r=3318 IOPS][eta 00m:01s]
buf-read: (groupid=0, jobs=1): err= 0: pid=143904: Wed Jun 17 16:21:27 2026
  read: IOPS=2858, BW=357MiB/s (375MB/s)(2048MiB/5731msec)
    slat (usec): min=13, max=158187, avg=348.37, stdev=1241.85
    clat (usec): min=478, max=187533, avg=21962.99, stdev=11328.68
     lat (usec): min=607, max=187847, avg=22311.37, stdev=11437.37
    clat percentiles (msec):
     |  1.00th=[   16],  5.00th=[   17], 10.00th=[   18], 20.00th=[   18],
     | 30.00th=[   19], 40.00th=[   20], 50.00th=[   21], 60.00th=[   22],
     | 70.00th=[   22], 80.00th=[   23], 90.00th=[   27], 95.00th=[   33],
     | 99.00th=[   43], 99.50th=[   48], 99.90th=[  188], 99.95th=[  188],
     | 99.99th=[  188]
   bw (  KiB/s): min=251912, max=439040, per=98.40%, avg=360073.73, stdev=60330.77, samples=11
   iops        : min= 1968, max= 3430, avg=2812.91, stdev=471.52, samples=11
  lat (usec)   : 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.04%, 4=0.04%, 10=0.16%, 20=41.89%, 50=57.45%
  lat (msec)   : 100=0.01%, 250=0.38%
  cpu          : usr=0.56%, sys=22.53%, ctx=16152, majf=43, minf=1140
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.2%, >=64=99.6%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=16384,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=357MiB/s (375MB/s), 357MiB/s-357MiB/s (375MB/s-375MB/s), io=2048MiB (2147MB), run=5731-5731msec
wy@u26:~$ echo 0 | sudo tee /sys/module/zfs/parameters/zfs_async_dio_enabled
0
wy@u26:~$ fio --filename=/t/fs1/testfile --rw=randread --bs=128k --size=2G \
  --direct=1 --ioengine=libaio --iodepth=64 --runtime=10 --name=buf-read
buf-read: (g=0): rw=randread, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=64
fio-3.41
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=518MiB/s][r=4145 IOPS][eta 00m:00s]
buf-read: (groupid=0, jobs=1): err= 0: pid=143941: Wed Jun 17 16:21:43 2026
  read: IOPS=3984, BW=498MiB/s (522MB/s)(2048MiB/4112msec)
    slat (usec): min=70, max=1536, avg=249.75, stdev=73.10
    clat (nsec): min=1625, max=25795k, avg=15751343.44, stdev=1374828.08
     lat (usec): min=182, max=26371, avg=16001.09, stdev=1392.29
    clat percentiles (usec):
     |  1.00th=[13698],  5.00th=[14222], 10.00th=[14484], 20.00th=[14746],
     | 30.00th=[15008], 40.00th=[15270], 50.00th=[15664], 60.00th=[15926],
     | 70.00th=[16188], 80.00th=[16712], 90.00th=[17171], 95.00th=[17695],
     | 99.00th=[20841], 99.50th=[21627], 99.90th=[22152], 99.95th=[23987],
     | 99.99th=[25560]
   bw (  KiB/s): min=447232, max=536064, per=99.56%, avg=507743.50, stdev=29912.62, samples=8
   iops        : min= 3494, max= 4188, avg=3966.62, stdev=233.67, samples=8
  lat (usec)   : 2=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.02%, 4=0.06%, 10=0.18%, 20=98.56%, 50=1.14%
  cpu          : usr=0.66%, sys=7.86%, ctx=16384, majf=0, minf=526
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.2%, >=64=99.6%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=16384,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=498MiB/s (522MB/s), 498MiB/s-498MiB/s (522MB/s-522MB/s), io=2048MiB (2147MB), run=4112-4112msec
wy@u26:~$ 
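
A quick sanity check on the three runs above, as a sketch: by Little's law, the mean number of I/Os in flight equals throughput times mean latency. Plugging in the IOPS and mean `lat` figures reported by fio (the helper name `in_flight` is made up for illustration) suggests that all three runs kept roughly 64 requests outstanding, so the IOPS gap appears to come from per-request latency rather than from queue depth.

```python
def in_flight(iops: float, mean_lat_usec: float) -> float:
    """Little's law: outstanding I/Os = throughput (1/s) * mean latency (s)."""
    return iops * mean_lat_usec / 1_000_000


# (IOPS, mean total latency in usec) copied from the fio output above.
runs = {
    "direct=1, async DIO enabled": (21_800, 2870.57),
    "direct=0 (buffered)": (2858, 22311.37),
    "direct=1, async DIO disabled": (3984, 16001.09),
}

for name, (iops, lat) in runs.items():
    print(f"{name}: ~{in_flight(iops, lat):.1f} I/Os in flight")
```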

How Has This Been Tested?

Tested on Ubuntu 26 and in my personal zfs fork's CI.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

@tonyhutter

Contributor

Wow nice speedup! 👍

Here's the condensed version of your results:

DIO randread + async DIO enabled
  read: IOPS=21.8k, BW=2731MiB/s (2863MB/s)(2048MiB/750msec)

DIO randread + async DIO disabled
  read: IOPS=3984, BW=498MiB/s (522MB/s)(2048MiB/4112msec)

I tried to replicate your test on a single NVMe drive pool (all default pool settings) and got similar results:

sudo ./zpool export t && sudo ./zpool import t && fio --filename=/t/fs1/testfile --rw=randread --bs=128k --size=2G --direct=1 --ioengine=libaio --iodepth=64 --runtime=10  --time_based=1 --group_reporting --name=buf-read

DIO randread + async DIO enabled
  read: IOPS=26.9k, BW=3357MiB/s (3521MB/s)(32.8GiB/10002msec)

DIO randread + async DIO disabled
  read: IOPS=4354, BW=544MiB/s (571MB/s)(5443MiB/10001msec)

I also confirmed in iotop and zpool iostat that it really was reading at that speed:

              capacity     operations     bandwidth 
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
t           10.0G  1.44T  9.45K      0  1.18G      0
t           10.0G  1.44T  26.0K      0  3.25G      0
t           10.0G  1.44T  26.2K      0  3.28G      0
t           10.0G  1.44T  26.2K      0  3.28G      0

Note, I did try the test with random writes and didn't really see a performance difference:

DIO randwrite + async DIO disabled
  write: IOPS=2254, BW=282MiB/s (296MB/s)(2819MiB/10001msec); 0 zone resets

DIO randwrite + async DIO enabled
  write: IOPS=1993, BW=249MiB/s (261MB/s)(2492MiB/10001msec); 0 zone resets
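
For reference, the ratios implied by the figures in this comment can be worked out with a few lines of Python. This is only arithmetic on the IOPS values quoted above; the helper name `speedup` is illustrative.

```python
def speedup(enabled_iops: float, disabled_iops: float) -> float:
    """Ratio of IOPS with async DIO enabled to IOPS with it disabled."""
    return enabled_iops / disabled_iops


# IOPS figures copied from the results quoted in this comment.
results = {
    "randread, VM (tiehexue)": (21_800, 3984),
    "randread, NVMe (tonyhutter)": (26_900, 4354),
    "randwrite, NVMe (tonyhutter)": (1993, 2254),
}

for name, (on, off) in results.items():
    print(f"{name}: {speedup(on, off):.2f}x")
```

A ratio below 1 for the randwrite case means the run with async DIO enabled was slightly slower in that single measurement.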

@tiehexue

tiehexue commented Jun 17, 2026

Contributor Author

@tonyhutter

Yes, write is not good enough. However, if you raise iodepth from 1 to 8 to 64 with async enabled, you can see throughput multiply. On my loop devices, iodepth=8 hits peak performance.

wy@u26:~$  echo 1 | sudo tee /sys/module/zfs/parameters/zfs_async_dio_enabled
1
wy@u26:~$ fio --filename=/t/fs1/testfile --rw=randwrite --bs=128k --size=1G --direct=1 --ioengine=libaio --iodepth=1 --runtime=10 --name=buf-read
buf-read: (g=0): rw=randwrite, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=1
fio-3.41
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=263MiB/s][w=2105 IOPS][eta 00m:00s]
buf-read: (groupid=0, jobs=1): err= 0: pid=23374: Thu Jun 18 02:17:38 2026
  write: IOPS=2099, BW=262MiB/s (275MB/s)(1024MiB/3902msec); 0 zone resets

wy@u26:~$ fio --filename=/t/fs1/testfile --rw=randwrite --bs=128k --size=1G --direct=1 --ioengine=libaio --iodepth=8 --runtime=10 --name=buf-read
buf-read: (g=0): rw=randwrite, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=8
fio-3.41
Starting 1 process
Jobs: 1 (f=1)
buf-read: (groupid=0, jobs=1): err= 0: pid=23403: Thu Jun 18 02:17:48 2026
  write: IOPS=3972, BW=497MiB/s (521MB/s)(1024MiB/2062msec); 0 zone resets

wy@u26:~$ echo 0 | sudo tee /sys/module/zfs/parameters/zfs_async_dio_enabled
0
wy@u26:~$ fio --filename=/t/fs1/testfile --rw=randwrite --bs=128k --size=1G --direct=1 --ioengine=libaio --iodepth=1 --runtime=10 --name=buf-read
buf-read: (g=0): rw=randwrite, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=1
fio-3.41
Starting 1 process
Jobs: 1 (f=1): [w(1)][-.-%][w=333MiB/s][w=2666 IOPS][eta 00m:00s]
buf-read: (groupid=0, jobs=1): err= 0: pid=23443: Thu Jun 18 02:18:13 2026
  write: IOPS=2613, BW=327MiB/s (343MB/s)(1024MiB/3135msec); 0 zone resets

wy@u26:~$ fio --filename=/t/fs1/testfile --rw=randwrite --bs=128k --size=1G --direct=1 --ioengine=libaio --iodepth=8 --runtime=10 --name=buf-read
buf-read: (g=0): rw=randwrite, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=8
fio-3.41
Starting 1 process
Jobs: 1 (f=1): [w(1)][-.-%][w=330MiB/s][w=2636 IOPS][eta 00m:00s]
buf-read: (groupid=0, jobs=1): err= 0: pid=23471: Thu Jun 18 02:18:23 2026
  write: IOPS=2657, BW=332MiB/s (348MB/s)(1024MiB/3083msec); 0 zone resets
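
To repeat the comparison above at several queue depths, the command lines can be generated rather than typed by hand. The sketch below only builds the shell lines and does not run anything; the sysfs path and fio flags are copied from the transcripts, while the function names and the choice of depths (1, 8, 64) are assumptions for illustration.

```python
import shlex

# Path and flags are taken from the transcripts above; nothing is executed.
PARAM = "/sys/module/zfs/parameters/zfs_async_dio_enabled"


def fio_cmd(iodepth: int, rw: str = "randwrite") -> list[str]:
    """Build the fio command line used above for one queue depth."""
    return [
        "fio", "--filename=/t/fs1/testfile", f"--rw={rw}", "--bs=128k",
        "--size=1G", "--direct=1", "--ioengine=libaio",
        f"--iodepth={iodepth}", "--runtime=10", "--name=buf-read",
    ]


def sweep(depths=(1, 8, 64)) -> list[str]:
    """Return shell lines that toggle async DIO and run fio at each depth."""
    lines = []
    for enabled in (1, 0):
        lines.append(f"echo {enabled} | sudo tee {PARAM}")
        lines.extend(shlex.join(fio_cmd(d)) for d in depths)
    return lines


print("\n".join(sweep()))
```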

@viniciusferrao viniciusferrao left a comment

Sorry to be direct, but have you tested that? I'm testing it now and have already found issues with memory-mapped buffered writes. I'm no expert in that and am trying to figure it out, but the logic is flaky.

Again, not trying to be a douche, but that was LLM-generated, right? That's fine, I also generate lots of code with it, but it should be tested thoroughly to be valid.

I still have nightmares about async IO, mainly due to how ZFS implements the DMU. This is a chapter that comes up over and over again and has so many corner cases that we cannot easily anticipate them.

I have accumulated some knowledge on this issue for about 5 years now, but again, I'm no expert. I'm also interested in NVMe performance, not only on Linux but also on FreeBSD. But I think this patch is not up to the standards of the project. Yes, I cannot do it myself because I may lack the skills, but there are too many assumptions, like the "double the pages" one, that screamed LLM-generated code to me.

Comment thread module/os/linux/zfs/zfs_uio.c Outdated
Comment thread module/os/linux/zfs/zfs_uio.c Outdated
Comment thread module/os/linux/zfs/zpl_file.c
Comment thread module/zfs/dmu_direct.c Outdated
Comment thread module/os/linux/zfs/zpl_file.c Outdated
Comment thread module/zfs/zfs_vnops.c Outdated
Comment thread module/zfs/zfs_vnops.c Outdated
@tonyhutter

Contributor

@tiehexue I'm currently testing this PR in my local CI with zfs_async_dio_enabled manually set to 1 by default. That way all the tests are running with async DIO on. When I do that, I'm seeing a lot of killed tests.

I tried running ZTS on my local VM (again, with zfs_async_dio_enabled=1) and hit this on one of the io tests:

Jun 17 17:46:15 fedora42 kernel: VERIFY3U(off + size, <=, sabd->abd_size) failed (131072 <= 32768)
Jun 17 17:46:15 fedora42 kernel: PANIC at abd.c:612:abd_get_offset_size()
Jun 17 17:46:15 fedora42 kernel: Showing stack for process 45495
Jun 17 17:46:15 fedora42 kernel: CPU: 6 UID: 0 PID: 45495 Comm: fio Tainted: P           OE       6.19.14-100.fc42.x86_64 #1 PREEMPT(lazy) 
Jun 17 17:46:15 fedora42 kernel: Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Jun 17 17:46:15 fedora42 kernel: Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.16.0-4.module+el8.9.0+19570+14a90618 04/01/2014
Jun 17 17:46:15 fedora42 kernel: Call Trace:
Jun 17 17:46:15 fedora42 kernel:  <TASK>
Jun 17 17:46:15 fedora42 kernel:  dump_stack_lvl+0x5d/0x80
Jun 17 17:46:15 fedora42 kernel:  spl_panic+0xf5/0x11a [spl]
Jun 17 17:46:15 fedora42 kernel:  ? __pfx_zfs_async_write_complete+0x10/0x10 [zfs]
Jun 17 17:46:15 fedora42 kernel:  abd_get_offset_size+0x64/0x80 [zfs]
Jun 17 17:46:15 fedora42 kernel:  dmu_write_abd_async+0x158/0x230 [zfs]
Jun 17 17:46:15 fedora42 kernel:  zfs_write_async+0x516/0x950 [zfs]
Jun 17 17:46:15 fedora42 kernel:  ? __pfx_zfs_async_write_complete+0x10/0x10 [zfs]
Jun 17 17:46:15 fedora42 kernel:  zpl_iter_write+0x146/0x2d0 [zfs]
Jun 17 17:46:15 fedora42 kernel:  aio_write+0x15b/0x290
Jun 17 17:46:15 fedora42 kernel:  ? fget+0x73/0xa0
Jun 17 17:46:15 fedora42 kernel:  ? io_submit_one+0xef/0x3a0
Jun 17 17:46:15 fedora42 kernel:  io_submit_one+0xef/0x3a0
Jun 17 17:46:15 fedora42 kernel:  __x64_sys_io_submit+0x94/0x1f0
Jun 17 17:46:15 fedora42 kernel:  do_syscall_64+0x7e/0x690
Jun 17 17:46:15 fedora42 kernel:  ? arch_exit_to_user_mode_prepare.isra.0+0xa1/0xc0
Jun 17 17:46:15 fedora42 kernel:  ? do_syscall_64+0xbb/0x690
Jun 17 17:46:15 fedora42 kernel:  ? do_io_getevents+0x8d/0xe0
Jun 17 17:46:15 fedora42 kernel:  ? __x64_sys_io_getevents+0x77/0xe0
Jun 17 17:46:15 fedora42 kernel:  ? count_memcg_events+0xd6/0x210
Jun 17 17:46:15 fedora42 kernel:  ? do_syscall_64+0xbb/0x690
Jun 17 17:46:15 fedora42 kernel:  ? handle_mm_fault+0x212/0x340
Jun 17 17:46:15 fedora42 kernel:  ? do_user_addr_fault+0x1d9/0x7b0
Jun 17 17:46:15 fedora42 kernel:  ? clear_bhb_loop+0x50/0xa0
Jun 17 17:46:15 fedora42 kernel:  ? clear_bhb_loop+0x50/0xa0
Jun 17 17:46:15 fedora42 kernel:  ? clear_bhb_loop+0x50/0xa0
Jun 17 17:46:15 fedora42 kernel:  ? clear_bhb_loop+0x50/0xa0
Jun 17 17:46:15 fedora42 kernel:  entry_SYSCALL_64_after_hwframe+0x76/0x7e
Jun 17 17:46:15 fedora42 kernel: RIP: 0033:0x7f729859310d
Jun 17 17:46:15 fedora42 kernel: Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c3 3c 0f 00 f7 d8 64 89 01 48
Jun 17 17:46:15 fedora42 kernel: RSP: 002b:00007ffd9774bc88 EFLAGS: 00000246 ORIG_RAX: 00000000000000d1
Jun 17 17:46:15 fedora42 kernel: RAX: ffffffffffffffda RBX: 00007f729849d6d0 RCX: 00007f729859310d
Jun 17 17:46:15 fedora42 kernel: RDX: 0000564864505670 RSI: 0000000000000001 RDI: 00007f7290090000
Jun 17 17:46:15 fedora42 kernel: RBP: 00007ffd9774bcc0 R08: 00005648645031a0 R09: 00005648645030ac
Jun 17 17:46:15 fedora42 kernel: R10: 00005648644faf00 R11: 0000000000000246 R12: 00007f7290090000
Jun 17 17:46:15 fedora42 kernel: R13: 0000000000000000 R14: 0000564864505670 R15: 0000000000000001
Jun 17 17:46:15 fedora42 kernel:  </TASK>

So I'd recommend you get ZTS to pass in your local VM with zfs_async_dio_enabled=1. Make sure you build with ./configure --enable-debug to turn on the asserts.
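
The failed assertion in the trace above is a bounds check: the requested sub-range must fit inside the source buffer. A small Python model of that check is sketched below. It is an illustration, not the ZFS implementation; the log only reports the sum off + size = 131072 and the buffer size 32768, so the split into off = 0 and size = 131072 is an assumption.

```python
def get_offset_size(abd_size: int, off: int, size: int) -> tuple[int, int]:
    """Return the (start, end) byte range of a sub-buffer, or fail the check."""
    if off + size > abd_size:
        raise AssertionError(
            f"VERIFY3U(off + size, <=, abd_size) failed ({off + size} <= {abd_size})"
        )
    return off, off + size


# Assumed values: a 128 KiB request against a 32 KiB buffer.
# The log only gives the sum off + size = 131072; off = 0 is a guess.
try:
    get_offset_size(abd_size=32768, off=0, size=131072)
except AssertionError as err:
    print(err)
```

With debug builds the kernel equivalent of this check panics instead of raising, which is why building with asserts enabled surfaces the problem.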

@tiehexue

Contributor Author

OK, I missed this. I ran the async tests only locally, plus some manual tests, and verified in my personal zfs fork's CI, but did not set zfs_async_dio_enabled=1 for the whole ZTS.

@tiehexue

Contributor Author

@tonyhutter @amotin @viniciusferrao Thanks for all your reviews. One point I missed, as @tonyhutter said: I should run ZTS with zfs_async_dio_enabled=1. Let me do that and come back (it may take some time). I would be happy if you could keep reviewing, testing, and working on top of this PR, though it is better to wait until I come back with ZTS passing with zfs_async_dio_enabled=1.

I am glad that @viniciusferrao raised the question of LLM involvement. Yes, most of the code and tests were written by copilot+deepseek-v4-pro. The most amazing thing is that it just works! For #18661, this PR, and the others I created for deadlocks and data corruption, the LLM (prompted by me) can understand the codebase, the issues, and the panic stack traces very broadly, and solve them. Most importantly, the generated code is not too much to review.

I am curious about how the upstream team reviews and re-tests bigger changes like this and #18661, #18679, #18647, #18660, #18620. Especially for this one, should there be several RC cycles?

@adamdmoss

Contributor

I'm a bit upset that this intrinsically licensially-problematic LLM extrusion was thrown into the PR pool and then left to reviewers to grok and prove correctness; the world's petabytes of data integrity is in your hands. This does not feel like responsible engineering.

On the other hand, a tempting performance win is a tempting performance win (IFF demonstrably correct, see above).

It'll be an interesting test of the system.

@tiehexue

Contributor Author

Like zfs, LLMs are created by the best of IT guys for the world, for you and for me. For async IO in zfs, I need the upstream team to be heavily involved, as you will understand. And I will do my best to push this to high quality. Let us focus on the code (no birth prejudice).

The amazing part here is that async IO in zfs is doable, though it may take some time. I will be more than excited if another PR gets merged for it.

@robn

robn commented Jun 18, 2026

Member

> Like zfs, LLMs are created by the best of IT guys for the world, for you and for me.

Careful. This is very close to reading like "you should be grateful", which is unlikely to win me over (at least).

> The amazing part here is that async IO in zfs is doable, though it may take some time.

It's not amazing. It's well known that better async performance and facilities are possible, and many people have been putting in the careful, meticulous work over many many years to make it happen. Some, like #10018, made it over the line after years; others, like #10425 (and its predecessors), did not.

Maybe you can claim that large, invasive changes wouldn't take so long if there was an LLM involved. Ask me again when one of those land.

> I am curious about how the upstream team reviews and re-tests bigger changes like this and #18661, #18679, #18647, #18660, #18620. Especially for this one, should there be several RC cycles?

In my own experience, the best chance you have to get something difficult merged is to go out of your way to make it easy for a reviewer to say yes. Be meticulous in your testing and your descriptions. Break it up into multiple commits or multiple PRs, each with good commit messages and comments. Go out of your way to show that you understand reviewers have very limited time. But also, understand they all want to say yes!

I'll also say, this has nothing to do with LLMs. We don't (currently) have an AI policy; I personally am content to keep it that way. I have multiple LLM-generated branches and experiments here, and they're all very interesting, and a couple have even identified a way to solve a problem that I couldn't previously get my head around. And yet, all of them will never see light of day in that form, because there's multiple things in there that I would not be comfortable staking my own reputation on. So they sit there until I use them as a guide to reimplement the thing from scratch, my way. In that same spirit, I review LLM-based contributions based on whether I believe the person putting them forward could have written them without assistance. Or, put another way, whether they can explain what they've done and how it works.

Anyway, that's enough from me on that; this is hardly the place anyway. On this one, and the others, get the tests passing and show you understand how it works and show reasonable confidence that it isn't going to lose data, and I'm more than interested! Or, route around me; ultimately I'm just a guy, there's other people here who can review and approve changes too, and they have different priorities or standards to me. And that's ok too, because I trust them :)

@ryao

ryao commented Jun 18, 2026

Contributor

There is a great deal to review here, and I do not have the time to review it all, so I am only commenting on LLMs and contributing in general.

LLMs are useful tools, but at their core, they are a very high-level autocomplete. While they sometimes surprise me, I find that they often do not have the same idea for how things should work as I do. Sometimes I simply did not provide enough information, and they did not think to ask for clarity on the omission. Other times, they simply have their own ideas that are not always correct. As a general rule of thumb, for any non-trivial change, an LLM probably did several things that need manual correction.

In general, before submitting a PR, the author should review every line of code in context for correctness and either have reasonable certainty that it is correct or state that it is a work-in-progress (WIP). That means looking at the surrounding code, examining functions that call it for possible unforeseen side effects, etc. This applies to all commits, whether or not an LLM was used to assist in writing them.

Outside of trivial commits, the first thing people make that works looks very different from what they actually put into the final PR. Most of us hide that (since the intermediate stages of development are not what we want in the repository). A few (like the BRT changes) had a much more public development process. There have been times when I went through multiple rounds of reviewing my own commits and making revisions before making a PR, since, while things seem good while working on them, artifacts often make it into the end result that are not quite right. I have even been known to spot issues shortly after making a PR, leading to multiple rounds of revisions from last-minute nitpicking over correctness.

Finally, it is okay to submit PRs that are a WIP. Not everyone has the time to set up their own test infrastructure, and it can be genuinely useful to lean on the project's infrastructure to test things. That said, such PRs should always be marked as such.

> I'm a bit upset that this intrinsically licensially-problematic LLM extrusion was thrown into the PR pool

For anything non-trivial, it is likely a unique work, so I am not sure how a licensing issue could arise. If it is trivial, it is probably not a unique work, but triviality takes precedence. That said, some projects are more cautious here, but I am not sure if there is any need. The technology at its core is a very clever auto-complete (studying supersonic13/llama3.c makes this very clear). As long as what you are doing with it is unique, it should produce a unique output. Since it learned in a manner that emulates human learning, the unique output should be a transformative use of what it learned, just like what humans produce is a transformative use of what we learned before:

https://en.wikipedia.org/wiki/Transformative_use

For example, this entire response, minus quotations where I have fair use, is not just unique but is also a transformative use of all of the English writing I have experienced in my life. I am thus free to license it under any license I wish.

> For async IO in zfs, I need the upstream team to be heavily involved, as you will understand. And I will do my best to push this to high quality.

The community is often thrilled to help mentor people. I myself started as a student who knew relatively little, but learned from feedback. Developing a filesystem is a slow process to minimize the possibility of introducing regressions. For non-trivial changes, one often must go through multiple rounds of writing, testing, and self-reviewing in response to comments from other contributors. Part of this is to ensure quality. If you submit your best effort where you cannot see any way to improve it, you often get higher-quality responses that help make it even better.

Another part of it is that filesystem development is very hard. Erez Zadok once said that filesystem development is harder than rocket science, since a rocket only needs to work once while a filesystem must work all the time, in millions of places, 24x7. Now, with SpaceX, perhaps rocket science is catching up, but that really underscores how hard making code that works every time is. Also, despite our best efforts, mistakes still get past us from time to time.

> We don't (currently) have an AI policy; I personally am content to keep it that way.

Agreed. I see no reason to limit the tools people use as long as they take responsibility for the output. To give an example, the other day at work, I used an LLM to refactor a 50+ line function in a specific way. I could have done it myself, but it was faster to tell an LLM what I planned to do and let it make the changes. It output an improved function. I then reviewed every part of it in detail to verify it was what I had intended to produce. This is something I would have done even if I had written it myself, since it is easy to write something that has mistakes that are caught upon review.

@tiehexue

Contributor Author

@robn @ryao thanks for sharing all of this. I may not have quite caught up, but I have a feeling that we share the same principles.

tiehexue added 4 commits June 21, 2026 15:57
This is true async, on Linux only, when
using libaio, io_uring, etc. We
already have true async in zio_nowait,
so wrap it in the dmu, vfs, and zpl layers.
It works well and is multiple times
faster; refer to the newly added tests
for more information.

Signed-off-by: tiehexue <tiehexue@hotmail.com>
Signed-off-by: tiehexue <tiehexue@hotmail.com>
Signed-off-by: tiehexue <tiehexue@hotmail.com>
zfs_setup_direct is not called in the async
write path as it is in the async read
path, because pinned pages need to wait "a long
time" in the write path, which causes
FOLL_LONGTERM to hit the kernel limit.

Signed-off-by: tiehexue <tiehexue@hotmail.com>
@tiehexue

Contributor Author

@tonyhutter Hi, I had problems setting up ZTS to run locally: a lot of kernel panics not related to this PR. So I enabled async by default in the code in another branch (not in the upstream PR) and ran CI in my personal zfs fork. 9 of 12 Linux OSes finished green, and the three failures looked unrelated to this PR.
Screenshot 2026-06-21 16 35 55

tiehexue added 3 commits June 21, 2026 23:08
The crash on some OSes attributed to
pinning 2x pages was likely misleading.
With more gates added, this should be reverted.

Signed-off-by: tiehexue <tiehexue@hotmail.com>
io_uring may not exist in older kernels like
4.18, so just let the test case PASS and leave a
note; otherwise the CI fails unexpectedly.

Signed-off-by: tiehexue <tiehexue@hotmail.com>
This feature is only available on the Linux platform.

Signed-off-by: tiehexue <tiehexue@hotmail.com>
@tiehexue

tiehexue commented Jun 22, 2026

Contributor Author

Hi @amotin @tonyhutter @viniciusferrao, I resolved the issues and verified them in CI with zfs_async_dio_enabled set to both 1 and 0 for all of ZTS. @AntonHPE also provided test results on real hardware in #18660: 5x faster on randread and 50% faster on randwrite, but still much slower than xfs+zvol. This nearly killed my passion for this PR.

So I have a question. Assuming the same setup and the same test, if xfs+zvol reaches e.g. 300K IOPS, especially write IOPS, what is the theoretical maximum for zfs?

If that number is performant enough compared to xfs+zvol, I would move on. If there is a limit (because zfs has much more robust data integrity?), we must think in another way.

How do we calculate the number?

Suppose:

zfs create -o internal_fs=xfs -o fs_limit=1T tank/fs

And we actually create it with xfs+zvol. Is this still a zfs filesystem? With zvol we have a lot of nice features, and with xfs we have peak performance. What do we actually lose? And can we retain the most important zfs features plus this kind of performance?

@ryao

ryao commented Jun 22, 2026

Contributor

I resolved the issues and verified them in CI with zfs_async_dio_enabled set to both 1 and 0 for all of ZTS. @AntonHPE also provided test results on real hardware in #18660: 5x faster on randread and 50% faster on randwrite, but still much slower than xfs+zvol. This nearly killed my passion for this PR.

Improving performance is rarely just a matter of improving 1 thing. Performance improvements like those are genuinely amazing, and if this gets through the review process, future performance work will likely build on it.

@AntonHPE

It might be useful for further investigation: I created a zvol (without xfs or zfs filesystems on top of it).

I re-created the zvol from scratch.

[root@memverge4 anton]# zpool create -f -o ashift=12 -O recordsize=16K -O atime=off -O xattr=sa -O compression=off -O dedup=off -o autotrim=on tank raidz2 /dev/disk/by-id/nvme-MV003200LXUJK_ZED0A0BQ0C93 /dev/disk/by-id/nvme-MV003200LXUJK_ZED0A0BS0C93 /dev/disk/by-id/nvme-MV003200LXUJK_ZED0A0BT0C93 /dev/disk/by-id/nvme-MV003200LXUJK_ZED0A0BW0C93 /dev/disk/by-id/nvme-MV003200LXUJK_ZED0A0BX0C93 /dev/disk/by-id/nvme-MV003200LXUJK_ZEE0A09A0C93
[root@memverge4 anton]#
[root@memverge4 anton]# zfs create -V 128G -o volblocksize=16K -o compression=off -o dedup=off tank/fiotest
[root@memverge4 anton]#
[root@memverge4 anton]# echo 1 > /sys/module/zfs/parameters/zvol_dio_enabled
[root@memverge4 anton]# cat /sys/module/zfs/parameters/zvol_dio_enabled
1
[root@memverge4 anton]#
[root@memverge4 anton]# fio --name=test --rw=randread --bs=16k --filename=/dev/zvol/tank/fiotest --direct=1 --numjobs=1 --iodepth=64 --exitall --group_reporting --ioengine=libaio --runtime=60 --time_based
test: (g=0): rw=randread, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=libaio, iodepth=64
fio-3.42-67-gfb5b
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=3910MiB/s][r=250k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=128585: Tue Jun 23 14:27:17 2026
read: IOPS=261k, BW=4074MiB/s (4272MB/s)(239GiB/60001msec)
slat (nsec): min=491, max=214903, avg=1884.48, stdev=2760.97
clat (usec): min=39, max=3588, avg=243.39, stdev=139.97
lat (usec): min=48, max=3589, avg=245.28, stdev=139.69

So, single-thread 16 KB random reads (IOPS):

pure zvol (direct+async) 260k
zvol + xfs (direct+async) 300k
zvol + zfs (direct+async) 60k
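As a rough cross-check of figures like these, and one possible starting point for the earlier question of how to calculate an expected number, Little's law relates queue depth, latency, and throughput. The sketch below is not from the thread: it assumes a steady state with the queue kept full, takes `iodepth=64` and the average latency of 245.28 usec from the pure-zvol fio run above, and assumes (unconfirmed here) that the 60k zfs figure was also measured at `iodepth=64`. The function names are hypothetical.

```python
def iops_from_latency(iodepth: int, avg_latency_usec: float) -> float:
    """Little's law: completions per second = requests in flight / mean latency."""
    return iodepth / (avg_latency_usec * 1e-6)


def latency_from_iops(iodepth: int, iops: float) -> float:
    """Inverse form: mean latency in microseconds implied by an IOPS figure."""
    return iodepth / iops * 1e6


# iodepth=64 and avg lat=245.28 usec are taken from the pure-zvol fio run above.
zvol_iops = iops_from_latency(64, 245.28)
print(f"pure zvol: {zvol_iops / 1000:.0f}k IOPS predicted")

# Assumption: the 60k zfs figure was also measured at iodepth=64.
zfs_latency = latency_from_iops(64, 60_000)
print(f"zfs: about {zfs_latency:.0f} usec mean latency implied")
```

The prediction for the zvol run comes out at about 261k IOPS, which agrees with the `IOPS=261k` that fio reported. Read the other way, at a fixed queue depth any IOPS target translates directly into a per-request latency budget for the whole I/O path.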

@tiehexue

Contributor Author

pure zvol (direct+async) 260k
zvol + xfs (direct+async) 300k
zvol + zfs (direct+async) 60k

Thanks Anton.

This can be explained: a zvol has no filesystem overhead. I am building a package with both zvol direct I/O and zfs true async. It would be great if you could test again. For the zfs part, I made two optimizations: one removes metadata compression, the other batches the vdev queue. I will let you know when it is ready.

@tiehexue

Contributor Author

I resolved the issues and verified them in CI with zfs_async_dio_enabled set to both 1 and 0 for all of ZTS. @AntonHPE also provided test results on real hardware in #18660: 5x faster on randread and 50% faster on randwrite, but still much slower than xfs+zvol. This nearly killed my passion for this PR.

Improving performance is rarely just a matter of improving 1 thing. Performance improvements like those are genuinely amazing, and if this gets through the review process, future performance work will likely build on it.

Thanks, @ryao. I am refueled now. Two more optimizations were added today. I will ask @AntonHPE to verify them at his convenience, and if the results are good, I will push them to this PR. One is a module parameter to disable metadata compression, the other batches the vdev queue. Based on Anton's test results, lz4 accounts for 14% and spin locks for 17% of the write path.

@tiehexue

Contributor Author

@AntonHPE Hi, could you check whether you can download the build from https://github.com/tiehexue/zfs/releases , where I also provide some instructions for the newly added module parameters?

@AntonHPE

AntonHPE commented Jun 23, 2026 via email

@AntonHPE

AntonHPE commented Jun 24, 2026 via email

@tiehexue

tiehexue commented Jun 24, 2026 via email

Contributor Author

@tiehexue

Contributor Author

Got it!

These two parameters are not beneficial, so I will not push them to this PR.

zfs_vdev_queue_batch
zfs_mdcomp_enabled

@AntonHPE

I think not. I'm confused. Let me clarify.

I think the misunderstanding goes back to the beginning.

Initially I posted the request #18644.
In that request we discussed only sequential read/write with bs=1m and 2m on zvol, no random I/O at all.

Later I added another request, #18660.
In that request we discussed only random read/write with bs=16K on zfs (and zvol+xfs), no sequential I/O at all.

Let's set aside "sequential read/write only with bs=1m and 2m on zvol", as it works perfectly, near the limits.
Direct I/O works perfectly; review/acceptance is required.

So, while we focused on direct+async, we have two different workloads (random and sequential) and two different targets (zvol and zfs (and zvol+xfs)).

I just re-ran everything, now focusing on:

            random read, bs=16K    random write, bs=16K
zvol        256 kIOPS              180 kIOPS
zfs         60 kIOPS               35 kIOPS
zvol+xfs    168 kIOPS              98 kIOPS
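To make the gap easier to read, the kIOPS figures above can be expressed as a fraction of each baseline. This is only an arithmetic sketch over the numbers in the table; the dictionary layout and the `relative` helper are illustrative names, not anything from the PR.

```python
# kIOPS figures copied from the table above (bs=16K, direct+async).
results = {
    "zvol":     {"randread": 256, "randwrite": 180},
    "zfs":      {"randread": 60,  "randwrite": 35},
    "zvol+xfs": {"randread": 168, "randwrite": 98},
}


def relative(target: str, baseline: str, workload: str) -> float:
    """Throughput of `target` as a fraction of `baseline` for one workload."""
    return results[target][workload] / results[baseline][workload]


for workload in ("randread", "randwrite"):
    for baseline in ("zvol", "zvol+xfs"):
        ratio = relative("zfs", baseline, workload)
        print(f"zfs vs {baseline:<8} {workload:<9}: {ratio:.0%}")
```

On these numbers zfs reaches roughly 23% (read) and 19% (write) of the raw zvol, and about 36% of zvol+xfs for both reads and writes.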

[root@memverge4 anton]# cat /sys/module/zfs/parameters/zvol_dio_enabled
1
[root@memverge4 anton]# cat /sys/module/zfs/parameters/zfs_vdev_queue_batch
1
[root@memverge4 anton]# cat /sys/module/zfs/parameters/zfs_mdcomp_enabled
1
[root@memverge4 anton]# cat /sys/module/zfs/parameters/zfs_async_dio_enabled
1
[root@memverge4 anton]#
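When comparing runs, it can help to record all four tunables in one step rather than with separate `cat` calls. The following is a minimal sketch that assumes the parameters live under `/sys/module/zfs/parameters`, as in the transcript above; `zfs_vdev_queue_batch` and `zfs_mdcomp_enabled` exist only in the experimental build discussed here, so a missing file is reported rather than treated as an error. The helper name `read_params` is hypothetical.

```python
from pathlib import Path

# Directory and parameter names as they appear in the transcript above.
PARAM_DIR = Path("/sys/module/zfs/parameters")
PARAMS = [
    "zvol_dio_enabled",
    "zfs_vdev_queue_batch",
    "zfs_mdcomp_enabled",
    "zfs_async_dio_enabled",
]


def read_params(names, base=PARAM_DIR):
    """Return {name: value-as-string}, or None where the file cannot be read."""
    values = {}
    for name in names:
        try:
            values[name] = (Path(base) / name).read_text().strip()
        except OSError:
            # Parameter not present in this build, or module not loaded.
            values[name] = None
    return values


for name, value in read_params(PARAMS).items():
    print(f"{name} = {value if value is not None else '(not present)'}")
```

Saving this output next to each fio result makes it clear which combination of settings produced which number.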

@AntonHPE

@tiehexue

Got it!

These two parameters are not beneficial, so I will not push them to this PR.

zfs_vdev_queue_batch
zfs_mdcomp_enabled

Yes, but I'm ready to proceed. Direct+async random I/O should be improved for zfs. I provided perf outputs above; they might be helpful.
