true async io in linux #18684
|
Wow nice speedup! 👍 Here's the condensed version of your results: I tried to replicate your test on a single NVMe drive pool (all default pool settings) and got similar results: I also confirmed in iotop and zpool iostat that it really was reading at that speed: Note, I did try the test with random writes and didn't really see a performance difference: |
|
Yes, write performance is not good enough yet. However, if you raise iodepth from 1 to 8, or to 64, with async enabled, throughput scales up. On my loop devices, iodepth=8 already hits peak performance.
wy@u26:~$ echo 1 | sudo tee /sys/module/zfs/parameters/zfs_async_dio_enabled
1
wy@u26:~$ fio --filename=/t/fs1/testfile --rw=randwrite --bs=128k --size=1G --direct=1 --ioengine=libaio --iodepth=1 --runtime=10 --name=buf-read
buf-read: (g=0): rw=randwrite, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=1
fio-3.41
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=263MiB/s][w=2105 IOPS][eta 00m:00s]
buf-read: (groupid=0, jobs=1): err= 0: pid=23374: Thu Jun 18 02:17:38 2026
write: IOPS=2099, BW=262MiB/s (275MB/s)(1024MiB/3902msec); 0 zone resets
wy@u26:~$ fio --filename=/t/fs1/testfile --rw=randwrite --bs=128k --size=1G --direct=1 --ioengine=libaio --iodepth=8 --runtime=10 --name=buf-read
buf-read: (g=0): rw=randwrite, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=8
fio-3.41
Starting 1 process
Jobs: 1 (f=1)
buf-read: (groupid=0, jobs=1): err= 0: pid=23403: Thu Jun 18 02:17:48 2026
write: IOPS=3972, BW=497MiB/s (521MB/s)(1024MiB/2062msec); 0 zone resets
wy@u26:~$ echo 0 | sudo tee /sys/module/zfs/parameters/zfs_async_dio_enabled
0
wy@u26:~$ fio --filename=/t/fs1/testfile --rw=randwrite --bs=128k --size=1G --direct=1 --ioengine=libaio --iodepth=1 --runtime=10 --name=buf-read
buf-read: (g=0): rw=randwrite, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=1
fio-3.41
Starting 1 process
Jobs: 1 (f=1): [w(1)][-.-%][w=333MiB/s][w=2666 IOPS][eta 00m:00s]
buf-read: (groupid=0, jobs=1): err= 0: pid=23443: Thu Jun 18 02:18:13 2026
write: IOPS=2613, BW=327MiB/s (343MB/s)(1024MiB/3135msec); 0 zone resets
wy@u26:~$ fio --filename=/t/fs1/testfile --rw=randwrite --bs=128k --size=1G --direct=1 --ioengine=libaio --iodepth=8 --runtime=10 --name=buf-read
buf-read: (g=0): rw=randwrite, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=8
fio-3.41
Starting 1 process
Jobs: 1 (f=1): [w(1)][-.-%][w=330MiB/s][w=2636 IOPS][eta 00m:00s]
buf-read: (groupid=0, jobs=1): err= 0: pid=23471: Thu Jun 18 02:18:23 2026
write: IOPS=2657, BW=332MiB/s (348MB/s)(1024MiB/3083msec); 0 zone resets
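The four runs above can be repeated as a small sweep. This is only a sketch: the parameter path and the fio flags are copied from the commands in this comment, while the function names `fio_cmd` and `sweep`, the job names, and the depth list `1 8 64` are illustrative assumptions.

```shell
#!/bin/sh
# Sketch: sweep iodepth with zfs_async_dio_enabled switched on, then off.
# PARAM and the fio flags come from the runs above; everything else
# (function names, job names, depth list) is an assumption.
PARAM=/sys/module/zfs/parameters/zfs_async_dio_enabled
TESTFILE=${TESTFILE:-/t/fs1/testfile}

# Print the fio command line for one iodepth (does not run anything).
fio_cmd() {
    depth=$1
    echo "fio --filename=$TESTFILE --rw=randwrite --bs=128k --size=1G" \
         "--direct=1 --ioengine=libaio --iodepth=$depth --runtime=10" \
         "--name=randwrite-qd$depth"
}

# Run every depth with the parameter set to 1, then to 0, keeping only
# the fio summary line that reports IOPS and bandwidth.
sweep() {
    for enabled in 1 0; do
        echo "$enabled" | sudo tee "$PARAM" >/dev/null
        for depth in 1 8 64; do
            echo "async=$enabled iodepth=$depth"
            eval "$(fio_cmd "$depth")" | grep 'IOPS='
        done
    done
}
```

Calling `sweep` from a shell with sudo rights prints one summary line per combination, which makes the on/off comparison easier to read side by side.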
|
Sorry to be direct, but have you tested that? I'm testing it now and have already found issues with memory-mapped buffered writes. I'm no expert in that and am still trying to figure it out, but the logic is flaky.
Again, not trying to be a douche, but that was LLM-generated, right? That's fine, I also generate lots of code with it, but it should be tested thoroughly to be valid.
I still have nightmares about async IO, mainly due to how ZFS implements the DMU. This is a chapter that comes up over and over again and has so many corner cases that we cannot easily anticipate them.
I have accumulated some knowledge on the issue over about 5 years now, but again, I'm no expert. I'm also interested in NVMe performance not only on Linux but also on FreeBSD. But I think this patch is not up to the standards of the project. Yes, I cannot do it myself because I may lack the skills, but there are too many assumptions, like the "double the pages", that screamed LLM-generated code to me.
|
@tiehexue I'm currently testing this PR in my local CI with I tried running ZTS on my local VM (again, with So I'd recommend you get ZTS to pass in your local VM with |
OK, I missed this. I ran the async tests only locally, plus some manual tests, and verified in my personal zfs fork CI, but did not set |
|
@tonyhutter @amotin @viniciusferrao Thanks for all your reviews. One point I missed, as @tonyhutter said: I should run ZTS with zfs_async_dio_enabled=1. Let me do that and come back (it may take some time). I would be happy if you could keep reviewing, testing, and working based on this PR, though it is better to wait until I come back with ZTS passing with zfs_async_dio_enabled set. I am also glad that @viniciusferrao raised the question of LLM involvement. Yes, most of the code and tests were written by Copilot + deepseek-v4-pro. The most amazing thing is that it just works! For #18661, this PR, and the others I created for deadlocks and data corruption, the LLM (prompted by me) can understand the codebase, the issues, and the panic stack traces very broadly, and solve them. Most importantly, the generated code is not too much to review. I am curious how the upstream team reviews and re-tests bigger changes like this one and #18661 , #18679 , #18647 , #18660 , #18620 . Especially for this one, should there be several RC cycles? |
|
I'm a bit upset that this intrinsically licensially-problematic LLM extrusion was thrown into the PR pool and then left to reviewers to grok and prove correctness; the world's petabytes of data integrity is in your hands. This does not feel like responsible engineering. On the other hand, a tempting performance win is a tempting performance win (IFF demonstrably correct, see above). It'll be an interesting test of the system. |
Like ZFS, LLMs are created by some of the best people in IT for the world, for you and for me. For async IO in ZFS, I need the upstream team to be heavily involved, as you will understand, and I will do my best to push this to high quality. Let us focus on the code (no prejudice about its birth). The amazing part here is that async IO in ZFS is doable, though it may take some time. I will be more than excited if another PR gets merged for it. |
Careful. This is very close to reading like "you should be grateful", which is unlikely to win me over (at least).
It's not amazing. It's well known that better async performance and facilities are possible, and many people have been putting in the careful, meticulous work over many many years to make it happen. Some, like #10018, made it over the line after years; others, like #10425 (and its predecessors), did not. Maybe you can claim that large, invasive changes wouldn't take so long if there was an LLM involved. Ask me again when one of those land.
In my own experience, the best chance you have to get something difficult merged is to go out of your way to make it easy for a reviewer to say yes. Be meticulous in your testing and your descriptions. Break it up into multiple commits or multiple PRs, each with good commit messages and comments. Go out of your way to show that you understand reviewers have very limited time. But also, understand they all want to say yes! I'll also say, this has nothing to do with LLMs. We don't (currently) have an AI policy; I personally am content to keep it that way. I have multiple LLM-generated branches and experiments here, and they're all very interesting, and a couple have even identified a way to solve a problem that I couldn't previously get my head around. And yet, all of them will never see light of day in that form, because there's multiple things in there that I would not be comfortable staking my own reputation on. So they sit there until I use them as a guide to reimplement the thing from scratch, my way. In that same spirit, I review LLM-based contributions based on whether I believe the person putting them forward could have written them without assistance. Or, put another way, whether they can explain what they've done and how it works. Anyway, that's enough from me on that; this is hardly the place anyway. On this one, and the others, get the tests passing and show you understand how it works and show reasonable confidence that it isn't going to lose data, and I'm more than interested! Or, route around me; ultimately I'm just a guy, there's other people here who can review and approve changes too, and they have different priorities or standards to me. And that's ok too, because I trust them :) |
|
There is a great deal to review here, and I do not have the time to review it all, so I am only commenting on LLMs and contributing in general. LLMs are useful tools, but at their core, they are a very high-level autocomplete. While they sometimes surprise me, I find that they often do not have the same idea for how things should work as I do. Sometimes I simply did not provide enough information, and they did not think to ask for clarity on the omission. Other times, they simply have their own ideas that are not always correct. As a general rule of thumb, for any non-trivial change, an LLM probably did several things that need manual correction. In general, before submitting a PR, the author should review every line of code in context for correctness and either have reasonable certainty that it is correct or state that it is a work-in-progress (WIP). That means looking at the surrounding code, examining functions that call it for possible unforeseen side effects, etc. This applies to all commits, whether or not an LLM was used to assist in writing them. Outside of trivial commits, the first thing people make that works looks very different from what they actually put into the final PR. Most of us hide that (since the intermediate stages of development are not what we want in the repository). A few (like the BRT changes) had a much more public development process. There have been times when I went through multiple rounds of reviewing my own commits and making revisions before making a PR, since, while things seem good while working on them, artifacts often make it into the end result that are not quite right. I have even been known to spot issues shortly after making a PR, leading to multiple rounds of revisions from last-minute nitpicking over correctness. Finally, it is okay to submit PRs that are a WIP. Not everyone has the time to set up their own test infrastructure, and it can be genuinely useful to lean on the project's infrastructure to test things. 
That said, such PRs should always be marked as such.
For anything non-trivial, it is likely a unique work, so I am not sure how a licensing issue could arise. If it is trivial, it is probably not a unique work, but triviality takes precedence. That said, some projects are more cautious here, but I am not sure if there is any need. The technology at its core is a very clever auto-complete (studying supersonic13/llama3.c makes this very clear). As long as what you are doing with it is unique, it should produce a unique output. Since it learned in a manner that emulates human learning, the unique output should be a transformative use of what it learned, just like what humans produce is a transformative use of what we learned before: https://en.wikipedia.org/wiki/Transformative_use For example, this entire response, minus quotations where I have fair use, is not just unique but is also a transformative use of all of the English writing I have experienced in my life. I am thus free to license it under any license I wish.
The community is often thrilled to help mentor people. I myself started as a student who knew relatively little, but learned from feedback. Developing a filesystem is a slow process to minimize the possibility of introducing regressions. For non-trivial changes, one often must go through multiple rounds of writing, testing, and self-reviewing in response to comments from other contributors. Part of this is to ensure quality. If you submit your best effort where you cannot see any way to improve it, you often get higher-quality responses that help make it even better. Another part of it is that filesystem development is very hard. Erez Zadok once said that filesystem development is harder than rocket science, since a rocket only needs to work once while a filesystem must work all the time, in millions of places, 24x7. Now, with SpaceX, perhaps rocket science is catching up, but that really underscores how hard making code that works every time is. Also, despite our best efforts, mistakes still get past us from time to time.
Agreed. I see no reason to limit the tools people use as long as they take responsibility for the output. To give an example, the other day at work, I used a LLM to refactor a 50+ line function in a specific way. I could have done it myself, but it was faster to tell a LLM what I planned to do and let it make the changes. It output an improved function. I then reviewed every part of it in detail to verify it was what I had intended to produce. This is something I would have done even if I had written it myself, since it is easy to write something that has mistakes that are caught upon review. |
This is true async IO on Linux only, when using libaio, io_uring, etc. We already have true async in zio_nowait, so this wraps it in the dmu, vfs, and zpl layers. It works well and is multiple times faster; refer to the newly added tests for more information. Signed-off-by: tiehexue <tiehexue@hotmail.com>
Signed-off-by: tiehexue <tiehexue@hotmail.com>
Signed-off-by: tiehexue <tiehexue@hotmail.com>
zfs_setup_direct is not called in the async write path as it is in the async read path, because pinned pages would have to wait a long time in the write path, which would cause FOLL_LONGTERM pinning to hit the kernel limit. Signed-off-by: tiehexue <tiehexue@hotmail.com>
|
@tonyhutter Hi, I had problems setting up ZTS to run locally: lots of kernel panics unrelated to this PR. So I enabled async by default in the code on another branch (not in the upstream PR) and ran CI in my personal zfs fork. 9 of 12 Linux OSes finished green, and the three failures look unrelated to this PR. |
Pinning 2x pages was probably a misleading reaction to a crash on some OSes. Now that more gates have been added, it should be reverted. Signed-off-by: tiehexue <tiehexue@hotmail.com>
io_uring may not exist in older kernels such as 4.18, so just let the test case PASS and leave a note rather than have the CI fail unexpectedly. Signed-off-by: tiehexue <tiehexue@hotmail.com>
This feature is only available on the Linux platform. Signed-off-by: tiehexue <tiehexue@hotmail.com>
|
Hi @amotin @tonyhutter @viniciusferrao , I resolved the issues, verified by CI with zfs_async_dio_enabled set to both 1 and 0 for all of ZTS. @AntonHPE also provided test results on real hardware in #18660 : 5x faster on randread and 50% faster on randwrite, but still much slower than xfs+zvol. This nearly killed my passion for this PR. So I have a question. Assuming the same setup and the same test, if xfs+zvol reaches e.g. 300K IOPS, especially write IOPS, what is the theoretical maximum for zfs? If that number is performant enough compared to xfs+zvol, I would move on. If there is a limit (because zfs has much more robust data integrity?), we must think about it another way. How do we calculate the number? Suppose we run: zfs create -o internal_fs=xfs -o fs_limit=1T tank/fs and we actually create it with xfs+zvol. Is this still a zfs filesystem? With zvol we have a lot of nice features, and with xfs we have peak performance. What do we actually lose? And can we retain the most important zfs features plus this kind of performance? |
Improving performance is rarely just a matter of improving 1 thing. Performance improvements like those are genuinely amazing, and if this gets through the review process, future performance work will likely build on it. |
|
It might be useful for further investigation. I created a zvol (without xfs or zfs filesystems on top of the zvol). I re-created the zvol from scratch.
[root@memverge4 anton]# zpool create -f -o ashift=12 -O recordsize=16K -O atime=off -O xattr=sa -O compression=off -O dedup=off -o autotrim=on tank raidz2 /dev/disk/by-id/nvme-MV003200LXUJK_ZED0A0BQ0C93 /dev/disk/by-id/nvme-MV003200LXUJK_ZED0A0BS0C93 /dev/disk/by-id/nvme-MV003200LXUJK_ZED0A0BT0C93 /dev/disk/by-id/nvme-MV003200LXUJK_ZED0A0BW0C93 /dev/disk/by-id/nvme-MV003200LXUJK_ZED0A0BX0C93 /dev/disk/by-id/nvme-MV003200LXUJK_ZEE0A09A0C93
So, single thread 16KB random reads pure zvol (direct+async) 260k |
Thanks, Anton. This can be explained: a zvol has no filesystem overhead. I am building a package with both zvol direct IO and zfs true async. It would be great if you could test again. For the zfs part, I made two optimizations: one removes metadata compression, the other uses a batched vdev queue. I will let you know when it is ready. |
Thanks, @ryao . I am recharged now. Two more optimizations were added today; I will ask @AntonHPE to verify them when convenient, and if the results are good, I will push them to this PR. One module parameter disables metadata compression and one batches the vdev queue. Based on Anton's test results, there is 14% lz4 and 17% spin lock overhead in the write path. |
|
@AntonHPE Hi, could you check whether you can download the build from https://github.com/tiehexue/zfs/releases , where I also provide some instructions for the newly added module parameters? |
|
@tiehexue
I downloaded, results tomorrow.
|
|
I followed this link: https://github.com/tiehexue/zfs/releases
Removed the previous build and installed the latest OpenZFS 2.4.99.
1. zvol_dio_enabled: default 1, enable DirectIO in ZVOL for both linux and FreeBSD
Default is 0
***@***.*** anton]# cat /sys/module/zfs/parameters/zvol_dio_enabled
0
***@***.*** anton]# echo 1 > /sys/module/zfs/parameters/zvol_dio_enabled
***@***.*** anton]# cat /sys/module/zfs/parameters/zvol_dio_enabled
1
***@***.*** anton]#
2. zfs_vdev_queue_batch (linux only): batched vdev queue, default 0.
***@***.*** anton]# echo 1 > /sys/module/zfs/parameters/zfs_vdev_queue_batch
***@***.*** anton]# cat /sys/module/zfs/parameters/zfs_vdev_queue_batch
1
***@***.*** anton]#
3. zfs_mdcomp_enabled (linux only): default 1, metadata compressing.
***@***.*** anton]# cat /sys/module/zfs/parameters/zfs_mdcomp_enabled
1
***@***.*** anton]#
4. zfs_async_dio_enabled (linux only): default 1, enable true async IO.
***@***.*** anton]# cat /sys/module/zfs/parameters/zfs_async_dio_enabled
1
***@***.*** anton]#
***@***.*** anton]# zpool create -f -o ashift=12 -O recordsize=16K -O atime=off -O xattr=sa -O compression=off -O dedup=off -o autotrim=on tank raidz2 /dev/disk/by-id/nvme-MV003200LXUJK_ZED0A0BQ0C93 /dev/disk/by-id/nvme-MV003200LXUJK_ZED0A0BS0C93 /dev/disk/by-id/nvme-MV003200LXUJK_ZED0A0BT0C93 /dev/disk/by-id/nvme-MV003200LXUJK_ZED0A0BW0C93 /dev/disk/by-id/nvme-MV003200LXUJK_ZED0A0BX0C93 /dev/disk/by-id/nvme-MV003200LXUJK_ZEE0A09A0C93
***@***.*** anton]#
***@***.*** anton]# zfs create -V 128G -o volblocksize=16K -o compression=off -o dedup=off tank/fiotest
***@***.*** anton]#
Random read on zvol
***@***.*** anton]# fio --name=test --rw=randread --bs=16k --filename=/dev/zvol/tank/fiotest --direct=1 --numjobs=1 --iodepth=64 --exitall --group_reporting --ioengine=libaio --runtime=60 --time_based
test: (g=0): rw=randread, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=libaio, iodepth=64
fio-3.42-67-gfb5b
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=4008MiB/s][r=256k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=60575: Wed Jun 24 09:05:33 2026
read: IOPS=255k, BW=3985MiB/s (4178MB/s)(233GiB/60001msec)
slat (nsec): min=510, max=3241.9k, avg=1968.04, stdev=3702.00
clat (usec): min=25, max=4046, avg=248.83, stdev=128.64
lat (usec): min=42, max=4047, avg=250.79, stdev=128.37
Samples: 1M of event 'cycles:P', 4000 Hz, Event count (approx.): 984934497842 lost: 0/0 drop: 0/0
Overhead Shared Object Symbol
42.14% [kernel] [k] osq_lock
7.05% [kernel] [k] native_queued_spin_lock_slowpath
2.22% [kernel] [k] mutex_spin_on_owner
2.19% [kernel] [k] read_tsc
2.04% [kernel] [k] LZ4_uncompress_unknownOutputSize
1.58% [kernel] [k] __slab_free
1.21% [kernel] [k] _raw_spin_lock_irqsave
1.08% [kernel] [k] fletcher_4_avx512f_native
1.02% [kernel] [k] dbuf_compare
1.00% [kernel] [k] zio_done
0.95% [kernel] [k] try_to_wake_up
0.94% [kernel] [k] zio_create
0.90% [kernel] [k] default_wake_function
0.83% [kernel] [k] enqueue_task_fair
0.80% [kernel] [k] __switch_to
0.77% [kernel] [k] _raw_spin_lock
0.67% [kernel] [k] select_task_rq_fair
0.64% [kernel] [k] zio_vdev_io_done
0.64% [kernel] [k] percpu_counter_add_batch
0.59% [kernel] [k] mutex_lock
0.50% [kernel] [k] vdev_raidz_child_done
0.49% [kernel] [k] zio_execute
Random write on zvol
***@***.*** anton]# fio --name=test --rw=randwrite --bs=16k --filename=/dev/zvol/tank/fiotest --direct=1 --numjobs=1 --iodepth=64 --exitall --group_reporting --ioengine=libaio --runtime=60 --time_based
test: (g=0): rw=randwrite, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=libaio, iodepth=64
fio-3.42-67-gfb5b
Starting 1 process
Jobs: 1 (f=1): [f(1)][100.0%][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=60701: Wed Jun 24 09:07:47 2026
write: IOPS=178k, BW=2778MiB/s (2913MB/s)(163GiB/60001msec)
slat (nsec): min=651, max=3058.1k, avg=2617.71, stdev=4299.24
clat (usec): min=21, max=12859, avg=357.09, stdev=292.38
lat (usec): min=45, max=12861, avg=359.70, stdev=292.20
Samples: 846K of event 'cycles:P', 4000 Hz, Event count (approx.): 514908625822 lost: 0/0 drop: 0/0
Overhead Shared Object Symbol
19.68% [kernel] [k] native_queued_spin_lock_slowpath
3.97% [kernel] [k] osq_lock
3.65% [kernel] [k] LZ4_compressCtx
3.49% [kernel] [k] read_tsc
2.28% [kernel] [k] __slab_free
1.91% [kernel] [k] mutex_spin_on_owner
1.68% [kernel] [k] _raw_spin_lock
1.64% [kernel] [k] fletcher_4_avx512f_native
1.57% [kernel] [k] _raw_spin_lock_irqsave
1.41% [kernel] [k] zio_create
1.37% [kernel] [k] zio_done
1.33% [kernel] [k] try_to_wake_up
1.21% [kernel] [k] enqueue_task_fair
1.18% [kernel] [k] default_wake_function
1.14% [kernel] [k] mutex_lock
1.11% [kernel] [k] __switch_to
0.94% [kernel] [k] dbuf_compare
0.89% [kernel] [k] percpu_counter_add_batch
0.87% [kernel] [k] select_task_rq_fair
0.86% [kernel] [k] zio_vdev_io_done
0.84% [kernel] [k] vdev_raidz_child_done
0.76% [kernel] [k] zio_execute
0.72% [kernel] [k] zio_vdev_io_assess
0.69% [kernel] [k] mutex_unlock
0.66% [kernel] [k] dequeue_task_fair
Random read on zfs
***@***.*** anton]# zfs create tank/testfs
***@***.*** anton]# fio --name=test --rw=randread --bs=16k --filename=/tank/testfs/testfile --direct=1 --numjobs=1 --iodepth=64 --exitall --group_reporting --ioengine=libaio --runtime=60 --time_based
test: (g=0): rw=randread, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=libaio, iodepth=64
fio-3.42-67-gfb5b
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=847MiB/s][r=54.2k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=64049: Wed Jun 24 09:50:17 2026
read: IOPS=52.7k, BW=824MiB/s (864MB/s)(48.3GiB/60001msec)
slat (usec): min=6, max=197, avg=18.16, stdev= 8.52
clat (usec): min=88, max=2259, avg=1195.32, stdev=223.26
lat (usec): min=103, max=2292, avg=1213.48, stdev=226.80
Samples: 318K of event 'cycles:P', 4000 Hz, Event count (approx.): 105317738498 lost: 0/0 drop: 0/0
Overhead Shared Object Symbol
6.19% [kernel] [k] __slab_free
6.00% [kernel] [k] read_tsc
3.15% [kernel] [k] zio_done
2.97% [kernel] [k] LZ4_uncompress_unknownOutputSize
2.30% [kernel] [k] zio_create
2.29% [kernel] [k] fletcher_4_avx512f_native
2.16% [kernel] [k] try_to_wake_up
1.83% [kernel] [k] default_wake_function
1.71% [kernel] [k] _raw_spin_lock_irqsave
1.68% [kernel] [k] select_task_rq_fair
1.62% [kernel] [k] zio_execute
1.51% [kernel] [k] percpu_counter_add_batch
1.49% [kernel] [k] enqueue_task_fair
1.48% [kernel] [k] _raw_spin_lock
1.42% [kernel] [k] zio_vdev_io_done
1.38% [kernel] [k] kmem_cache_free
1.30% [kernel] [k] vdev_raidz_child_done
1.28% [kernel] [k] dbuf_compare
1.21% [kernel] [k] update_rq_clock_task
1.00% [kernel] [k] put_prev_task_idle
Random write on zfs
***@***.*** anton]# fio --name=test --rw=randwrite --bs=16k --filename=/tank/testfs/testfile --direct=1 --numjobs=1 --iodepth=64 --exitall --group_reporting --ioengine=libaio --runtime=60 --time_based
test: (g=0): rw=randwrite, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=libaio, iodepth=64
fio-3.42-67-gfb5b
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=383MiB/s][w=24.5k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=64982: Wed Jun 24 09:52:04 2026
write: IOPS=25.2k, BW=393MiB/s (412MB/s)(23.0GiB/60001msec)
slat (usec): min=15, max=1465, avg=38.61, stdev=17.54
clat (usec): min=20, max=6361, avg=2504.96, stdev=518.03
lat (usec): min=46, max=6433, avg=2543.56, stdev=525.33
Samples: 541K of event 'cycles:P', 4000 Hz, Event count (approx.): 253610490543 lost: 0/0 drop: 0/0
Overhead Shared Object Symbol
15.19% [kernel] [k] native_queued_spin_lock_slowpath
13.50% [kernel] [k] LZ4_compressCtx
3.58% [kernel] [k] read_tsc
3.19% [kernel] [k] LZ4_uncompress_unknownOutputSize
2.97% [kernel] [k] __slab_free
1.58% [kernel] [k] zio_done
1.55% [kernel] [k] zio_create
1.48% [kernel] [k] _raw_spin_lock_irqsave
1.33% [kernel] [k] _raw_spin_lock
1.20% [kernel] [k] default_wake_function
1.12% [kernel] [k] percpu_counter_add_batch
1.05% [kernel] [k] enqueue_task_fair
1.04% [kernel] [k] mutex_spin_on_owner
1.04% [kernel] [k] try_to_wake_up
1.01% [kernel] [k] zio_execute
1.00% [kernel] [k] osq_lock
0.89% [kernel] [k] select_task_rq_fair
0.89% [kernel] [k] vdev_raidz_child_done
0.85% [kernel] [k] zio_vdev_io_done
0.79% [kernel] [k] __switch_to
0.79% [kernel] [k] mutex_lock
0.76% [kernel] [k] wbt_data_dir
It is getting better for zvol (especially random write), but getting worse for zfs.
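As a sanity check on the four fio runs above, Little's law (throughput is roughly the number of outstanding I/Os divided by mean latency) can be applied to the reported `lat ... avg=` values at iodepth=64. This is a sketch only; the function name `implied_kiops` is illustrative and not part of fio.

```shell
#!/bin/sh
# Little's law: IOPS ~= outstanding I/Os / mean completion latency.
# Prints the implied rate in kIOPS for a queue depth and a mean latency
# given in microseconds (fio's "lat (usec) ... avg=" value).
# implied_kiops is an illustrative name, not a fio feature.
implied_kiops() {
    awk -v depth="$1" -v lat_us="$2" \
        'BEGIN { printf "%.1f\n", depth * 1000 / lat_us }'
}

implied_kiops 64 250.79    # zvol randread
implied_kiops 64 359.70    # zvol randwrite
implied_kiops 64 1213.48   # zfs randread
implied_kiops 64 2543.56   # zfs randwrite
```

The implied rates come out at roughly 255, 178, 53 and 25 kIOPS, in line with the 255k, 178k, 52.7k and 25.2k that fio reports. That suggests the queue stays full in all four runs and that per-I/O latency, not submission, is what separates zfs from zvol here.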
|
|
Thanks. With zfs_vdev_queue_batch set to 1, I see write IOPS go up by 50%, but that is not reproduced in your environment.
Does zfs_vdev_queue_batch have an impact on zvol, giving 178K? Was that 23.7K before?
|
|
Got it! These two parameters are not beneficial, so I will not push them to this PR. zfs_vdev_queue_batch |
|
I think not. I'm confused, so let me clarify. I think the misunderstanding goes back to the beginning. Initially I posted the request #18644 . Later I added another request, #18660 . Let's omit "sequential read/write only with bs=1m and 2m on zvol", as it works perfectly, near the limits. So, while we are focused on direct+async, we have two different workloads (random and sequential) and two different targets (zvol and zfs (and zvol+xfs)). I just re-ran everything, now focusing on: zvol 256 kIOPS 180 kIOPS [root@memverge4 anton]# cat /sys/module/zfs/parameters/zvol_dio_enabled |
Yes, but I'm ready to proceed. Direct+async random I/O should be improved for zfs. I provided perf outputs above, which might be helpful. |
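To total the share of a group of symbols (locks, LZ4, and so on) in perf listings like the ones above, a small awk filter is enough. This is a sketch; `perf_share` is an illustrative name, and it assumes each symbol sits on the same line as its percentage.

```shell
#!/bin/sh
# Sum the Overhead column of perf-report-style lines whose symbol
# (the last field) matches an extended regular expression.
# perf_share is an illustrative name, not part of perf.
perf_share() {
    awk -v pat="$1" '
        $1 ~ /%$/ && $NF ~ pat { sub(/%$/, "", $1); sum += $1 }
        END { printf "%.2f\n", sum }'
}

# Four lines copied from the "Random write on zfs" listing.
perf_share 'spin_lock|LZ4' <<'EOF'
15.19% [kernel] [k] native_queued_spin_lock_slowpath
13.50% [kernel] [k] LZ4_compressCtx
3.58% [kernel] [k] read_tsc
3.19% [kernel] [k] LZ4_uncompress_unknownOutputSize
EOF
```

Feeding a whole saved listing through `perf_share 'spin|osq_lock|mutex'` gives a single number for lock contention that can be compared between the zvol and zfs runs.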

Motivation and Context
This is for #18660 .
Description
zio_nowait is already asynchronous, so we just wrap it in the dmu, vfs, and zpl layers. Both read and write have a completion callback, and write chains two callbacks, one for data and one for metadata. The metadata callback is issued from the system taskq.
To make this PR production-ready, I need help with code review, design review, and testing on real hardware.
It works fine on Ubuntu 26, and I tried to ensure it works on the Linux distributions in the CI. Below is a simple test in a local VM; as iodepth increases, we see better IOPS.
How Has This Been Tested?
Tested on Ubuntu 26 and in my personal zfs fork CI.
Types of changes
Checklist:
Signed-off-by.