Conversation

@Developer-Ecosystem-Engineering Developer-Ecosystem-Engineering commented Mar 12, 2025

The current behavior of flushing the pagecache via posix_fadvise(POSIX_FADV_DONTNEED) is not a valid operation on macOS.

This can result in misleading outcomes when measuring performance. The change instead flushes via mmap() and msync(), which better reflects the desired behavior on macOS.

@Developer-Ecosystem-Engineering
Author

Thank you for the fast review @axboe, will integrate the requested changes!

@Developer-Ecosystem-Engineering
Author

Developer-Ecosystem-Engineering commented Mar 31, 2025

The latest commit should address all the open feedback. Thank you very much @sitsofe and @axboe for the reviews! Let us know if there is anything else necessary or desired!

  • Remove debug print
  • Make all comments a single *
  • Use /* */ instead of //
  • Return instead of exit
  • Ensure errno is populated
  • Direct return instead of using a placeholder
  • Move to APPLE
  • Migrate code to os/os-mac.h since it is about macOS

@axboe
Owner

axboe commented Mar 31, 2025

You should fold it into a single patch and force push it again, it doesn't make a lot of sense to have a broken state in the upstream tree.

@Developer-Ecosystem-Engineering
Author

You should fold it into a single patch and force push it again, it doesn't make a lot of sense to have a broken state in the upstream tree.

Not a problem, taking a peek at that failure real quick. Once everyone is satisfied, will squash and force push.

Collaborator

@sitsofe sitsofe left a comment


Just out of interest do you see a difference running the following job with and without your change?

./fio --filename=fio.tmp --stonewall --thread --size=1G --bs=4k \
  --name=create --rw=write \
  --name=cached --rw=read --loops=2 --invalidate=0 \
  --name=invalidated --rw=read --loops=2 --invalidate=1

@Developer-Ecosystem-Engineering
Author

Developer-Ecosystem-Engineering commented May 14, 2025

Squashed all commits and force pushed as requested.

Just out of interest do you see a difference running the following job with and without your change?

./fio --filename=fio.tmp --stonewall --thread --size=1G --bs=4k \
  --name=create --rw=write \
  --name=cached --rw=read --loops=2 --invalidate=0 \
  --name=invalidated --rw=read --loops=2 --invalidate=1

@sitsofe we'd expect to see the removal of cached reads on macOS.

An example data set:

  • 27.2GiB/s-27.2GiB/s (29.2GB/s-29.2GB/s): a cached read
  • 3290MiB/s-3290MiB/s (3450MB/s-3450MB/s): an uncached read

Collaborator

@sitsofe sitsofe left a comment


This is moving in the right direction! However I've been running this PR on my Intel Mac and I didn't see the expected results with this command:

./fio --stonewall --size=128M --ioengine=posixaio --filename=fio.tmp --iodepth=64 \
      --bs=4k --invalidate=0 --direct=0 \
      --name=create --rw=write \
      --name=cached --rw=randread --loops=2 \
      --name=invalidated --rw=randread --loops=2 --invalidate=1  | grep groupid= -A 1
create: (groupid=0, jobs=1): err= 0: pid=40797: Sat May 17 19:49:50 2025
  write: IOPS=172k, BW=670MiB/s (703MB/s)(128MiB/191msec); 0 zone resets
--
cached: (groupid=1, jobs=1): err= 0: pid=40798: Sat May 17 19:49:50 2025
  read: IOPS=36.9k, BW=144MiB/s (151MB/s)(256MiB/1775msec)
--
invalidated: (groupid=2, jobs=1): err= 0: pid=40799: Sat May 17 19:49:50 2025
  read: IOPS=40.4k, BW=158MiB/s (165MB/s)(256MiB/1623msec)

That cached result seemed too low, so I played about, and I've got a branch I can share that shuffles things around and hooks up more of fadvise, which shows the expected results:

create: (groupid=0, jobs=1): err= 0: pid=42100: Sat May 17 19:57:59 2025
  write: IOPS=180k, BW=703MiB/s (737MB/s)(128MiB/182msec); 0 zone resets
--
cached: (groupid=1, jobs=1): err= 0: pid=42101: Sat May 17 19:57:59 2025
  read: IOPS=216k, BW=842MiB/s (883MB/s)(256MiB/304msec)
--
invalidated: (groupid=2, jobs=1): err= 0: pid=42102: Sat May 17 19:57:59 2025
  read: IOPS=63.8k, BW=249MiB/s (261MB/s)(256MiB/1027msec)

@axboe how do we feel about committer email addresses that are the obfuscated "made by the GitHub web interface" type? After the squashing it currently says this:

Author: Developer-Ecosystem-Engineering <65677710+Developer-Ecosystem-Engineering@users.noreply.github.com>

@Developer-Ecosystem-Engineering
Author

Hi @sitsofe,

We have most of what you've requested ready, but we want to give the project full answers to all the open questions, so we'll deliver a new squashed force push with the changes plus answers for the remaining open questions above. It takes us a little while to work through qualifying the answers; we appreciate the patience.

@sitsofe
Collaborator

sitsofe commented Jun 18, 2025

Hi @Developer-Ecosystem-Engineering:

We have most of what you've requested ready, but we want to give the project full answers to all the open questions, so we'll deliver a new squashed force push with the changes plus answers for the remaining open questions above. It takes us a little while to work through qualifying the answers; we appreciate the patience.

It's just a bit of a shame the turnaround time is so long because I'd love to see your work go in as soon as possible! It's nice to keep the momentum but I appreciate you need responses from others and that depends on their time.

In the meantime I've put up https://github.com/sitsofe/fio/tree/refs/heads/macos_posix_fadvise which is an idea that builds on your work and moves it into an object of its own. I've made the commit message a bit longer to:

  • Show what's been tested and how things have changed
  • Provide references to old commits that do something similar on other platforms

What do you think of it? If it looks useful please take from it!

Finally, would it be possible for you to use a Real Name in your commits? Currently we have this:

Author:     Developer-Ecosystem-Engineering <65677710+Developer-Ecosystem-Engineering@users.noreply.github.com>
AuthorDate: Wed Mar 12 15:27:29 2025 -0700
Commit:     Developer-Ecosystem-Engineering <65677710+Developer-Ecosystem-Engineering@users.noreply.github.com>
CommitDate: Wed May 14 09:58:58 2025 -0700
[...]
Signed-off-by: [<groupemail>]@apple.com

I've talked to @axboe and he said this:

What I care about is that the identity identifies a person, and:

Developer-Ecosystem-Engineering <65677710+Developer-Ecosystem-Engineering@users.noreply.github.com>

literally means nothing. So yeah, that needs to be a proper [identity] and email.

So could you:

  • Make sure your commits have a real email address (as opposed to GitHub obfuscated one)
  • Could you use a Real Name with your commit

(see https://stackoverflow.com/questions/3042437/how-can-i-change-the-commit-author-for-a-single-commit for hints on how to fix up existing commits with new authors)

Thanks!

@Developer-Ecosystem-Engineering
Author

Hi @sitsofe, we've read through the changes in your branch, and they appear to incorporate all of the feedback and open questions already. You noticed the test failure and the downstream posix_fadvise issue, and the branch also addresses the feedback about the comments and organization.

It probably makes sense for the project to take your changes, since they incorporate all the feedback; we looked at rebasing on top of them with any changes we would have, and there was nothing left to add! They also resolve the open feedback on sign-off.

The other open question above was answered and there is one remaining, which I am working on.

If we mmap a 1TByte file fully but don't write to it, will any extra memory be used aside from the setup of the TLB?

@sitsofe
Collaborator

sitsofe commented Jul 16, 2025

@Developer-Ecosystem-Engineering:

Just for the record - I never intended to steal your thunder but thank you for blessing my additions. I want to see your idea and work go into fio in some form because it solves a genuine behaviour issue on macOS.

It also solves the open feedback on sign-off.

Not quite; if we're going to go with my branch then I would end up squashing everything into one commit, and it is only fair to credit the person who delivered the solution to the UBC cache invalidation problem I failed to solve several years ago. Please could I have a name and email address so I can add a Suggested-by / Co-developed-by line? If you really want it to be the ecosystem address I will use that...

@Developer-Ecosystem-Engineering
Author

Hi @sitsofe

We need no thunder, it's OK =) We just want to help too. We'd prefer the group identity if you must use anything, but it's perfectly fine not to have it; it's your code base! If you don't want the GitHub one, as it's a bit verbose, our GitHub handle without the dashes is the email address.

Must we unmask the Dread Pirate Roberts? =)

@Developer-Ecosystem-Engineering
Author

If we mmap a 1TByte file fully but don't write to it, will any extra memory be used aside from the setup of the TLB?

When you mmap a 1TByte file without actually touching it, the primary work is virtual memory setup: the kernel only modifies the process memory map, and no page table entries are created until accesses and their accompanying page faults occur.

Those faults happen at access time. If the data is already in the UBC, the process's mapped address range will reuse the resident pages.

@sitsofe
Collaborator

sitsofe commented Jul 18, 2025

Must we unmask the Dread Pirate Roberts? =)

Gotcha Westley @Developer-Ecosystem-Engineering - your mystique shall continue on.

Re mmap: thanks for the update. I'll see to reworking things a bit Real Soon Now™ and I'll add you to the review so you can see it going past.

This (finally) provides macOS cache invalidation and is heavily based on
code originally provided by DeveloperEcosystemEngineering@apple.

Because posix_fadvise() isn't implemented on macOS,
DeveloperEcosystemEngineering demonstrated that creating a shared
mapping of a file and using msync([...], MS_INVALIDATE) on it can
be used to discard covered page cache pages instead - ingenious! This
commit uses that technique to create a macOS posix_fadvise([...],
POSIX_FADV_DONTNEED) shim.

To paraphrase commit 8300eba ("windowsaio: add best effort cache
invalidation") that was done for similar reasons:

This change may make default bandwidth speeds on macOS look lower
compared to older versions of fio but this matches the behaviour of fio
on other platforms with invalidation (such as Linux) because we are
trying to avoid measuring cache reuse (unless invalidate=0 is set).

The impact of invalidation is demonstrated by the bandwidths achieved by
the following jobs running on an SSD of an otherwise idle Intel Mac
laptop with 16GBytes of RAM:

./fio --stonewall --size=128M --ioengine=posixaio --filename=fio.tmp \
  --iodepth=64 --bs=4k --direct=0 \
  --name=create --rw=write \
  --name=cached --rw=randread --loops=2 --invalidate=0 \
  --name=invalidated --rw=randread --loops=2 --invalidate=1

[...]
cached: (groupid=1, jobs=1): err= 0: pid=7795: Tue Sep  2 22:34:12 2025
  read: IOPS=228k, BW=889MiB/s (932MB/s)(256MiB/288msec)
[...]
invalidated: (groupid=2, jobs=1): err= 0: pid=7796: Tue Sep  2 22:34:12 2025
  read: IOPS=46.8k, BW=183MiB/s (192MB/s)(256MiB/1399msec)

v2:
- Move platform specific code into its own file under os/mac/
- Don't do prior fsync() because msync([...], MS_INVALIDATE) doesn't
  imply the dropping of dirty pages and will have the same effect

v3:
- Up the mmap chunk size to 16 GBytes to reduce the number of times we
  mmap()/msync()/munmap() on large files
- Align offset and len to the system page size to prevent errors on jobs
  like ./fio --name=n --offset=2k --size=30k
- Try and munmap() if msync() fails
- Make Rosetta comment clearer
- Drop some variables and rename some others
- Don't bother trying to restore errno after displaying an error message
  because posix_fadvise() isn't defined as setting errno

Fixes: axboe#48
Suggested-by: DeveloperEcosystemEngineering <[email protected]>
Signed-off-by: Sitsofe Wheeler <[email protected]>
- Add support for POSIX_FADV_NORMAL in the posix_fadvise() shim by just
  ignoring it
- Add support for POSIX_FADV_SEQUENTIAL/POSIX_FADV_RANDOM by mapping
  them to enable/disable of readahead via fcntl(..., F_RDAHEAD, ...).
  Because macOS only lets you control readahead at the descriptor level
  the offset and len values passed will be ignored and range control is
  not done.

The impact of being able to tune readahead is demonstrated by the
bandwidths achieved by the following jobs running on an SSD of an
otherwise idle Intel Mac laptop with 16GBytes of RAM:

./fio --stonewall --size=128M --filename=fio.tmp --bs=4k --rw=read \
  --name=sequential-readahead --fadvise=sequential \
  --name=sequential-no-readahead --fadvise=random

[...]
sequential-readahead: (groupid=0, jobs=1): err= 0: pid=6250: Tue Sep  2 22:10:45 2025
  read: IOPS=331k, BW=1293MiB/s (1356MB/s)(128MiB/99msec)
[...]
sequential-no-readahead: (groupid=1, jobs=1): err= 0: pid=6251: Tue Sep  2 22:10:45 2025
  read: IOPS=25.9k, BW=101MiB/s (106MB/s)(128MiB/1263msec)

rm -f fio-huge.tmp
truncate -s 1T fio-huge.tmp
./fio --stonewall --filename=fio-huge.tmp --bs=32k --runtime=10s --rw=randread:3 \
  --name=partial-random-no-readahead --fadvise=random \
  --name=absorb-cache-invalidation --number_ios=1 --bs=4k \
  --name=partial-random-readahead --fadvise=sequential

[...]
partial-random-no-readahead: (groupid=0, jobs=1): err= 0: pid=6259: Tue Sep  2 22:12:35 2025
  read: IOPS=92.4k, BW=2888MiB/s (3029MB/s)(28.2GiB/10001msec)
[...]
partial-random-readahead: (groupid=2, jobs=1): err= 0: pid=6261: Tue Sep  2 22:12:35 2025
  read: IOPS=61.8k, BW=1931MiB/s (2024MB/s)(18.9GiB/10001msec)

Signed-off-by: Sitsofe Wheeler <[email protected]>
@sitsofe sitsofe force-pushed the improve_flushing_darwin branch from 6979c18 to 714d1c3 Compare September 3, 2025 19:35
@sitsofe
Collaborator

sitsofe commented Sep 3, 2025

I finally found the time to work on this but I have a bunch of things that I want to bring to people's attention. Here’s a summary:

  • Is invalidating all of an otherwise partial page OK?
  • Should readahead be hinted at by job type on macOS?
  • Should the mmap chunk size be something other than 16GiB?
  • There seems to be a bug truncating massive files on macOS 15.6 under APFS (@Developer-Ecosystem-Engineering)

It’s possible to ask fio to invalidate the cache for a range that isn’t page-aligned (e.g. by running ./fio --name=n --offset=2k --size=18k), but the mmap-based shim has to work at page granularity. The Linux posix_fadvise man page says the following about POSIX_FADV_DONTNEED:

Requests to discard partial pages are ignored. It is preferable to preserve needed data than discard unneeded data. If the application requires that data be considered for discarding, then offset and size must be page-aligned.

The Open Group posix_fadvise documentation and the FreeBSD posix_fadvise man page are silent about what happens to partial pages. For the fio macOS posix_fadvise shim I’ve taken the approach of invalidating all of a page that would otherwise only be partially covered by aligning the offset down and the len up. I can change this if people would prefer more Linux-like behaviour.

Since a large mmap doesn’t really cost much more memory, I figured I would make the mmap chunk size larger so that fewer mmap/msync/munmap calls would be done, reducing the overhead of repeatedly setting things up and tearing them down. Here are the results of my testing:

macOS 15.6 Intel (2GHz i5) msync(MS_INVALIDATE)

| mmap chunk size (GiB) | 0.5 | 1 | 8 | 16 | 128 | 1024 |
| --------------------- | --- | --- | --- | --- | --- | --- |
| linear read | 0.3710356 | 0.4872094 | 0.3823824 | 0.3986368 | 0.5639938 | 1.9420478 |
| rand read | 10.265224 | 13.656658 | 11.046206 | 11.286724 | 11.46629 | 11.436732 |
| gappy | 7.854328 | 10.5910972 | 9.5441084 | 9.5344792 | 9.5729838 | 9.5502052 |
| near empty | 0.0213032 | 0.0185342 | 0.0037254 | 0.0018804 | 0.0002896 | 0.000118 |

macOS 15.6 M4 Pro msync(MS_INVALIDATE)

| mmap chunk size (GiB) | 0.5 | 1 | 8 | 16 | 128 | 1024 |
| --------------------- | --- | --- | --- | --- | --- | --- |
| linear read | 0.0514746 | 0.0448044 | 0.0401286 | 0.0419396 | 0.0646114 | 0.2266038 |
| rand read | 1.3273664 | 1.3098192 | 1.2935882 | 1.3060148 | 1.2855568 | 1.3004732 |
| gappy | 1.1701312 | 1.1473312 | 1.1202244 | 1.1186982 | 1.1388276 | 1.1316176 |
| near empty | 0.0251256 | 0.0101194 | 0.0011742 | 0.0009186 | 0.000453 | 0.000403 |

macOS 15.6 M4 Pro Rosetta msync(MS_INVALIDATE)

| mmap chunk size (GiB) | 0.5 | 1 | 8 | 16 | 128 | 1024 |
| --------------------- | --- | --- | --- | --- | --- | --- |
| linear read | 0.0592442 | 0.0535576 | 0.0443478 | 0.0457978 | 0.0666314 | 0.2284822 |
| rand read | 1.3480976 | 1.333327 | 1.3117098 | 1.3103916 | 1.3127036 | 1.2981126 |
| gappy | 1.1752522 | 1.1550404 | 1.1484232 | 1.1360098 | 1.134033 | 1.143597 |
| near empty | 0.028696 | 0.0207544 | 0.0085264 | 0.0080604 | 0.0063026 | 0.006159 |

Linux AMD EPYC 7742 (2.2GHz) posix_fadvise([…], POSIX_FADV_DONTNEED)

| Fill I/O pattern | Time (s) |
| ---------------- | -------- |
| linear read | 1.6783038 |
| rand read | 4.6355044 |
| gappy | 4.0741604 |
| near empty | 0.0000204 |

The additional patch to log the time taken to do invalidation (0001-DRAFT-add-cache-invalidation-logging.patch) and a script to do the benchmarking (bench.sh) are attached. Results are the time it took to invalidate the page cache after performing 8 GBytes of 4 KByte sized reads in a variety of patterns/situations against a 1 TByte sparse file (and different mmap chunk sizes on macOS). Where a row has multiple results, the quickest time is highlighted in bold.

Summary: The M4 Pro Mac results are extremely quick; the Intel Mac came last but performed best with a small mmap chunk size, and using macOS’ Rosetta doesn’t slow things down by much. To my surprise, it's only when the page cache is empty that a larger chunk size performs best! On the Apple Silicon Mac, a chunk size of 16 GiB took slightly less time than 1 GiB in the near-empty page cache case (which I would hope is the most common scenario) and didn't do badly in the other cases, so I’ve gone with that, but I’m open to suggestions.

While doing the aforementioned tuning, I was shocked to find that macOS can take more than a second to invalidate 8 GBytes of cache loaded randomly. I didn’t remember dropping the page cache on Linux being that slow so I tried the same test on that OS and found it took even longer! Unfortunately time spent invalidating the cache is counted as part of an fio job’s runtime - if you have a job whose runtime is 1 second but invalidating the cache takes 2 seconds, then only a single I/O is done and the bandwidth reported is very low because most of the job’s time was spent invalidating the cache. Fixing this is hard because invalidation happens when a file is opened and further, each time a loop is completed the file is re-opened. For now I’ve just added an ETA status for invalidation so it’s a little bit more visible, although the status is only updated 3 seconds after the first job starts so if the initial invalidation takes less time than that the ETA status won’t show it. I can drop the ETA change if people don't feel it's useful.

macOS’s default readahead seems to be able to turn itself off fairly rapidly. I found that only random/non-sequential jobs that trick readahead into getting a few hits before seeking to a new location show better performance with readahead turned off via fadvise_hint=random. Do people still want the readahead turned off by default on macOS random workloads?

There is a 100% reproducible bug on at least macOS 15.6: if a large sparse file is created (e.g. 1 Petabyte), 16 random reads are done from it, and the file is then truncated to 1 Terabyte, the truncation will spin a CPU indefinitely. When in this state, the truncation can’t be aborted/killed even during system shutdown (when starting back up, macOS asks the user to report an issue). Trying the same test on a file on Linux under XFS (because ext4 limits file sizes to 16 Terabytes) does not show the same behaviour: truncation finishes after a few seconds. The following reproduces the problem on both an Intel Mac and an M4 Mac:

rm -f /tmp/huge; truncate -s 1p /tmp/huge
./fio --rw=randread --number_ios=16 --filename /tmp/huge --norandommap --name=n
truncate -s 1T /tmp/huge

@Developer-Ecosystem-Engineering can you reproduce the problem too?

@sitsofe sitsofe dismissed their stale review September 3, 2025 20:12

I shouldn't review my own changes :-)

@sitsofe sitsofe requested a review from axboe September 3, 2025 20:12
@sitsofe sitsofe requested a review from vincentkfu September 3, 2025 20:12
@axboe
Owner

axboe commented Sep 3, 2025

Apart from that one patch adding the invalidating state, this looks fine to me. Maybe we just defer that decision to later and you drop the patch for now? Or if you think it really needs to be there, let's hear the reasoning :-)

@Developer-Ecosystem-Engineering
Author

There is a 100% reproducible bug on at least macOS 15.6: if a large sparse file is created (e.g. 1 Petabyte), 16 random reads are done from it, and the file is then truncated to 1 Terabyte, the truncation will spin a CPU indefinitely. When in this state, the truncation can’t be aborted/killed even during system shutdown (when starting back up, macOS asks the user to report an issue).

We reproduced it! Tracking with 159795525.

@sitsofe sitsofe force-pushed the improve_flushing_darwin branch from 714d1c3 to 407491a Compare September 4, 2025 07:12
@sitsofe
Collaborator

sitsofe commented Sep 4, 2025

@axboe:

Apart from that one patch adding the invalidating state, looks fine to me. Maybe we just defer that decision to later and you just drop it from now? Or if you think it really needs to be there, let's here the reasoning :-)

I've become quite attached to that patch while testing these changes :-) However, my sentimentality isn't enough that it has to go in now, and given how rarely it shows on the ETA let's drop the patch here (I'll keep it stashed on a branch somewhere).

@Developer-Ecosystem-Engineering:

We reproduced it! Tracking with 159795525.

Thank you!

@sitsofe
Collaborator

sitsofe commented Sep 5, 2025

A quick check @Developer-Ecosystem-Engineering, @axboe: do you feel this PR needs further changes?

@axboe axboe merged commit fc8f9c7 into axboe:master Sep 6, 2025
17 checks passed
@axboe
Owner

axboe commented Sep 6, 2025

Thanks everyone, now merged.
