Skip to content

Conversation

kitten
Copy link
Contributor

@kitten kitten commented Aug 17, 2025

Summary

Hardlinking/cloning is increasingly common in package managers, when a file system allows for efficient copy-on-write hardlinks/clones. Bun and pnpm use this for dependency installations.

When installing a new package, package managers may create a hardlink of a package in node_modules (or an isolated dependencies sub-folder) that is a clone of a Node module directory in their cache folders. When this happens, since a copy-on-write clone operation may have occurred, only a single file-system event is emitted for the entire directory. This reproduces consistently on APFS.

The NativeWatcher does not recognise this and won't emit events for the cloned directory's contents, which means that the change events aren't successfully updating the FileMap.

Instead, we can:

  • batch change events in a macro-tick
  • detect hardlink directory candidates without any present contents
  • crawl these directories manually and emit events for them

Changelog: [Fix] Fix added directories not being detected properly by the NativeWatcher when they're cloned using a hardlink/copy-on-write operation (typically, as Bun or pnpm install new packages)

Test plan

  • Start up a project with metro (e.g. an Expo project) that misses a package or start using an uninstalled package (e.g. expo-image)
    • Setting DEBUG=Metro:NativeWatcher to observe watcher events
  • Bundling should observe that this package does not resolve
  • With Bun, bun add expo-image, or pnpm, pnpm add expo-image
  • Without the fix: We can observe that directory events are reported for the hardlinked directories without sub-directory/file events
  • With the fix: Traversal adds events for the added sub-directories/files

I've tested that removing the package again or modifying it works as expected. I didn't observe duplicate events, but I believe the parent classes filter out duplicates, if any occur.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 17, 2025
@facebook-github-bot facebook-github-bot added the Shared with Meta Applied via automation to indicate that an Issue or Pull Request has been shared with the team. label Aug 17, 2025
@kitten kitten force-pushed the @kitten/fix/native-watcher-symlink-order branch from 97ac278 to c82b33d Compare August 17, 2025 15:04
@kitten kitten force-pushed the @kitten/fix/native-watcher-symlink-order branch from c82b33d to c843fa7 Compare August 17, 2025 15:08
kitten added 2 commits August 17, 2025 16:29
… are emitted

We can do a bit better with preventing excessive events by batching all
incoming change events in a macro tick. In each tick, we'll process the
events, and filter them as usual, but also mark hardlink candidates, as
before.

However, we now omit hardlink candidates, if any sub-files/directories
are in the batch, which indicates usually that `fs.watch` will deliver
children and the candidate is not a clone/hardlink/copy-on-write directory.

We then also batch the outputs and process all pending ones in the same
tick. Due to the async/await logic, we usually won't process inputs and
outputs in the same tick.
@kitten
Copy link
Contributor Author

kitten commented Aug 17, 2025

I've pushed a new implementation that attempts to avoid duplication by adding an input/output batch. This now skips hardlink directory candidates from being crawled, if the input batch contains any subpaths. This seems to behave much more predictably and the integration tests are passing with this change

@kitten kitten changed the title Fix NativeWatcher not ingesting hardlinked directories due to omitted events Fix NativeWatcher not ingesting hardlinked/cloned directories (Bun/pnpm installs) due to omitted events Aug 17, 2025
@robhogan
Copy link
Contributor

This looks awesome @kitten - thank you!

The NativeWatcher does not recognise this and won't emit events for the cloned directory's contents, which means that the change events aren't successfully updating the FileMap.

Do you happen to know if this is -
a) A regression since #1420 ? (Evan mentioned issues installing packages with bun around that time)
b) a difference in behaviour from Watchman or the FallbackWatcher?

We could do with some integration tests using hard links with the different backends but you can leave that to me - thanks again for the PR and I'll make some time for a detailed review and testing this week.

@kitten
Copy link
Contributor Author

kitten commented Aug 19, 2025

It's b). Specifically, for some syscalls it seems like fs.watch will only receive a directory change event

// for manual crawling if there's no other event that has a child path in this directory.
// If we have a child path, then we can rely on `fs.watch` reporting changes for
// this directory's files
if (type === 'd' && isMaybeHardlinked(stat)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you find this code path is hit with Bun package manager? I'm trying to write some integration tests for it at the moment and having a bit of trouble reproducing a directory with nlink > 1.

AFAICT, pnpm does not hard link directories as such, rather it recursively hard links file contents.

It looks like Bun uses the clonefile() (copy on write) C sys call, but according to the man page for that, it also just clones each file:

If src names a directory, the directory hierarchy is cloned as if each item
was cloned individually.

It does seem like it's possible to hard link a directory but it's pretty buried, has specific constraints, and creates hard links that some shell commands don't understand: https://jameshunt.us/writings/linking-dirs-in-macos/ - is that was this needs to support?

Copy link
Contributor Author

@kitten kitten Aug 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've noticed the pnpm difference, yes. bun seems to hit some kind of special case sometimes where it's using a single syscall to "clone" that target directory, which triggers the issue pretty reliably. What made me a bit suspicious is this comparison of syscall counts: https://x.com/lydiahallie/status/1955671845384142963

In my testing (with Bun >=1.2.19), it reproduced reliably. Could be filesystem specific? Maybe it only does that special clone operation on APFS?
(Edit: Also probably worth mentioning that I tested this against an expo-* package, so I knew it was copied to multiple projects that I've got checked out)

What my suspicion was that APFS (like some other Linux file-systems) supports a copy-on-write clone operation that can operate on an entire directory, so like a hardlink, but without having to recursively hardlink files-only.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, after another look it seems to me like the nlink for a directory on APFS has little to do with hard links - it's simply the number of entries inside the directory (+2 for . and ..).

So nlink > 1 should always hold on APFS, and the logic in this PR boils down to "if we see a directory event and no child events within a tick, crawl the directory" - which is probably fine in general. But if the directory isn't in fact a hard link, but a recursive creation of file hardlinks/reflinks, might the crawl race with the creation of the contents?

I'm curious about what it is fs.watch is doing differently vs Watchman and/or fsevents. And regardless of whether we work around this in Metro - is it a Node.js bug?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah haha, well that explains that. I must've gotten confused with nlink on file vs folders. Yes, I did see race conditions. This generally leads to a double-report which wasn't that bad but did break tests here and there.

Adding the tick batching worked around this. Node tries to report changes in batches, so generally they're all expected to mostly arrive in one tick. That's not the case for Windows, which is why on that experimental branch for Windows that branches off of this one, the tick length is increased with a timeout.

I think whether it's a Node bug or not boils down whether they've implemented logic like this or not. I'm assuming that the events they're receiving have the same limitations we see here, and the question is whether they ever perform crawls or intend to, to guarantee all recursive files to always have individual events

Copy link
Contributor Author

@kitten kitten Sep 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like a section in this blog post calls out the exact syscall for macOS explicitly: https://bun.com/blog/behind-the-scenes-of-bun-install#macos (clonefile or clonefile_each_dir)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Shared with Meta Applied via automation to indicate that an Issue or Pull Request has been shared with the team.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants