directory_cache: fix CAS inode corruption from chmod-during-eviction#2347
Merged
MarcusSorealheis merged 1 commit on May 19, 2026
palfrey
previously requested changes
May 19, 2026
Member
palfrey
left a comment
Ouch, nasty one, good catch! One minor item, otherwise good.
Member
set_readwrite_recursive and set_readwrite_one_path can be dropped as part of this, as the only user of them is removed in this PR
Contributor
Author
Collaborator
I think we can resolve here
Collaborator
🔥🔥🔥
Collaborator
@erneestoc the generated descriptions from Claude did not specify whether or not this is a breaking change, which is important for our Changelog generation. I believe this is only a bugfix, but it would be good to include the PR template and wrap all of your Claude info in the Description.
Collaborator
Nevertheless, fantastic PR and thank you.
Port of TraceMachina/nativelink PR TraceMachina#2243 commit a47774d.
Root cause: when DirectoryCache evicts an entry, the cleanup path calls `set_readwrite_recursive` on the cached tree before `remove_dir_all`. That helper chmods every entry, including files, to 0o755/0o644. Files in a cached entry are hardlinked into in-flight action workspaces (via `hardlink_directory_tree` in `get_or_create`) and ultimately share an inode with the underlying `FilesystemStore` CAS blob (via `fs::hard_link` in `download_to_directory`). Chmoding the cached-side file therefore silently mutates the shared inode's mode for every other in-flight action holding a hardlink to the same blob.
Production symptom: EACCES on exec for `cc_wrapper.sh`. The CAS mode of 0o555 (r-xr-xr-x) gets clobbered to 0o644 (rw-r--r--), dropping the +x bit while an unrelated action is mid-exec.
Fix: introduce `set_dir_writable_recursive` which only chmods directories, never files. On unix, write permission on the parent directory is sufficient to unlink files inside; the files' own modes are irrelevant for unlinking. Switch the eviction cleanup path in `DirectoryCache::evict_lru` to the new helper.
Empirically verified: a regression test that hardlinks a 0o555 file into a cached tree and runs the cleanup helper FAILS on pre-fix code (file mode mutated to 0o644) and PASSES on post-fix code.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
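The root cause and the fix described in this commit message can be reproduced with the standard library alone. The following is a minimal, unix-only sketch, not the code from the PR: `dirs_writable_recursive`, `blob_mode_after_link_chmod`, and `blob_mode_after_cleanup` are hypothetical names, and the "CAS blob" is just a temporary file.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::PermissionsExt;
use std::path::Path;

/// Permission bits of `path`, read with lstat so symlinks are not followed.
fn mode_of(path: &Path) -> io::Result<u32> {
    Ok(fs::symlink_metadata(path)?.permissions().mode() & 0o777)
}

/// The bug in miniature: chmod through one hardlink name changes the mode
/// seen through every other name, because the mode lives on the inode.
fn blob_mode_after_link_chmod(scratch: &Path) -> io::Result<u32> {
    fs::create_dir_all(scratch)?;
    let blob = scratch.join("blob");
    let link = scratch.join("cached_copy");
    fs::write(&blob, b"#!/bin/sh\n")?;
    fs::set_permissions(&blob, fs::Permissions::from_mode(0o555))?;
    fs::hard_link(&blob, &link)?; // same inode, two names
    fs::set_permissions(&link, fs::Permissions::from_mode(0o644))?;
    mode_of(&blob)
}

/// Hypothetical stand-in for the fixed helper: directories become 0o755,
/// files (and symlinks) are never chmod'd.
fn dirs_writable_recursive(root: &Path) -> io::Result<()> {
    if !fs::symlink_metadata(root)?.is_dir() {
        return Ok(());
    }
    fs::set_permissions(root, fs::Permissions::from_mode(0o755))?;
    for entry in fs::read_dir(root)? {
        dirs_writable_recursive(&entry?.path())?;
    }
    Ok(())
}

/// Hardlinks a 0o555 "blob" into a read-only "cache entry", runs the
/// directory-only cleanup plus removal, and reports the blob's mode.
fn blob_mode_after_cleanup(scratch: &Path) -> io::Result<u32> {
    let blob = scratch.join("blob");
    let entry = scratch.join("entry");
    fs::create_dir_all(&entry)?;
    fs::write(&blob, b"#!/bin/sh\n")?;
    fs::set_permissions(&blob, fs::Permissions::from_mode(0o555))?;
    fs::hard_link(&blob, entry.join("cc_wrapper.sh"))?;
    fs::set_permissions(&entry, fs::Permissions::from_mode(0o555))?;

    dirs_writable_recursive(&entry)?;
    fs::remove_dir_all(&entry)?; // unlinking only needs a writable parent
    mode_of(&blob)
}
```

Calling `blob_mode_after_link_chmod` on a fresh scratch directory returns 0o644, while `blob_mode_after_cleanup` returns the original 0o555, which is the property the regression test in this PR is described as asserting.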
3210b02 to
1723b62
Compare
Merged
8 tasks
erneestoc
added a commit
to erneestoc/nativelink
that referenced
this pull request
May 19, 2026
…2347 cleanup
After TraceMachina#2347 (DirectoryCache cleanup uses set_dir_writable_recursive) and this PR (post-clonefile path uses chmod_dir_writable), the generic set_readwrite_recursive helper has zero remaining callers. Both replacements are intentionally narrower: set_dir_writable_recursive chmods only directories so file inodes aren't mutated, and chmod_dir_writable chmods only the destination root so the clone tree stays read-only inside. Delete the old helper and its private companion set_readwrite_one_path.
Addresses palfrey's review feedback on TraceMachina#2347 ("set_readwrite_recursive and set_readwrite_one_path can be dropped as part of this") - sequenced on this PR instead because both PRs had to land before the helpers became dead.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MarcusSorealheis
added a commit
that referenced
this pull request
May 21, 2026
…rialization speedup) (#2349)
* fs_util: skip set_readwrite_recursive walk after clonefile
The macOS clonefile fast path was followed by a recursive chmod walk that made every file in the cloned tree writable (0o644 / 0o755). On real Bazel input shapes (~2000-file SwiftCompile) that walk accounted for ~46% of materialization time: ~33 µs per file, ~67 ms per action.
Replace the walk with a single chmod(2) on the destination root. Existing entries inherit the source's read-only mode (0o555 dirs, 0o444 files). The worker can still create the action's declared output files inside the root because the root itself is 0o755.
This matches the hermeticity contract enforced by Bazel's local sandbox (linux-sandbox bind-mounts inputs read-only; darwin-sandbox / sandbox-exec denies writes outside declared output paths) and the REAPI Action.output_files / output_directories semantics: actions write only to declared outputs, never mutate inputs. An action that does try to mutate an input now hits EACCES, which is the correct REAPI behavior, the same failure mode as on Bazel's own sandbox.
Bench (nativelink-util/benches/chmod_strategy.rs on the bench branch), toplevel_only vs full walk:
shape                      walk        toplevel_only  speedup
small_flat (64 files)      4.66 ms     2.61 ms        1.79x
pcm_cluster (219 files)    15.17 ms    8.19 ms        1.85x
medium_flat (635 files)    46.36 ms    25.10 ms       1.85x
large_flat (1978 files)    147.39 ms   80.17 ms       1.84x
set_readwrite_recursive stays public - directory_cache.rs:451 still uses it on the source side during eviction.
Tests:
- test_clonefile_root_writable_inputs_readonly: root 0o755, subdirs 0o555, files 0o444 (replaces the old test_clonefile_dest_is_writable which assumed subdirs would be made writable).
- test_clonefile_root_accepts_new_files: worker can create outputs at the root even though everything inside the clone is read-only.
- test_clonefile_input_mutation_fails: writes to existing input files fail with PermissionDenied, which encodes the hermeticity contract.
* fs_util: remove dead set_readwrite_recursive after post-#2347 cleanup
After #2347 (DirectoryCache cleanup uses set_dir_writable_recursive) and this PR (post-clonefile path uses chmod_dir_writable), the generic set_readwrite_recursive helper has zero remaining callers. Both replacements are intentionally narrower: set_dir_writable_recursive chmods only directories so file inodes aren't mutated, and chmod_dir_writable chmods only the destination root so the clone tree stays read-only inside. Delete the old helper and its private companion set_readwrite_one_path.
Addresses palfrey's review feedback on #2347 ("set_readwrite_recursive and set_readwrite_one_path can be dropped as part of this") - sequenced on this PR instead because both PRs had to land before the helpers became dead.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* worker: make the APFS clonefile directory-cache path work end-to-end
PR #2338's clonefile fast path is effectively dead in production: it materialized 1 of 1771 actions in a real build. Three coupled bugs kept the directory cache silently falling back to the slow download path, and two of them only surface once the fast path actually fires, so they must land together.
Bug 1: prepare_action_inputs received a work directory the caller had already created, but hardlink_directory_tree and clonefile(2) both require the destination to NOT exist. Every cache attempt failed its precondition and fell back to download_to_directory.
Fix: remove the empty pre-created directory before invoking the cache, and recreate it on cache failure so the download fallback (which needs an existing destination) still works. Adds fs::remove_dir for the empty-dir removal.
Bug 2: set_readonly_recursive chmod'd files to 0o444, stripping the execute bit from cached executables. Once a tree is cloned into a workspace this makes an action's interpreter/wrapper script fail with EACCES.
Fix: mark files 0o555 instead of 0o444: read + execute, still no write bit, so the hermeticity contract is unchanged.
Bug 3: the clonefile path chmods only the destination root writable; cloned subdirectories keep the source's 0o555 mode. Bazel actions declare outputs at paths nested inside input subdirectories, and creating those files needs write permission on the parent directory.
Fix: after a cache hit, set_dir_writable_recursive makes every directory in the materialized tree writable. Files stay read-only: they may be CAS-hardlinked and chmoding them would corrupt the shared inode.
Adds regression tests for nested output creation, which the existing root-only clonefile tests did not cover.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Marcus Eagan <marcuseagan@gmail.com>
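The Bug 1 fix above (remove the pre-created directory, try the cache, recreate the directory before falling back) can be sketched as follows. This is an illustration under assumptions: `prepare_inputs` and `Materialized` are hypothetical names, and the cache and download steps are passed in as closures rather than being the real `DirectoryCache` or `download_to_directory`.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Which path ended up populating the work directory.
#[derive(Debug, PartialEq)]
enum Materialized {
    FromCache,
    FromDownload,
}

/// Hypothetical sketch of the ordering described for Bug 1.
fn prepare_inputs<C, D>(work_dir: &Path, try_cache: C, download: D) -> io::Result<Materialized>
where
    C: FnOnce(&Path) -> io::Result<()>,
    D: FnOnce(&Path) -> io::Result<()>,
{
    // The cache path (hardlink tree or clonefile) needs a destination that
    // does not exist yet. `remove_dir` only succeeds on an empty directory.
    fs::remove_dir(work_dir)?;
    match try_cache(work_dir) {
        Ok(()) => Ok(Materialized::FromCache),
        Err(_) => {
            // The download fallback expects an existing destination.
            fs::create_dir_all(work_dir)?;
            download(work_dir)?;
            Ok(Materialized::FromDownload)
        }
    }
}
```

Passing a cache closure that returns an error exercises the fallback branch; the download closure then sees a directory that exists again.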
This was referenced May 22, 2026
erneestoc
added a commit
to erneestoc/nativelink
that referenced
this pull request
May 22, 2026
`DirectoryCache` locks each cache entry down with `set_readonly_recursive` after construction. Previously that helper made the entire entry tree mode 0o555, directories included, so every materialization had to follow up with a separate `set_dir_writable_recursive` recursive chmod walk in `prepare_action_inputs` to re-add write permission to directories (Bazel actions declare outputs at paths nested inside input subdirectories).
That post-walk is redundant work. Directories are not hardlink-shared between cache entries (only file content inodes are), so directory mode can safely be made writable once, at the cache entry, instead of on every materialization.
`set_readonly_recursive` now locks a tree down as a cache entry by making only FILES read-only (0o555) and leaving DIRECTORIES writable (0o755). Both materialization paths then produce a directly-usable tree:
- macOS `clonefile(2)` copies the source's modes verbatim, so the clone's directories are writable and its files read-only.
- The Linux per-file hardlink walk creates fresh directories (writable) and hardlinks files (which keep the source inode's read-only mode).
Files stay read-only on both paths, so the hermeticity contract and the CAS-hardlink shared-inode invariant (PR TraceMachina#2347) are preserved. With the materialized tree already correct, the `set_dir_writable_recursive` call is removed from `prepare_action_inputs`. `set_dir_writable_recursive` itself is unchanged and still used by the cache eviction cleanup path.
Tests:
- fs_util: `test_set_readonly_recursive` now also asserts directories stay writable; the macOS clonefile tests assert cloned subdirs are writable and that a nested output can be created with no `set_dir_writable_recursive` walk; `test_set_dir_writable_recursive_walks_nested_dirs` keeps covering the eviction-cleanup helper.
- directory_cache: new `test_materialized_tree_dirs_writable_files_readonly` builds a nested tree and asserts that, after `get_or_create` on both the fresh-materialize and cache-hit paths, every directory is writable and every file is read-only, with no separate chmod walk.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
8 tasks
MarcusSorealheis
added a commit
that referenced
this pull request
May 23, 2026
The recursive permission walk `set_perms_recursive_impl` (driving both
`set_readonly_recursive` and `set_dir_writable_recursive`) used
`fs::metadata` (stat), which follows symlinks. On input trees containing
symlinks - e.g. `.venv/bin/python3` produced by rules_python /
rules_apple venv tooling - this had two failure modes:
* A symlink to a directory reported `is_dir() == true`, so the walk
recursed *through* the link, escaping the materialized tree or
descending into an unrelated directory.
* A symlink was passed to `set_permissions`; `chmod` follows symlinks,
so it mutated the link's target. When the target did not exist (a
dangling link - common when a venv points outside the action's
input set) the `chmod` returned ENOENT and failed the entire walk.
That ENOENT failure surfaced as `set_readonly_recursive` erroring inside
`DirectoryCache::get_or_create`, which made `prepare_action_inputs` log
"Directory cache failed, falling back to traditional download" and take
the slow `download_to_directory` path.
Fix: `set_perms_recursive_impl` now uses `symlink_metadata` (lstat) and
returns early on symlink entries - it never chmods a symlink and never
recurses through one. Regular files keep their existing read-only
(0o555) treatment, so the CAS-hardlinked-inode hermeticity contract
(PR #2347) is unchanged.
`hardlink_directory_tree_recursive` already recreated symlinks as
symlinks; its symlink branch is reordered ahead of the `is_dir()` /
`is_file()` branches to make the symlink-first intent explicit and
robust.
Adds regression tests covering set-readonly, set-dir-writable, and
hardlink/clone walks over a tree containing a symlink to an in-tree
file, a dangling relative symlink, and a symlink to an in-tree
directory, asserting each walk succeeds and the symlinks are preserved
with their targets intact.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Marcus Eagan <marcuseagan@gmail.com>
Closed
MarcusSorealheis
added a commit
that referenced
this pull request
May 25, 2026
* util: make permission walks symlink-safe
The recursive permission walk `set_perms_recursive_impl` (driving both
`set_readonly_recursive` and `set_dir_writable_recursive`) used
`fs::metadata` (stat), which follows symlinks. On input trees containing
symlinks - e.g. `.venv/bin/python3` produced by rules_python /
rules_apple venv tooling - this had two failure modes:
* A symlink to a directory reported `is_dir() == true`, so the walk
recursed *through* the link, escaping the materialized tree or
descending into an unrelated directory.
* A symlink was passed to `set_permissions`; `chmod` follows symlinks,
so it mutated the link's target. When the target did not exist (a
dangling link - common when a venv points outside the action's
input set) the `chmod` returned ENOENT and failed the entire walk.
That ENOENT failure surfaced as `set_readonly_recursive` erroring inside
`DirectoryCache::get_or_create`, which made `prepare_action_inputs` log
"Directory cache failed, falling back to traditional download" and take
the slow `download_to_directory` path.
Fix: `set_perms_recursive_impl` now uses `symlink_metadata` (lstat) and
returns early on symlink entries - it never chmods a symlink and never
recurses through one. Regular files keep their existing read-only
(0o555) treatment, so the CAS-hardlinked-inode hermeticity contract
(PR #2347) is unchanged.
`hardlink_directory_tree_recursive` already recreated symlinks as
symlinks; its symlink branch is reordered ahead of the `is_dir()` /
`is_file()` branches to make the symlink-first intent explicit and
robust.
Adds regression tests covering set-readonly, set-dir-writable, and
hardlink/clone walks over a tree containing a symlink to an in-tree
file, a dangling relative symlink, and a symlink to an in-tree
directory, asserting each walk succeeds and the symlinks are preserved
with their targets intact.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
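The symlink-safe walk described in this commit can be illustrated with a short unix-only sketch. It is not the actual `set_perms_recursive_impl`: `lock_down` and `build_fixture` are hypothetical names, and the 0o755 / 0o555 modes are assumed from the other commit messages in this thread.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::{symlink, PermissionsExt};
use std::path::Path;

/// Permission bits of `path`, read with lstat.
fn mode_of(path: &Path) -> io::Result<u32> {
    Ok(fs::symlink_metadata(path)?.permissions().mode() & 0o777)
}

/// Hypothetical walk: directories 0o755, regular files 0o555, and symlinks
/// skipped entirely (never chmod'd, never recursed through).
fn lock_down(path: &Path) -> io::Result<()> {
    // symlink_metadata is lstat: it describes the link, not its target.
    let file_type = fs::symlink_metadata(path)?.file_type();
    if file_type.is_symlink() {
        return Ok(());
    }
    if file_type.is_dir() {
        fs::set_permissions(path, fs::Permissions::from_mode(0o755))?;
        for entry in fs::read_dir(path)? {
            lock_down(&entry?.path())?;
        }
    } else {
        fs::set_permissions(path, fs::Permissions::from_mode(0o555))?;
    }
    Ok(())
}

/// Builds `base/tree` with a link to a file, a dangling link, and a link to
/// a directory outside the tree that must not be touched by the walk.
fn build_fixture(base: &Path) -> io::Result<()> {
    let tree = base.join("tree");
    fs::create_dir_all(tree.join("real_dir"))?;
    fs::create_dir_all(base.join("outside"))?;
    fs::write(tree.join("real_dir/file"), b"input")?;
    fs::write(base.join("outside/secret"), b"not ours")?;
    fs::set_permissions(base.join("outside/secret"), fs::Permissions::from_mode(0o600))?;
    symlink("real_dir/file", tree.join("link_to_file"))?;
    symlink("../missing", tree.join("dangling"))?;
    symlink("../outside", tree.join("escape"))?;
    Ok(())
}
```

With `fs::metadata` in place of `fs::symlink_metadata`, the same walk would fail on `dangling` and would descend through `escape` into `outside`, which are the two failure modes listed above.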
* worker: make directory-cache entries already-writable
`DirectoryCache` locks each cache entry down with `set_readonly_recursive`
after construction. Previously that helper made the entire entry tree mode
0o555 — directories included — so every materialization had to follow up
with a separate `set_dir_writable_recursive` recursive chmod walk in
`prepare_action_inputs` to re-add write permission to directories (Bazel
actions declare outputs at paths nested inside input subdirectories).
That post-walk is redundant work. Directories are not hardlink-shared
between cache entries — only file content inodes are — so directory mode
can safely be made writable once, at the cache entry, instead of on every
materialization.
`set_readonly_recursive` now locks a tree down as a cache entry by making
only FILES read-only (0o555) and leaving DIRECTORIES writable (0o755).
Both materialization paths then produce a directly-usable tree:
- macOS `clonefile(2)` copies the source's modes verbatim, so the clone's
directories are writable and its files read-only.
- The Linux per-file hardlink walk creates fresh directories (writable)
and hardlinks files (which keep the source inode's read-only mode).
Files stay read-only on both paths, so the hermeticity contract and the
CAS-hardlink shared-inode invariant (PR #2347) are preserved. With the
materialized tree already correct, the `set_dir_writable_recursive` call
is removed from `prepare_action_inputs`. `set_dir_writable_recursive`
itself is unchanged and still used by the cache eviction cleanup path.
Tests:
- fs_util: `test_set_readonly_recursive` now also asserts directories stay
writable; the macOS clonefile tests assert cloned subdirs are writable
and that a nested output can be created with no `set_dir_writable_recursive`
walk; `test_set_dir_writable_recursive_walks_nested_dirs` keeps covering
the eviction-cleanup helper.
- directory_cache: new `test_materialized_tree_dirs_writable_files_readonly`
builds a nested tree and asserts that, after `get_or_create` on both the
fresh-materialize and cache-hit paths, every directory is writable and
every file is read-only, with no separate chmod walk.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* worker: hardlink CAS blobs in directory-cache construct
`DirectoryCache::construct_directory` previously materialized every file by
fetching the whole blob into RAM (`get_part_unchunked`) and writing a full
copy (`fs::write`). For a cache that exists to avoid re-fetching from the
CAS, this is the dominant cost on a miss.
Switch the cache-entry file build to hardlink the FilesystemStore CAS blob
directly into the cache entry — zero-copy, metadata-only — exactly the way
`download_to_directory` already does on the fallback path:
`populate_fast_store` then `get_file_entry_for_digest` /
`get_file_path_locked` / `fs::hard_link`.
Correctness:
* A hardlinked CAS blob shares its inode with the CAS store and every
other action that hardlinked the same blob, so it must never be
chmod'd (the inode-corruption bug PR #2347 fixed). Executable files
(`FileNode.is_executable`) therefore get their own private inode via
fetch+write and are chmod'd 0o555 on that unshared copy — never
hardlinked.
* When the blob is not locally hardlinkable (the fast tier is not a
FilesystemStore, or the blob is absent / evicted from it), the file
falls back to fetch+write rather than failing the build.
* Zero-byte files keep their existing direct-write special case.
* The post-construction lockdown switches from `set_readonly_recursive`
(which chmods files, and would corrupt the shared CAS inode) to
`set_dir_writable_recursive`, which only touches directories.
`DirectoryCache::new` now takes the worker's `Arc<FastSlowStore>` so it can
reach `populate_fast_store` and downcast the fast tier to FilesystemStore.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
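The per-file decision described in the correctness list (hardlink when possible, private copy for executables or when no local blob is available) might look roughly like this. It is a sketch under assumptions: `place_file` and `Placed` are hypothetical, and the `fetch` closure stands in for whatever store call returns the blob bytes.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::{MetadataExt, PermissionsExt};
use std::path::Path;

/// How a file ended up in the cache entry.
#[derive(Debug, PartialEq)]
enum Placed {
    Hardlinked,
    PrivateCopy,
}

/// Hypothetical sketch: non-executables hardlink the local blob when they
/// can; executables and non-hardlinkable blobs get a private inode.
fn place_file<F>(local_blob: Option<&Path>, dest: &Path, is_executable: bool, fetch: F) -> io::Result<Placed>
where
    F: FnOnce() -> io::Result<Vec<u8>>,
{
    if !is_executable {
        if let Some(blob) = local_blob {
            if fs::hard_link(blob, dest).is_ok() {
                return Ok(Placed::Hardlinked); // shared inode: never chmod it
            }
        }
    }
    // Fallback: fetch the bytes and write an unshared copy.
    fs::write(dest, fetch()?)?;
    if is_executable {
        // Safe to chmod here because this inode belongs to `dest` alone.
        fs::set_permissions(dest, fs::Permissions::from_mode(0o555))?;
    }
    Ok(Placed::PrivateCopy)
}

/// True when both paths resolve to the same inode on the same device.
fn same_inode(a: &Path, b: &Path) -> io::Result<bool> {
    let (ma, mb) = (fs::metadata(a)?, fs::metadata(b)?);
    Ok(ma.dev() == mb.dev() && ma.ino() == mb.ino())
}

/// Permission bits of `path`.
fn mode_of(path: &Path) -> io::Result<u32> {
    Ok(fs::metadata(path)?.permissions().mode() & 0o777)
}
```

The design point is that the chmod only ever runs on the branch that produced a fresh inode, so the shared blob's mode cannot change.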
* worker: drop the two redundant full-tree walks in directory-cache build
After `construct_directory`, the cache-miss path walked the materialized
tree twice more: `calculate_directory_size` (an `fs::metadata` per file)
to compute the LRU size, and a recursive permission pass to normalize
directory modes. Both are now folded into construction itself.
Size: `construct_directory` returns the total tree size, accumulated from
`FileNode.digest.size_bytes` in the `Directory` protos it already decodes.
This is also more correct than the old filesystem walk — it counts each
file once by its CAS size and never follows symlinks into possibly-shared
or external targets. Symlinks contribute nothing.
Directory mode: each cache-entry directory is chmod'd 0o755 the moment it
is created (`create_dir_writable`), umask-independent. The directory is
writable while it is populated and that is its stable final mode, so the
separate post-construction `set_dir_writable_recursive` walk is gone.
Cache-entry files are still never chmod'd here — they may be CAS-blob
hardlinks (OPT #1) and mutating their mode would corrupt the shared inode.
Reconciliation with PR #2357: that PR reworks `set_readonly_recursive` so
the recursive walk leaves dirs 0o755 / files 0o555. This commit removes
the directory-cache build's dependence on any such recursive walk
entirely — modes are set at creation. Whichever lands second, the rebase
is a straight delete of the now-unused call site; there is no semantic
conflict because both converge on 0o755 directories, and #2357's file
handling is irrelevant here since the cache build no longer touches file
modes at all.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
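Folding the size computation and the directory mode into construction, as this commit describes, can be sketched as below. `DirNode` is a hypothetical, simplified stand-in for the decoded `Directory` proto, and `construct` writes placeholder bytes instead of hardlinking or fetching blobs; only `create_dir_writable` is a name taken from the commit message, and its body here is an assumption.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::PermissionsExt;
use std::path::Path;

/// Hypothetical stand-in for a decoded `Directory` proto.
struct DirNode {
    /// (file name, digest size in bytes)
    files: Vec<(String, u64)>,
    /// (directory name, child node)
    dirs: Vec<(String, DirNode)>,
}

/// Creates `path` and forces it to 0o755 regardless of the process umask.
fn create_dir_writable(path: &Path) -> io::Result<()> {
    fs::create_dir(path)?;
    fs::set_permissions(path, fs::Permissions::from_mode(0o755))
}

/// Builds the tree and returns its total size, summed from the digest sizes
/// rather than from a second filesystem walk.
fn construct(node: &DirNode, dest: &Path) -> io::Result<u64> {
    create_dir_writable(dest)?;
    let mut total = 0u64;
    for (name, size) in &node.files {
        // Placeholder content; real code would hardlink or write the blob.
        fs::write(dest.join(name), vec![0u8; *size as usize])?;
        total += *size;
    }
    for (name, child) in &node.dirs {
        total += construct(child, &dest.join(name))?;
    }
    Ok(total)
}

/// Permission bits of `path`.
fn mode_of(path: &Path) -> io::Result<u32> {
    Ok(fs::metadata(path)?.permissions().mode() & 0o777)
}
```

Because every directory receives its final mode at creation time, no post-construction chmod walk is needed in this sketch, and file modes are never touched.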
* worker: narrow the directory-cache lock and single-flight construction
The cache write lock was held across syscall-heavy I/O, serializing every
concurrent `get_or_create`:
* On the cache-hit paths, `cache.write()` was held across the whole
`hardlink_directory_tree` (clonefile / per-file hardlink) materialization.
* `evict_lru` ran `set_dir_writable_recursive` + `remove_dir_all` on the
evicted tree under the write lock during a cache miss.
Lock narrowing:
* `acquire_entry` / `release_entry` take the write lock only to bump and
drop a `ref_count` pin and snapshot the entry path; the
`hardlink_directory_tree` materialization runs fully unlocked. The pin is
what makes this safe — `evict_lru` never selects an entry with
`ref_count > 0`, so the cache tree cannot be deleted mid-hardlink. The
newly constructed entry is likewise inserted pre-pinned (`ref_count: 1`)
and unpinned only after its destination hardlink completes; otherwise a
concurrent miss for an unrelated digest could evict the brand-new entry
(its `last_access` is recent but it is the only unpinned one) while this
caller is still hardlinking from it.
* `evict_if_needed` / `evict_lru` are now pure in-memory: they select
victims and remove them from the map under the lock, returning the
victim paths. `dispatch_evictions` then performs the chmod + removal on
a `background_spawn` task, off the lock.
Single-flight: the existing per-digest construction mutex already ensures a
digest is constructed once while N callers wait; this commit additionally
unmaps the per-digest mutex (`forget_construction_lock`) once construction
finishes so `construction_locks` no longer grows unbounded over the worker's
lifetime. Unmapping is race-free: a waiter has already cloned the `Arc<Mutex>`
before blocking, and a late arrival that creates a fresh mutex still re-checks
the cache, finds the entry, and takes the fast hardlink path — never a
redundant construct.
`ref_count` / `CachedDirectoryMetadata` semantics are unchanged; the
hit/miss return contract is unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
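The pin-then-work-unlocked pattern described above can be modelled in memory. This is a simplified sketch: the `Cache` and `Entry` types are hypothetical, `last_access` is a plain counter, and only `acquire_entry`, `release_entry`, `evict_lru`, and `ref_count` are names taken from the commit message (their bodies here are assumptions).

```rust
use std::collections::HashMap;
use std::path::PathBuf;
use std::sync::RwLock;

struct Entry {
    path: PathBuf,
    last_access: u64,
    ref_count: usize,
}

struct Cache {
    entries: RwLock<HashMap<String, Entry>>,
}

impl Cache {
    fn new() -> Self {
        Cache { entries: RwLock::new(HashMap::new()) }
    }

    fn insert(&self, digest: &str, path: &str, last_access: u64) {
        let entry = Entry { path: PathBuf::from(path), last_access, ref_count: 0 };
        self.entries.write().unwrap().insert(digest.to_string(), entry);
    }

    /// Takes the lock only long enough to pin the entry and copy its path;
    /// the caller materializes from that path with the lock released.
    fn acquire_entry(&self, digest: &str) -> Option<PathBuf> {
        let mut map = self.entries.write().unwrap();
        let entry = map.get_mut(digest)?;
        entry.ref_count += 1;
        Some(entry.path.clone())
    }

    /// Drops the pin taken by `acquire_entry`.
    fn release_entry(&self, digest: &str) {
        if let Some(entry) = self.entries.write().unwrap().get_mut(digest) {
            entry.ref_count = entry.ref_count.saturating_sub(1);
        }
    }

    /// Pure in-memory eviction: removes the least recently used unpinned
    /// entry from the map and returns its path for deletion off the lock.
    fn evict_lru(&self) -> Option<PathBuf> {
        let mut map = self.entries.write().unwrap();
        let victim = map
            .iter()
            .filter(|(_, e)| e.ref_count == 0)
            .min_by_key(|(_, e)| e.last_access)
            .map(|(digest, _)| digest.clone())?;
        map.remove(&victim).map(|e| e.path)
    }
}
```

In this model a pinned entry is skipped even when it is the oldest one, so the path returned by `acquire_entry` stays valid until `release_entry` is called.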
* address merge interactions with read-only CAS
* remove inode stat in test
* update dependencies to 1.3.1
* worker: materialize executable inputs by hardlink to a created-once 0o555 variant (fix ETXTBSY)
Materializing an executable input via a per-action `std::fs::copy` opened a
writable fd in the worker's hot prepare path. Under fork-heavy concurrency a
sibling action's forked child could inherit that fd, and a concurrent `execve`
of the executable then failed with `ETXTBSY` ("Text file busy", os error 26) —
seen on Linux RBE (k8) building rules_go's `builder`. macOS was largely shielded
because its directory-cache path uses APFS `clonefile(2)` (a distinct COW inode
per action), but the per-file `download_to_directory` fallback hardlinks on both
platforms, so the regression spanned both.
Fix (keep the hot path hardlink-only — no writable fd):
- nativelink-store: add `FilesystemStore::get_executable_hardlink_source`. The
CAS blob is read-only 0o444 and shared by hardlink, so it cannot carry +x and
must never be chmod'd (#2347). This creates a per-digest 0o555 variant exactly
once (single-flight), copy -> chmod -> fsync -> atomic rename, so the writer fd
is closed before the inode is ever hardlinked or executed. Stored in a sibling
`{content_path}.exec` dir (ignored by the content/temp scan + prune) and
cleared on startup. On APFS the copy is itself a `clonefile`.
- download_to_directory: executables now hardlink that shared 0o555 variant and
non-executables hardlink the 0o444 CAS blob. A private copy is used only for
the rare custom unix_mode / mtime case, applied to a private inode.
The macOS `clonefile` materialization (`hardlink_directory_tree`, #2349) and the
directory cache's executable handling are left untouched, preserving the macOS
speedup.
Test: executable_hardlink_source_created_once_and_readonly asserts the variant is
0o555, a separate inode from the 0o444 blob, stable across calls, leaves the blob
untouched, and hardlinks into an executable. nativelink-store 243/0,
nativelink-worker 88/0.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
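The copy -> chmod -> fsync -> atomic rename sequence for the created-once executable variant could be sketched like this. It is an assumption-laden illustration, not `FilesystemStore::get_executable_hardlink_source`: `ensure_exec_variant` is a hypothetical name, the single-flight locking is omitted, and the variant directory layout is invented for the example.

```rust
use std::fs::{self, File};
use std::io;
use std::os::unix::fs::{MetadataExt, PermissionsExt};
use std::path::{Path, PathBuf};

/// Hypothetical sketch: returns the path of a 0o555 copy of `blob`, creating
/// it on first use. Callers would hardlink this variant, never write to it.
fn ensure_exec_variant(blob: &Path, exec_dir: &Path, digest: &str) -> io::Result<PathBuf> {
    let variant = exec_dir.join(digest);
    if variant.exists() {
        return Ok(variant); // created once, reused afterwards
    }
    fs::create_dir_all(exec_dir)?;
    let tmp = exec_dir.join(format!("{digest}.tmp"));
    let _ = fs::remove_file(&tmp); // clear a leftover from an interrupted run
    // The writable fd lives only inside `copy` and is closed on return.
    fs::copy(blob, &tmp)?;
    fs::set_permissions(&tmp, fs::Permissions::from_mode(0o555))?;
    // Flush data and metadata before the file becomes visible under its
    // final name.
    File::open(&tmp)?.sync_all()?;
    // rename(2) within one directory is atomic: readers see either no
    // variant or the complete one.
    fs::rename(&tmp, &variant)?;
    Ok(variant)
}

/// Permission bits of `path`.
fn mode_of(path: &Path) -> io::Result<u32> {
    Ok(fs::metadata(path)?.permissions().mode() & 0o777)
}

/// Inode number of `path`.
fn ino_of(path: &Path) -> io::Result<u64> {
    Ok(fs::metadata(path)?.ino())
}
```

The ordering is the point of the sketch: the variant only appears under its final name after the writer has finished, so nothing can hardlink or exec an inode that still has a writable descriptor open from this function.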
* filesystem_store: use nativelink spawn_blocking! macro (clippy disallowed_methods)
tokio::task::spawn_blocking is banned by clippy.toml in favor of
nativelink-util's spawn_blocking! macro (adds the tracing span +
JoinHandleDropGuard). Fixes the -D clippy::disallowed-methods CI failure on
get_executable_hardlink_source's executable-variant creation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* filesystem_store: gate executable-variant machinery to unix (fix Windows build)
The executable 0o555 variant (and its single-flight map, variant path, .exec
dir, and spawn_blocking copy) only exists to carry the unix executable bit and
dodge the unix ETXTBSY race. On Windows it was dead code, failing the build
under -D warnings (unused import spawn_blocking, never-read executable_locks,
never-used executable_variant_path). Gate all of it (and the HashMap / Mutex /
EXECUTABLE_DIR_SUFFIX it pulls in) behind #[cfg(unix)]; the existing
#[cfg(not(unix))] get_executable_hardlink_source just hardlinks the CAS blob.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Ernesto Cambuston <e.cambuston@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Description
`DirectoryCache::evict_lru` calls `set_readwrite_recursive` to make the cache entry writable before `remove_dir_all`. That helper chmods both directories AND files. The files in the cache entry are hardlinked from the CAS, so a chmod on the cache-side path changes the shared inode and mutates the CAS file's mode mid-flight.
End result: a cached executable like `cc_wrapper.sh` with CAS mode `0o555` gets silently clobbered to `0o644` when an unrelated DirectoryCache eviction runs. The next action that materializes that CAS file hits `EACCES` on exec. The symptom looks like a totally unrelated bug because the eviction and the failing action are in different code paths.
Fix: introduce `set_dir_writable_recursive` in `nativelink-util/src/fs_util.rs` that chmods only directories (which is all you need to enable `unlink` on the files inside), and swap `evict_lru` to use it.
Why this matters
This is silent data corruption that may be happening in any deployment running the DirectoryCache today. The symptom (EACCES from a cached file) presents as "my toolchain wrapper script suddenly stopped working": operators correlate it with whatever they last touched, not with the eviction path of an unrelated action.
Relationship to in-flight PRs
The two existing callers of `set_readwrite_recursive` are being replaced by safer variants across two PRs:
- `DirectoryCache::evict_lru` caller with `set_dir_writable_recursive`
- `hardlink_directory_tree` caller (introduced by macOS: APFS clonefile fast path + concurrency cap + zero-byte fix for Bazel input materialization #2338) with `chmod_dir_writable`
After both land, `set_readwrite_recursive` and its private helper `set_readwrite_one_path` will have zero callers and can be removed in a small follow-up PR.
Provenance
Equivalent to upstream commit `a47774d544` from #2243. Adapted for our codebase: the original commit was ~32 LOC removing a shell-exec `chmod -R u+rw`; our codebase uses an in-Rust helper, so the port had to be reshaped to chmod only directories (preserving `remove_dir_all`'s ability to unlink files inside) rather than skipping chmod entirely.
Type of change
How Has This Been Tested?
Added `test_eviction_cleanup_preserves_hardlinked_file_mode` in `nativelink-worker/src/directory_cache.rs`. The test creates a file with mode `0o444`, hardlinks it into a cache entry (simulating CAS sharing), runs the eviction path, and asserts the original file's mode is unchanged.
Empirical FAIL-at-HEAD~1 / PASS-at-HEAD proof: locally swapped the fix back to call `set_readwrite_recursive` and re-ran. The test fails with `assertion left == right failed: eviction cleanup mutated the inode mode (was 0o444, now 0o644)`. Restored the fix → passes.
- `cargo build -p nativelink-util -p nativelink-worker` clean
- `cargo test -p nativelink-worker --lib directory_cache::` 2/2 pass (3/3 post-rebase, includes upstream's `test_directory_cache_zero_byte_file`)
- `cargo clippy -p nativelink-worker --lib --tests -- -D warnings` clean for changed files
- `cargo fmt --check` clean
- `clippy.toml` - no matches in diff
Checklist
- `bazel test //...` passes locally (verified via `cargo` only)