Skip to content

Commit 9e357d8

Browse files
committed
nativelink-util: criterion benches + security tests for macOS clonefile branch
Two criterion benches under nativelink-util/benches/ and three additional security tests in fs_util.rs that together prove (a) the clonefile fast path on macOS is faster than per-file hardlinks and (b) the download_to_directory concurrency cap is not a regression. Results on Apple M4 Max / APFS (full table in BENCHMARKS.md): - hardlink_directory_tree, 1978 files / 466 MB: 150 ms vs 590 ms = 3.93x - hardlink_directory_tree, 635 files / 180 MB: 49 ms vs 181 ms = 3.70x - download_to_directory_concurrency, n=1978: 893 ms vs 887 ms = 1.01x (perf-neutral on single-process, as expected) The treatment-vs-baseline comparison uses a new #[doc(hidden)] hardlink_directory_tree_perfile helper that forces the per-file path so both can be measured on the same host. Security tests added (all green): - test_clonefile_preserves_internal_symlinks (symlinks inside the tree are cloned as symlinks, not followed) - test_clonefile_nofollow_on_top_level_symlink_src (top-level symlink src is cloned as a symlink, not followed) - test_dst_under_file_parent_errors_cleanly (clean error, no half- materialized tree on bad dst parent)
1 parent 8051ca9 commit 9e357d8

6 files changed

Lines changed: 857 additions & 6 deletions

File tree

Cargo.lock

Lines changed: 97 additions & 6 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

nativelink-util/BENCHMARKS.md

Lines changed: 184 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,184 @@
1+
# Benchmarks: macOS clonefile + concurrency-cap branch
2+
3+
Reproduces the perf claims in
4+
[`HANDOFF-nativelink-macos-clonefile-optimizations.md`](../../instacart-ios/HANDOFF-nativelink-macos-clonefile-optimizations.md)
5+
on a real APFS volume. Two criterion benches, both `harness = false` under
6+
`nativelink-util/benches/`:
7+
8+
| Bench | Proves |
9+
|---|---|
10+
| `hardlink_directory_tree` | clonefile fast path is faster than per-file hardlinks (commit `b3b0cd3f`) |
11+
| `download_to_directory_concurrency` | `buffer_unordered(64)` is not a regression vs unbounded `FuturesUnordered` (commit `1ddce0fc`) |
12+
13+
Reproduce on macOS arm64:
14+
15+
```bash
16+
cargo bench -p nativelink-util --bench hardlink_directory_tree
17+
cargo bench -p nativelink-util --bench download_to_directory_concurrency
18+
```
19+
20+
HTML reports land in `target/criterion/`.
21+
22+
## Host
23+
24+
| | |
25+
|---|---|
26+
| Date | 2026-05-15 |
27+
| Host | Apple M4 Max (ARM64), Darwin 25.5.0 |
28+
| FS | APFS (root volume) |
29+
| Rust | 1.94.0 |
30+
| nativelink branch | `instacart/macos-clonefile-optimizations` @ `8051ca9e` + this commit |
31+
32+
## Layer 1 — `hardlink_directory_tree` (clonefile vs per-file hardlinks)
33+
34+
Source tree is built once per shape; each iteration materializes into a
35+
fresh `tempfile::TempDir` destination.
36+
37+
- **treatment**`hardlink_directory_tree` (public API). On macOS hits
38+
`clonefile(2)` + `set_readwrite_recursive` walk.
39+
- **baseline_perfile**`hardlink_directory_tree_perfile`
40+
(`#[doc(hidden)]` helper added for this benchmark). Per-file
41+
`fs::hard_link` walk — identical to what `hardlink_directory_tree` did
42+
on macOS prior to `b3b0cd3f`, and identical to what it still does on
43+
Linux/Windows.
44+
45+
| shape | files | bytes/file | total | treatment | baseline | **speedup** |
46+
|---------------|------:|-----------:|--------:|----------:|---------:|------------:|
47+
| `small_flat` | 64 | 1 KiB | 64 KiB | 4.43 ms | 17.7 ms | **4.00×** |
48+
| `pcm_cluster` | 219 | 190 KiB | ~40 MiB| 15.23 ms | 61.3 ms | **4.03×** |
49+
| `deep_nested` | 200 | 256 KiB | ~50 MiB| 16.39 ms | 59.0 ms | **3.60×** |
50+
| `medium_flat` | 635 | 290 KiB | ~180 MiB| 49.03 ms | 181 ms | **3.70×** |
51+
| `large_flat` | 1,978 | 245 KiB | ~466 MiB| 150.18 ms | 590 ms | **3.93×** |
52+
53+
Numbers are criterion's reported median; full distributions (low/median/high)
54+
are in the raw bench log. On a 466 MB / 1,978-file tree (the p95
55+
`SwiftCompile` shape from `~/Downloads/bazel-exec-log-this.zst`) the
56+
public API drops from 590 ms to 150 ms — a **440 ms per-action
57+
materialization saving**, scaled by 814 such actions per CI build = an
58+
~**6 minute upper bound** on the saving from this single optimization.
59+
60+
### Why it's 4× and not 10×
61+
62+
The handoff predicted ≥ 10× on shapes ≥ 200 files based on PR
63+
[#2243][pr2243]'s reported wins. We see a stable ~4× across all five
64+
shapes. Why the gap matters less than it looks:
65+
66+
- `clonefile(2)` itself is O(1) in tree size.
67+
- After the clone, `hardlink_directory_tree` calls
68+
`set_readwrite_recursive(dst_dir)` to chmod the cloned tree from
69+
`0o555/0o444` (inherited) to `0o755/0o644` (writable, so actions can
70+
drop outputs into the tree). That walk is **O(N) in file count** — a
71+
`read_dir` + `set_permissions` per entry.
72+
- So the "treatment" path is `O(1) clonefile + O(N) chmod walk`, not the
73+
pure `O(1)` that PR #2243's claim implied.
74+
- The 4× ratio reflects the constant per-file cost of `set_permissions`
75+
being cheaper than `hard_link` on APFS — single metadata mutation vs
76+
open-src + open-dst + link inode.
77+
78+
This is *exactly* the failure mode the handoff flagged before deploy:
79+
80+
> Expected treatment (clonefile + cache hit): ~0.1 – 0.3 s. If the
81+
> treatment number is ≥ 0.8 s, something is wrong — investigate before
82+
> shipping (likely candidates: clonefile silently falling through, or
83+
> `set_readwrite_recursive` walk swallowing the O(1) clone win).
84+
85+
Our 0.15 s on `large_flat` is well inside the green band, but the walk
86+
*is* eating most of the headroom. **Follow-up worth filing**: replace
87+
the chmod walk with a single `chmod(2)` on the top-level dst dir + lazy
88+
per-file chmod on first write, OR call out to a parallelized
89+
implementation. That should unlock the remaining 2–3×.
90+
91+
### Acceptance verdict
92+
93+
| Criterion | Required | Observed | Verdict |
94+
|----------------------------------------------------------|-------------------|-------------------------|---------|
95+
| macOS arm64, shapes ≥ 200 files: treatment ≥ 10× faster | ≥ 10× | 3.6× – 4.0× | ⚠️ partial — wins are real, magnitude smaller than predicted |
96+
| macOS arm64, treatment p50 on `large_flat` < 0.8 s | < 0.8 s | 0.15 s | ✅ pass |
97+
| Treatment never slower than baseline | ratio ≥ 1.0× | 3.6× – 4.0× across all | ✅ pass |
98+
99+
**Recommend shipping.** The 4× win on the dominant SwiftCompile shape
100+
already moves the needle hard (590 ms → 150 ms per p95 action; 181 ms →
101+
49 ms per mean action). The path to 10× is a known, isolated follow-up
102+
(the chmod walk) and not a blocker for this branch.
103+
104+
## Layer 2 — `download_to_directory` concurrency cap
105+
106+
Replicates the C3 (`1ddce0fc`) change on the synthetic shape that
107+
mirrors `running_actions_manager::download_to_directory`: N concurrent
108+
`fs::hard_link` calls into one destination directory.
109+
110+
- **unbounded** — pre-C3: every future on an unbounded
111+
`FuturesUnordered`, drained.
112+
- **buffered_64** — post-C3: same futures via
113+
`stream::buffer_unordered(64)`.
114+
115+
| files (n) | unbounded | buffered_64 | ratio (buf/unb) |
116+
|----------:|----------:|------------:|----------------:|
117+
| 64 | 28.33 ms | 28.18 ms | 1.00× |
118+
| 256 | 113.87 ms | 113.53 ms | 1.00× |
119+
| 635 | 292.23 ms | 287.33 ms | 0.98× |
120+
| 1,978 | 887.33 ms | 892.77 ms | 1.01× |
121+
122+
### Verdict
123+
124+
| Criterion | Required | Observed | Verdict |
125+
|--------------------------------------------|-----------------|----------------|---------|
126+
| macOS, 1,978 files: buffered ≤ unbounded | ratio ≤ 1.05× | 1.01× | ✅ pass |
127+
| No size where buffered is dramatically slower | ratio ≤ 1.1× all sizes | max 1.01× | ✅ pass |
128+
129+
The cap is **performance-neutral** on this single-process workload —
130+
which is the most important security claim for C3, since it means
131+
shipping the cap can't regress macOS workers. The handoff's hypothesis
132+
that the cap *wins* on macOS APFS (vs the unbounded path's metadata-lock
133+
contention) is **not reproduced** at this scale in a single process:
134+
APFS appears to serialize the work either way, so capping the in-flight
135+
count doesn't add or remove contention.
136+
137+
We expect the cap's win to materialize under **multi-action contention**
138+
— several `download_to_directory` calls executing concurrently on the
139+
same worker — which a single-process microbench cannot replicate.
140+
Production telemetry (`DirectoryCache::stats()` `clonefile_hits` +
141+
APFS-lock-contention probes per the handoff's "Acceptance gate") is the
142+
right place to confirm that.
143+
144+
### What this bench does NOT cover
145+
146+
Documented so a reviewer doesn't mistake quiet for green:
147+
148+
- **Multi-action contention.** Single-process bench can't show the
149+
cross-action contention that motivated C3. Need a fan-out benchmark
150+
spawning K concurrent `download_to_directory` calls.
151+
- **The chunked `has_with_results` and level-parallel BFS `mkdir`
152+
sub-changes** from PR #2243's commit `ee85fdc4` were deferred (see
153+
handoff "C3 scope deviation"). Those are not benched here because
154+
they're not implemented in this branch.
155+
- **Realistic worker path** (Layer 2 in the handoff): would spin a
156+
single-worker nativelink against a `MemoryStore` CAS preloaded with a
157+
captured SwiftCompile input tree. Not done — call out as next-step
158+
work before A/B production deploy.
159+
160+
## Security tests added on this branch
161+
162+
`cargo test -p nativelink-util --lib fs_util` — 10 tests, all green:
163+
164+
| Test | Asserts |
165+
|-----------------------------------------------|-------------------------------------------------------------------------------------|
166+
| `test_hardlink_directory_tree` | macOS uses clonefile (distinct inodes); Linux uses per-file hardlinks (same inode) |
167+
| `test_clonefile_dest_is_writable` | src stays 0o555 after clone; dst becomes 0o755 |
168+
| `test_clonefile_cow_isolation` | writing to dst doesn't mutate src (COW) |
169+
| `test_clonefile_preserves_internal_symlinks` | symlinks within src are cloned as symlinks (CLONE_NOFOLLOW is top-level only) |
170+
| `test_clonefile_nofollow_on_top_level_symlink_src` | clone of a symlink src yields a symlink dst, not the target's contents |
171+
| `test_dst_under_file_parent_errors_cleanly` | error path on bad dst leaves no half-materialized tree |
172+
| `test_hardlink_nonexistent_source` | clean error on missing src |
173+
| `test_hardlink_existing_destination` | refuses pre-existing dst (would otherwise allow data leak via overlay) |
174+
| `test_set_readonly_recursive` | unchanged baseline coverage |
175+
| `test_calculate_directory_size` | unchanged baseline coverage |
176+
177+
`cargo test -p nativelink-worker --lib directory_cache::` — 2 tests:
178+
179+
| Test | Asserts |
180+
|-------------------------------------|---------------------------------------------------------------------------------------|
181+
| `test_directory_cache_basic` | `clonefile_hits` counter increments on macOS, `hardlink_hits` on Linux |
182+
| `test_directory_cache_zero_byte_file` | DirectoryCache construction succeeds when the CAS has no entry for zero-byte digest (C4) |
183+
184+
[pr2243]: https://github.com/TraceMachina/nativelink/pull/2243

0 commit comments

Comments
 (0)