|
| 1 | +# Benchmarks: macOS clonefile + concurrency-cap branch |
| 2 | + |
| 3 | +Reproduces the perf claims in |
| 4 | +[`HANDOFF-nativelink-macos-clonefile-optimizations.md`](../../instacart-ios/HANDOFF-nativelink-macos-clonefile-optimizations.md) |
| 5 | +on a real APFS volume. Two criterion benches, both `harness = false` under |
| 6 | +`nativelink-util/benches/`: |
| 7 | + |
| 8 | +| Bench | Proves | |
| 9 | +|---|---| |
| 10 | +| `hardlink_directory_tree` | clonefile fast path is faster than per-file hardlinks (commit `b3b0cd3f`) | |
| 11 | +| `download_to_directory_concurrency` | `buffer_unordered(64)` is not a regression vs unbounded `FuturesUnordered` (commit `1ddce0fc`) | |
| 12 | + |
| 13 | +Reproduce on macOS arm64: |
| 14 | + |
| 15 | +```bash |
| 16 | +cargo bench -p nativelink-util --bench hardlink_directory_tree |
| 17 | +cargo bench -p nativelink-util --bench download_to_directory_concurrency |
| 18 | +``` |
| 19 | + |
| 20 | +HTML reports land in `target/criterion/`. |
| 21 | + |
| 22 | +## Host |
| 23 | + |
| 24 | +| | | |
| 25 | +|---|---| |
| 26 | +| Date | 2026-05-15 | |
| 27 | +| Host | Apple M4 Max (ARM64), Darwin 25.5.0 | |
| 28 | +| FS | APFS (root volume) | |
| 29 | +| Rust | 1.94.0 | |
| 30 | +| nativelink branch | `instacart/macos-clonefile-optimizations` @ `8051ca9e` + this commit | |
| 31 | + |
| 32 | +## Layer 1 — `hardlink_directory_tree` (clonefile vs per-file hardlinks) |
| 33 | + |
| 34 | +Source tree is built once per shape; each iteration materializes into a |
| 35 | +fresh `tempfile::TempDir` destination. |
| 36 | + |
| 37 | +- **treatment** — `hardlink_directory_tree` (public API). On macOS hits |
| 38 | + `clonefile(2)` + `set_readwrite_recursive` walk. |
| 39 | +- **baseline_perfile** — `hardlink_directory_tree_perfile` |
| 40 | + (`#[doc(hidden)]` helper added for this benchmark). Per-file |
| 41 | + `fs::hard_link` walk — identical to what `hardlink_directory_tree` did |
| 42 | + on macOS prior to `b3b0cd3f`, and identical to what it still does on |
| 43 | + Linux/Windows. |
| 44 | + |
| 45 | +| shape | files | bytes/file | total | treatment | baseline | **speedup** | |
| 46 | +|---------------|------:|-----------:|--------:|----------:|---------:|------------:| |
| 47 | +| `small_flat` | 64 | 1 KiB | 64 KiB | 4.43 ms | 17.7 ms | **4.00×** | |
| 48 | +| `pcm_cluster` | 219 | 190 KiB | ~40 MiB| 15.23 ms | 61.3 ms | **4.03×** | |
| 49 | +| `deep_nested` | 200 | 256 KiB | ~50 MiB| 16.39 ms | 59.0 ms | **3.60×** | |
| 50 | +| `medium_flat` | 635 | 290 KiB | ~180 MiB| 49.03 ms | 181 ms | **3.70×** | |
| 51 | +| `large_flat` | 1,978 | 245 KiB | ~466 MiB| 150.18 ms | 590 ms | **3.93×** | |
| 52 | + |
| 53 | +Numbers are criterion's reported median; full distributions (low/median/high) |
| 54 | +are in the raw bench log. On a 466 MB / 1,978-file tree (the p95 |
| 55 | +`SwiftCompile` shape from `~/Downloads/bazel-exec-log-this.zst`) the |
| 56 | +public API drops from 590 ms to 150 ms — a **440 ms per-action |
| 57 | +materialization saving**, scaled by 814 such actions per CI build = an |
| 58 | +~**6 minute upper bound** on the saving from this single optimization. |
| 59 | + |
| 60 | +### Why it's 4× and not 10× |
| 61 | + |
| 62 | +The handoff predicted ≥ 10× on shapes ≥ 200 files based on PR |
| 63 | +[#2243][pr2243]'s reported wins. We see a stable ~4× across all five |
| 64 | +shapes. Why the gap matters less than it looks: |
| 65 | + |
| 66 | +- `clonefile(2)` itself is O(1) in tree size. |
| 67 | +- After the clone, `hardlink_directory_tree` calls |
| 68 | + `set_readwrite_recursive(dst_dir)` to chmod the cloned tree from |
| 69 | + `0o555/0o444` (inherited) to `0o755/0o644` (writable, so actions can |
| 70 | + drop outputs into the tree). That walk is **O(N) in file count** — a |
| 71 | + `read_dir` + `set_permissions` per entry. |
| 72 | +- So the "treatment" path is `O(1) clonefile + O(N) chmod walk`, not the |
| 73 | + pure `O(1)` that PR #2243's claim implied. |
| 74 | +- The 4× ratio reflects the constant per-file cost of `set_permissions` |
| 75 | + being cheaper than `hard_link` on APFS — single metadata mutation vs |
| 76 | + open-src + open-dst + link inode. |
| 77 | + |
| 78 | +This is *exactly* the failure mode the handoff flagged before deploy: |
| 79 | + |
| 80 | +> Expected treatment (clonefile + cache hit): ~0.1 – 0.3 s. If the |
| 81 | +> treatment number is ≥ 0.8 s, something is wrong — investigate before |
| 82 | +> shipping (likely candidates: clonefile silently falling through, or |
| 83 | +> `set_readwrite_recursive` walk swallowing the O(1) clone win). |
| 84 | +
|
| 85 | +Our 0.15 s on `large_flat` is well inside the green band, but the walk |
| 86 | +*is* eating most of the headroom. **Follow-up worth filing**: replace |
| 87 | +the chmod walk with a single `chmod(2)` on the top-level dst dir + lazy |
| 88 | +per-file chmod on first write, OR call out to a parallelized |
| 89 | +implementation. That should unlock the remaining 2–3×. |
| 90 | + |
| 91 | +### Acceptance verdict |
| 92 | + |
| 93 | +| Criterion | Required | Observed | Verdict | |
| 94 | +|----------------------------------------------------------|-------------------|-------------------------|---------| |
| 95 | +| macOS arm64, shapes ≥ 200 files: treatment ≥ 10× faster | ≥ 10× | 3.6× – 4.0× | ⚠️ partial — wins are real, magnitude smaller than predicted | |
| 96 | +| macOS arm64, treatment p50 on `large_flat` < 0.8 s | < 0.8 s | 0.15 s | ✅ pass | |
| 97 | +| Treatment never slower than baseline | ratio ≥ 1.0× | 3.6× – 4.0× across all | ✅ pass | |
| 98 | + |
| 99 | +**Recommend shipping.** The 4× win on the dominant SwiftCompile shape |
| 100 | +already moves the needle hard (590 ms → 150 ms per p95 action; 181 ms → |
| 101 | +49 ms per mean action). The path to 10× is a known, isolated follow-up |
| 102 | +(the chmod walk) and not a blocker for this branch. |
| 103 | + |
| 104 | +## Layer 2 — `download_to_directory` concurrency cap |
| 105 | + |
| 106 | +Replicates the C3 (`1ddce0fc`) change on the synthetic shape that |
| 107 | +mirrors `running_actions_manager::download_to_directory`: N concurrent |
| 108 | +`fs::hard_link` calls into one destination directory. |
| 109 | + |
| 110 | +- **unbounded** — pre-C3: every future on an unbounded |
| 111 | + `FuturesUnordered`, drained. |
| 112 | +- **buffered_64** — post-C3: same futures via |
| 113 | + `stream::buffer_unordered(64)`. |
| 114 | + |
| 115 | +| files (n) | unbounded | buffered_64 | ratio (buf/unb) | |
| 116 | +|----------:|----------:|------------:|----------------:| |
| 117 | +| 64 | 28.33 ms | 28.18 ms | 1.00× | |
| 118 | +| 256 | 113.87 ms | 113.53 ms | 1.00× | |
| 119 | +| 635 | 292.23 ms | 287.33 ms | 0.98× | |
| 120 | +| 1,978 | 887.33 ms | 892.77 ms | 1.01× | |
| 121 | + |
| 122 | +### Verdict |
| 123 | + |
| 124 | +| Criterion | Required | Observed | Verdict | |
| 125 | +|--------------------------------------------|-----------------|----------------|---------| |
| 126 | +| macOS, 1,978 files: buffered ≤ unbounded | ratio ≤ 1.05× | 1.01× | ✅ pass | |
| 127 | +| No size where buffered is dramatically slower | ratio ≤ 1.1× all sizes | max 1.01× | ✅ pass | |
| 128 | + |
| 129 | +The cap is **performance-neutral** on this single-process workload — |
| 130 | +which is the most important security claim for C3, since it means |
| 131 | +shipping the cap can't regress macOS workers. The handoff's hypothesis |
| 132 | +that the cap *wins* on macOS APFS (vs the unbounded path's metadata-lock |
| 133 | +contention) is **not reproduced** at this scale in a single process: |
| 134 | +APFS appears to serialize the work either way, so capping the in-flight |
| 135 | +count doesn't add or remove contention. |
| 136 | + |
| 137 | +We expect the cap's win to materialize under **multi-action contention** |
| 138 | +— several `download_to_directory` calls executing concurrently on the |
| 139 | +same worker — which a single-process microbench cannot replicate. |
| 140 | +Production telemetry (`DirectoryCache::stats()` `clonefile_hits` + |
| 141 | +APFS-lock-contention probes per the handoff's "Acceptance gate") is the |
| 142 | +right place to confirm that. |
| 143 | + |
| 144 | +### What this bench does NOT cover |
| 145 | + |
| 146 | +Documented so a reviewer doesn't mistake quiet for green: |
| 147 | + |
| 148 | +- **Multi-action contention.** Single-process bench can't show the |
| 149 | + cross-action contention that motivated C3. Need a fan-out benchmark |
| 150 | + spawning K concurrent `download_to_directory` calls. |
| 151 | +- **The chunked `has_with_results` and level-parallel BFS `mkdir` |
| 152 | + sub-changes** from PR #2243's commit `ee85fdc4` were deferred (see |
| 153 | + handoff "C3 scope deviation"). Those are not benched here because |
| 154 | + they're not implemented in this branch. |
| 155 | +- **Realistic worker path** (Layer 2 in the handoff): would spin a |
| 156 | + single-worker nativelink against a `MemoryStore` CAS preloaded with a |
| 157 | + captured SwiftCompile input tree. Not done — call out as next-step |
| 158 | + work before A/B production deploy. |
| 159 | + |
| 160 | +## Security tests added on this branch |
| 161 | + |
| 162 | +`cargo test -p nativelink-util --lib fs_util` — 10 tests, all green: |
| 163 | + |
| 164 | +| Test | Asserts | |
| 165 | +|-----------------------------------------------|-------------------------------------------------------------------------------------| |
| 166 | +| `test_hardlink_directory_tree` | macOS uses clonefile (distinct inodes); Linux uses per-file hardlinks (same inode) | |
| 167 | +| `test_clonefile_dest_is_writable` | src stays 0o555 after clone; dst becomes 0o755 | |
| 168 | +| `test_clonefile_cow_isolation` | writing to dst doesn't mutate src (COW) | |
| 169 | +| `test_clonefile_preserves_internal_symlinks` | symlinks within src are cloned as symlinks (CLONE_NOFOLLOW is top-level only) | |
| 170 | +| `test_clonefile_nofollow_on_top_level_symlink_src` | clone of a symlink src yields a symlink dst, not the target's contents | |
| 171 | +| `test_dst_under_file_parent_errors_cleanly` | error path on bad dst leaves no half-materialized tree | |
| 172 | +| `test_hardlink_nonexistent_source` | clean error on missing src | |
| 173 | +| `test_hardlink_existing_destination` | refuses pre-existing dst (would otherwise allow data leak via overlay) | |
| 174 | +| `test_set_readonly_recursive` | unchanged baseline coverage | |
| 175 | +| `test_calculate_directory_size` | unchanged baseline coverage | |
| 176 | + |
| 177 | +`cargo test -p nativelink-worker --lib directory_cache::` — 2 tests: |
| 178 | + |
| 179 | +| Test | Asserts | |
| 180 | +|-------------------------------------|---------------------------------------------------------------------------------------| |
| 181 | +| `test_directory_cache_basic` | `clonefile_hits` counter increments on macOS, `hardlink_hits` on Linux | |
| 182 | +| `test_directory_cache_zero_byte_file` | DirectoryCache construction succeeds when the CAS has no entry for zero-byte digest (C4) | |
| 183 | + |
| 184 | +[pr2243]: https://github.com/TraceMachina/nativelink/pull/2243 |
0 commit comments