Commit 9ab08e3
running_actions_manager: bounded retry on hard_link NotFound to fix loser-entry emplace race

The first reader-side fix (commit a2f4f4a) wrapped `fs::hard_link`
inside `FileEntry::get_file_path_locked` so the per-entry read lock
was held across the syscall. Buildstream CI on PR #2341 STILL
surfaced the same ENOENT "file was likely evicted from cache"
failure, hitting the action's max-retry budget (4 > 3) and
cancelling jobs that referenced files like `usr/bin/perl5.26.1`
across multiple digests.
The first fix protected the WINNER-entry case: a reader holding
`Arc<entry_A>` cannot have entry_A's content file renamed out
mid-hard_link, because a concurrent `unref(entry_A)` (which takes
the same RwLock as a writer) must wait for the reader's read lock.
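
To make the winner-entry guarantee concrete, here is a minimal sketch of the locking relationship, assuming a synchronous `std::sync::RwLock` as a stand-in for whatever lock type `FileEntry` really uses. `Entry`, `unref_can_proceed`, and `winner_case` are hypothetical names for illustration only, not the store's API.

```rust
use std::sync::RwLock;

/// Hypothetical stand-in for a cache entry whose lock guards its on-disk path.
struct Entry {
    path: RwLock<String>,
}

/// Models `unref`: it needs the write lock before it may rename the file away.
fn unref_can_proceed(entry: &Entry) -> bool {
    entry.path.try_write().is_ok()
}

/// Returns (could unref run while the reader held its read lock,
///          could unref run after the reader released it).
fn winner_case() -> (bool, bool) {
    let entry = Entry {
        path: RwLock::new("content_path/d".to_string()),
    };

    // The reader takes the read lock and keeps it for the whole "syscall".
    let guard = entry.path.read().unwrap();
    let _path_len = guard.len(); // the hard link would use the guarded path here
    let during = unref_can_proceed(&entry);

    // Only once the reader is done can the writer side move the file.
    drop(guard);
    let after = unref_can_proceed(&entry);

    (during, after)
}
```

In this sketch `winner_case()` yields `(false, true)`: the writer side is blocked for as long as the reader's guard is alive, which is the property the first fix relied on.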
It did NOT protect the LOSER-entry case:
  1. Reader R calls `get_file_entry_for_digest(d)` and the
     evicting map returns `Arc<entry_A>` (the entry currently
     under d in the map).
  2. Concurrent writer B finishes an emplace for the SAME key.
     `EvictingMap::insert(B)` displaces A, calling `unref(A)`.
     `unref(A)` takes A's write lock and renames A's file from
     `content_path/<d>` to A's own temp path. A's file is now
     gone from the content path.
  3. R then calls `get_file_path_locked(A)`. The read lock now
     correctly serializes with the (already-completed) unref —
     but R's captured path points at A, and A's file is gone.
     `fs::hard_link` returns ENOENT. The CAS still has the digest
     under entry_B's content path; B is in the map; B's file is
     on disk. Re-fetching the entry from the map returns B and
     a second hard_link attempt against B's path succeeds.
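
The filesystem side of this sequence can be replayed with plain `std::fs` calls. The sketch below is a simplification under stated assumptions: the directory layout (`content/`, `temp/`, `work_file`) and the function name are hypothetical, there is no map or locking, and writer B's file is written after A's rename, which is one possible ordering rather than a claim about the store's exact behaviour.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Replays the loser-entry sequence inside `root`, a scratch directory.
/// Returns the error kind of the first hard link attempt and whether the
/// second attempt (after the winner's file is in place) succeeded.
fn replay_loser_entry_race(root: &Path) -> io::Result<(io::ErrorKind, bool)> {
    let content = root.join("content");
    let temp = root.join("temp");
    fs::create_dir_all(&content)?;
    fs::create_dir_all(&temp)?;
    let content_path = content.join("digest_d");
    let dest = root.join("work_file"); // where the reader wants its hard link

    // Step 1: entry A's file is at the content path; the reader captures it.
    fs::write(&content_path, b"entry A")?;
    let captured = content_path.clone();

    // Step 2: displacement renames A's file out to a temp path.
    fs::rename(&content_path, temp.join("entry_a"))?;

    // Step 3: the reader's hard link against the captured path fails.
    let first = match fs::hard_link(&captured, &dest) {
        Err(err) => err.kind(),
        Ok(()) => return Err(io::Error::new(io::ErrorKind::Other, "link unexpectedly succeeded")),
    };

    // The winning writer's file is present at the content path, so a
    // second attempt after re-resolving the path succeeds.
    fs::write(&content_path, b"entry B")?;
    let second = fs::hard_link(&content_path, &dest).is_ok();

    Ok((first, second))
}
```

Called on an empty scratch directory, the first attempt reports `io::ErrorKind::NotFound` (ENOENT) and the second attempt links successfully.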
Fix: bounded retry inside `download_to_directory`. On Code::NotFound
the reader sleeps for a 10ms backoff (giving any racing writer's
`emplace_file` background spawn time to finish renaming the temp
into the content path), re-fetches the entry from the map (which
now returns the winning writer's entry), and retries the hard_link.
Capped at HARDLINK_MAX_RETRIES = 3 so that genuine eviction-pressure
ENOENT (no writer racing, the digest truly is gone) cannot spin —
the existing "max_bytes too small" guidance is preserved on the
post-budget error path.
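
The shape of the bounded retry can be sketched as follows. This is not the code from `download_to_directory`: it is a synchronous approximation using `std::thread::sleep` and `std::io::Error`, and `retry_on_not_found` and `HARDLINK_RETRY_BACKOFF` are hypothetical names. Only `HARDLINK_MAX_RETRIES = 3` and the 10ms backoff come from the description above.

```rust
use std::io;
use std::thread::sleep;
use std::time::Duration;

/// Total attempts allowed: 1 original + 2 retries (value from the commit text).
const HARDLINK_MAX_RETRIES: u32 = 3;
/// Backoff between attempts (10ms per the commit text; the name is assumed).
const HARDLINK_RETRY_BACKOFF: Duration = Duration::from_millis(10);

/// Calls `attempt` until it succeeds, fails with something other than
/// NotFound, or the attempt budget is spent. `attempt` is expected to
/// re-fetch the entry from the map and then try the hard link.
/// On success, returns the value and the number of attempts used.
fn retry_on_not_found<T>(mut attempt: impl FnMut() -> io::Result<T>) -> io::Result<(T, u32)> {
    let mut tries: u32 = 0;
    loop {
        tries += 1;
        match attempt() {
            Ok(value) => return Ok((value, tries)),
            // Only NotFound is retried, and only while budget remains.
            Err(err) if err.kind() == io::ErrorKind::NotFound && tries < HARDLINK_MAX_RETRIES => {
                sleep(HARDLINK_RETRY_BACKOFF);
            }
            // Budget spent or a different error: surface it with context.
            Err(err) => {
                return Err(io::Error::new(
                    err.kind(),
                    format!("could not make hardlink after {} attempts: {}", tries, err),
                ));
            }
        }
    }
}
```

Because the cap counts total attempts, a digest that is genuinely gone costs at most two backoff sleeps before the original error kind is returned to the caller.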
Retry budget choice: 3 attempts (= 1 original + 2 retries).
Production traces show at most one displacement per concurrent
write cycle for a given digest; a single retry resolves the
documented race. Two retries gives one extra slot of headroom for
the rare case where a third writer enters the cycle between
attempts. Going higher risks masking the real eviction-pressure
case the original error message was designed to surface.
Test: download_to_directory_retries_when_entry_evicted_between_lookup_and_hardlink
* Pre-populates digest with entry_A in the evicting_map at
content_path/<d>.
* Constructs a synthetic entry_B pointing at the same content
path (via test-only constructors on `SharedContext` and
`EncodedFilePath` added in this commit) and inserts it under
the same key. The map's real `insert` calls `LenEntry::unref(A)`
which renames A's file out — same code path as a real writer's
displacement.
* Spawns the reader (`download_to_directory`). Spawns a
restorer task that sleeps 2ms then writes fresh bytes back
into content_path/<d> (mimicking writer B's emplace having
completed its rename).
* Without the retry the reader's single hard_link runs against
the still-missing content path and fails with NotFound. With
the retry, attempt 1 fails, the loop sleeps 10ms (during
which the restorer's write lands), attempt 2 finds the file
and the hard_link succeeds.
Empirical FAIL-without-retry / PASS-at-HEAD proof (HEAD~1 behaviour
emulated by disabling the retry):
* Locally toggled HARDLINK_MAX_RETRIES from 3 to 1 (effectively
no retry); the new test failed deterministically with:
"Error { code: NotFound, messages: [\"No such file or
directory (os error 2)\", \"Could not make hardlink ... after
1 attempts ...\"] }"
* Restored HARDLINK_MAX_RETRIES = 3; the new test passes 5/5
runs in ~0.03s each, and the full
running_actions_manager_test binary passes all 32 tests
(the prior 31 + the new one).
Re: TODO #2051 (deadlock with large number of files) — unchanged
from the analysis in the parent commit. The per-FileEntry read
lock is per-digest, not global; concurrency for download_to_directory
remains governed by `fs::hard_link`'s open-file semaphore.
Scaffolding additions in nativelink-store are doc(hidden) and
*_for_test-named:
* `FsEvictingMap` type alias made pub so external tests can
name the evicting_map return type.
* `SharedContext::new_for_test(temp_path, content_path)`.
* `EncodedFilePath::new_content_for_test(shared_ctx, key)`.
* `FilesystemStore::evicting_map_for_test()`.
None are used by production code.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

1 parent a2f4f4a commit 9ab08e3
3 files changed
Lines changed: 483 additions & 55 deletions
File tree
- nativelink-store/src
- nativelink-worker
- src
- tests