Skip to content

Commit 9ab08e3

Browse files
erneestocclaude
andcommitted
running_actions_manager: bounded retry on hard_link NotFound to fix loser-entry emplace race
The first reader-side fix (commit a2f4f4a) wrapped `fs::hard_link` inside `FileEntry::get_file_path_locked` so the per-entry read lock was held across the syscall. Buildstream CI on PR #2341 STILL surfaced the same ENOENT "file was likely evicted from cache" failure, hitting the action's max-retry budget (4 > 3) and cancelling jobs that referenced files like `usr/bin/perl5.26.1` across multiple digests. The first fix protected the WINNER-entry case: a reader holding `Arc<entry_A>` cannot have entry_A's content file renamed out mid-hard_link, because a concurrent `unref(entry_A)` (which takes the same RwLock as a writer) must wait for the reader's read lock. It did NOT protect the LOSER-entry case: 1. Reader R calls `get_file_entry_for_digest(d)` and the evicting map returns `Arc<entry_A>` (the entry currently under d in the map). 2. Concurrent writer B finishes an emplace for the SAME key. `EvictingMap::insert(B)` displaces A, calling `unref(A)`. `unref(A)` takes A's write lock and renames A's file from `content_path/<d>` to A's own temp path. A's file is now gone from the content path. 3. R then calls `get_file_path_locked(A)`. The read lock now correctly serializes with the (already-completed) unref — but R's captured path points at A, and A's file is gone. `fs::hard_link` returns ENOENT. The CAS still has the digest under entry_B's content path; B is in the map; B's file is on disk. Re-fetching the entry from the map returns B and a second hard_link attempt against B's path succeeds. Fix: bounded retry inside `download_to_directory`. On Code::NotFound the reader sleeps for a 10ms backoff (giving any racing writer's `emplace_file` background spawn time to finish renaming the temp into the content path), re-fetches the entry from the map (which now returns the winning writer's entry), and retries the hard_link. Capped at HARDLINK_MAX_RETRIES = 3 so that genuine eviction-pressure ENOENT (no writer racing, the digest truly is gone) cannot spin — the existing "max_bytes too small" guidance is preserved on the post-budget error path. Retry budget choice: 3 attempts (= 1 original + 2 retries). Production traces show at most one displacement per concurrent write cycle for a given digest; a single retry resolves the documented race. Two retries gives one extra slot of headroom for the rare case where a third writer enters the cycle between attempts. Going higher risks masking the real eviction-pressure case the original error message was designed to surface. Test: download_to_directory_retries_when_entry_evicted_between_lookup_and_hardlink * Pre-populates digest with entry_A in the evicting_map at content_path/<d>. * Constructs a synthetic entry_B pointing at the same content path (via test-only constructors on `SharedContext` and `EncodedFilePath` added in this commit) and inserts it under the same key. The map's real `insert` calls `LenEntry::unref(A)` which renames A's file out — same code path as a real writer's displacement. * Spawns the reader (`download_to_directory`). Spawns a restorer task that sleeps 2ms then writes fresh bytes back into content_path/<d> (mimicking writer B's emplace having completed its rename). * Without the retry the reader's single hard_link runs against the still-missing content path and fails with NotFound. With the retry, attempt 1 fails, the loop sleeps 10ms (during which the restorer's write lands), attempt 2 finds the file and the hard_link succeeds. Empirical FAIL-at-HEAD~1 / PASS-at-HEAD proof: * Locally toggled HARDLINK_MAX_RETRIES from 3 to 1 (effectively no retry); the new test failed deterministically with: "Error { code: NotFound, messages: [\"No such file or directory (os error 2)\", \"Could not make hardlink ... after 1 attempts ...\"] }" * Restored HARDLINK_MAX_RETRIES = 3; the new test passes 5/5 runs in ~0.03s each, and the full running_actions_manager_test binary passes all 32 tests (the prior 31 + the new one). Re: TODO #2051 (deadlock with large number of files) — unchanged from the analysis in the parent commit. The per-FileEntry read lock is per-digest, not global; concurrency for download_to_directory remains governed by `fs::hard_link`'s open-file semaphore. Scaffolding additions in nativelink-store are doc(hidden) and *_for_test-named: * `FsEvictingMap` type alias made pub so external tests can name the evicting_map return type. * `SharedContext::new_for_test(temp_path, content_path)`. * `EncodedFilePath::new_content_for_test(shared_ctx, key)`. * `FilesystemStore::evicting_map_for_test()`. None are used by production code. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent a2f4f4a commit 9ab08e3

3 files changed

Lines changed: 483 additions & 55 deletions

File tree

nativelink-store/src/filesystem_store.rs

Lines changed: 50 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -75,6 +75,22 @@ pub struct SharedContext {
7575
content_path: String,
7676
}
7777

78+
impl SharedContext {
79+
/// Test-only constructor that lets external crates fabricate a context
80+
/// matching a FilesystemStore's configured paths so they can build
81+
/// synthetic `EncodedFilePath` / `FileEntryImpl` instances for race
82+
/// simulations. Production code receives this only via `FilesystemStore`
83+
/// construction.
84+
#[doc(hidden)]
85+
pub const fn new_for_test(temp_path: String, content_path: String) -> Self {
86+
Self {
87+
active_drop_spawns: AtomicU64::new(0),
88+
temp_path,
89+
content_path,
90+
}
91+
}
92+
}
93+
7894
#[derive(Eq, PartialEq, Debug)]
7995
enum PathType {
8096
Content,
@@ -99,6 +115,23 @@ impl EncodedFilePath {
99115
fn get_file_path(&self) -> Cow<'_, OsStr> {
100116
get_file_path_raw(&self.path_type, self.shared_context.as_ref(), &self.key)
101117
}
118+
119+
/// Test-only constructor. Used by races simulation in nativelink-worker
120+
/// to fabricate a `FileEntryImpl` pointing at a known-on-disk file under
121+
/// the store's content path. Production code must not call this; the
122+
/// store constructs its own `EncodedFilePath` values through the
123+
/// `make_and_open_file` -> `emplace_file` flow.
124+
#[doc(hidden)]
125+
pub const fn new_content_for_test(
126+
shared_context: Arc<SharedContext>,
127+
key: StoreKey<'static>,
128+
) -> Self {
129+
Self {
130+
shared_context,
131+
path_type: PathType::Content,
132+
key,
133+
}
134+
}
102135
}
103136

104137
#[inline]
@@ -444,7 +477,13 @@ pub fn key_from_file(file_name: &str, file_type: FileType) -> Result<StoreKey<'_
444477
/// `add_files_to_cache`.
445478
const SIMULTANEOUS_METADATA_READS: usize = 200;
446479

447-
type FsEvictingMap<'a, Fe> =
480+
/// The concrete `EvictingMap` instantiation used by `FilesystemStore`. This
481+
/// alias is `pub` solely so tests in other crates (e.g. nativelink-worker's
482+
/// `download_to_directory_retries_when_entry_evicted_between_lookup_and_hardlink`)
483+
/// can name the type returned by `FilesystemStore::evicting_map_for_test`.
484+
/// Production code should treat this as an implementation detail.
485+
#[doc(hidden)]
486+
pub type FsEvictingMap<'a, Fe> =
448487
EvictingMap<StoreKeyBorrow, StoreKey<'a>, Arc<Fe>, SystemTime, RemoveItemCallbackHolder>;
449488

450489
async fn add_files_to_cache<Fe: FileEntry>(
@@ -748,6 +787,16 @@ impl<Fe: FileEntry> FilesystemStore<Fe> {
748787
self.weak_self.upgrade()
749788
}
750789

790+
/// Test-only accessor for the internal evicting map. Used by tests that
791+
/// need to deterministically simulate concurrent insert/displace races
792+
/// (e.g. the loser-entry race exercised by
793+
/// `download_to_directory_retries_when_entry_evicted_between_lookup_and_hardlink`
794+
/// in nativelink-worker). Production code must not depend on this.
795+
#[doc(hidden)]
796+
pub const fn evicting_map_for_test(&self) -> &Arc<FsEvictingMap<'static, Fe>> {
797+
&self.evicting_map
798+
}
799+
751800
pub async fn get_file_entry_for_digest(&self, digest: &DigestInfo) -> Result<Arc<Fe>, Error> {
752801
if is_zero_digest(digest) {
753802
return Ok(Arc::new(Fe::create(

nativelink-worker/src/running_actions_manager.rs

Lines changed: 114 additions & 54 deletions
Original file line numberDiff line numberDiff line change
@@ -157,64 +157,124 @@ pub fn download_to_directory<'a>(
157157
file_slot.write_all(&[]).await?;
158158
}
159159
else {
160-
let file_entry = filesystem_store
161-
.get_file_entry_for_digest(&digest)
162-
.await
163-
.err_tip(|| "During hard link")?;
164-
// Hold the read lock on the FileEntry across the
165-
// hard_link call. This closes the reader-side of the
166-
// emplace_file race: the writer side
167-
// (filesystem_store::emplace_file) inserts the entry
168-
// into the evicting map BEFORE it has renamed the
169-
// temp file into place, and holds a write lock on
170-
// the same RwLock for the duration of the rename.
171-
// If we extracted the path and released the lock
172-
// before calling hard_link (as the prior code did),
173-
// a concurrent reader could race in between the
174-
// writer's `insert()` and `rename()` and observe a
175-
// path that does not yet exist on disk -> ENOENT.
160+
// Bounded retry around the hard_link to close the
161+
// SECOND-PASS reader-side emplace race that the
162+
// earlier read-lock fix (commit a2f4f4ab) did not
163+
// cover.
164+
//
165+
// Race recap (companion fix to PR #2341):
166+
//
167+
// The first fix held the per-FileEntry read lock
168+
// across hard_link so that, for the entry the
169+
// reader has in hand, a writer's `unref()`
170+
// (write lock on the same RwLock) cannot rename
171+
// the file out from under the syscall. That
172+
// correctly protects the WINNER entry mid-rename.
173+
//
174+
// It does NOT protect against the LOSER-entry
175+
// case: the reader looks up the digest and gets
176+
// `Arc<entry_A>`, then a concurrent writer
177+
// `insert(entry_B)` for the SAME key displaces
178+
// entry_A and triggers `unref(entry_A)`. The
179+
// writer's unref takes entry_A's write lock; the
180+
// reader's read lock request on entry_A may
181+
// queue behind it. When the reader finally
182+
// acquires the lock, entry_A's file has been
183+
// moved to the temp path and the hard_link
184+
// returns ENOENT. The CAS still has the digest
185+
// — under entry_B in the map — so re-fetching
186+
// `get_file_entry_for_digest` returns entry_B,
187+
// whose file is on disk under the writer's
188+
// read-lock-protected rename. A single retry
189+
// resolves the race; we cap at
190+
// `HARDLINK_MAX_RETRIES` so that genuine
191+
// eviction-pressure ENOENT (no writer racing,
192+
// the digest truly is gone) cannot spin.
176193
//
177194
// Re: TODO #2051 (deadlock with large number of
178195
// files): the lock taken here is a per-FileEntry
179-
// read lock, not a global lock. Multiple concurrent
180-
// hard_link calls against DIFFERENT digests do not
181-
// contend, and multiple readers of the SAME digest
182-
// share the read lock. The only contention is
183-
// reader-vs-writer on the same digest, which is
184-
// exactly the contention we need for correctness.
185-
// The outer concurrency cap on `download_to_directory`
186-
// is governed by `fs::hard_link`'s open-file
187-
// semaphore (see nativelink_util::fs), not by this
188-
// RwLock.
189-
// TODO(#2051): revisit if the file-handle semaphore
190-
// proves insufficient under very large fan-outs.
191-
file_entry
192-
.get_file_path_locked(|src| {
193-
let dest = dest.clone();
194-
async move {
195-
fs::hard_link(src, &dest).await.map_err(|e| {
196-
if e.code == Code::NotFound {
197-
e.append(
198-
format!(
199-
"Could not make hardlink to {dest}, file was likely evicted from cache.\n\
200-
This error often occurs when the filesystem store's max_bytes is too small for your workload.\n\
201-
To fix this issue:\n\
202-
1. Increase the 'max_bytes' value in your filesystem store configuration\n\
203-
2. Example: Change 'max_bytes: 10000000000' to 'max_bytes: 50000000000' (or higher)\n\
204-
3. The setting is typically found in your nativelink.json config under:\n\
205-
stores -> [your_filesystem_store] -> filesystem -> eviction_policy -> max_bytes\n\
206-
4. Restart NativeLink after making the change\n\n\
207-
If this error persists after increasing max_bytes several times, please report at:\n\
208-
https://github.com/TraceMachina/nativelink/issues\n\
209-
Include your config file and both server and client logs to help us assist you."
210-
))
211-
} else {
212-
e.append(format!("Could not make hardlink to {dest}"))
213-
}
214-
})
196+
// read lock, not a global lock. Multiple
197+
// concurrent hard_link calls against DIFFERENT
198+
// digests do not contend, and multiple readers of
199+
// the SAME digest share the read lock. The only
200+
// contention is reader-vs-writer on the same
201+
// digest, which is exactly the contention we need
202+
// for correctness. The outer concurrency cap on
203+
// `download_to_directory` is governed by
204+
// `fs::hard_link`'s open-file semaphore (see
205+
// nativelink_util::fs), not by this RwLock.
206+
// TODO(#2051): revisit if the file-handle
207+
// semaphore proves insufficient under very large
208+
// fan-outs.
209+
const HARDLINK_MAX_RETRIES: u32 = 3;
210+
// Small backoff between retries gives a racing
211+
// writer's `emplace_file` background spawn time
212+
// to finish renaming the temp file into
213+
// `content_path/<digest>` before the next
214+
// attempt's `get_file_entry_for_digest` returns
215+
// the new entry. Without it, all retries can
216+
// race the same write window microseconds apart
217+
// and all observe ENOENT.
218+
const HARDLINK_RETRY_BACKOFF: Duration =
219+
Duration::from_millis(10);
220+
let mut last_err: Option<Error> = None;
221+
for attempt in 0..HARDLINK_MAX_RETRIES {
222+
if attempt > 0 {
223+
tokio::time::sleep(HARDLINK_RETRY_BACKOFF).await;
224+
}
225+
let file_entry = filesystem_store
226+
.get_file_entry_for_digest(&digest)
227+
.await
228+
.err_tip(|| "During hard link")?;
229+
let dest_for_attempt = dest.clone();
230+
let result = file_entry
231+
.get_file_path_locked(|src| {
232+
let dest = dest_for_attempt.clone();
233+
async move { fs::hard_link(src, &dest).await }
234+
})
235+
.await;
236+
match result {
237+
Ok(()) => {
238+
last_err = None;
239+
break;
215240
}
216-
})
217-
.await?;
241+
Err(e)
242+
if e.code == Code::NotFound
243+
&& attempt + 1 < HARDLINK_MAX_RETRIES =>
244+
{
245+
// Reader saw a stale entry that was
246+
// unref'd before we hard_link'd. The
247+
// evicting map should now have the
248+
// winning writer's entry — retry
249+
// after a short backoff (loop
250+
// continues to the next attempt).
251+
last_err = Some(e);
252+
}
253+
Err(e) => {
254+
last_err = Some(e);
255+
break;
256+
}
257+
}
258+
}
259+
if let Some(e) = last_err {
260+
return Err(if e.code == Code::NotFound {
261+
e.append(format!(
262+
"Could not make hardlink to {dest} after {HARDLINK_MAX_RETRIES} attempts, file was likely evicted from cache.\n\
263+
This error often occurs when the filesystem store's max_bytes is too small for your workload.\n\
264+
To fix this issue:\n\
265+
1. Increase the 'max_bytes' value in your filesystem store configuration\n\
266+
2. Example: Change 'max_bytes: 10000000000' to 'max_bytes: 50000000000' (or higher)\n\
267+
3. The setting is typically found in your nativelink.json config under:\n\
268+
stores -> [your_filesystem_store] -> filesystem -> eviction_policy -> max_bytes\n\
269+
4. Restart NativeLink after making the change\n\n\
270+
If this error persists after increasing max_bytes several times, please report at:\n\
271+
https://github.com/TraceMachina/nativelink/issues\n\
272+
Include your config file and both server and client logs to help us assist you."
273+
))
274+
} else {
275+
e.append(format!("Could not make hardlink to {dest}"))
276+
});
277+
}
218278
}
219279
#[cfg(target_family = "unix")]
220280
if let Some(unix_mode) = unix_mode {

0 commit comments

Comments
 (0)