Commit abaa830

Implement fine grain locking for build-dir (#16155)
This PR adds fine grain locking for the build cache using build unit level locking. I'd recommend reading the design details in this description and then reviewing commit by commit.

Part of #4282
Previous attempt: #16089

## Design decisions / rationale

- Still hold `build-dir/<profile>/.cargo-lock`
  - to protect against `cargo clean` (exclusive)
  - changed from exclusive to shared for builds
- Using build unit level locking with a single lock per build unit.
  - Before checking fingerprint freshness we take a shared lock. This prevents reading a fingerprint while another build is active.
  - For units that are dirty, when the job server queues the job we take an exclusive lock to prevent others from reading while we compile.
    - This is done by dropping the shared lock and then acquiring an exclusive lock, rather than downgrading the lock, to protect against deadlock; see #16155 (comment).
  - After the unit's compilation is complete, we downgrade back to a shared lock, allowing other readers.
  - All locks are released at the end of the entire build process.
- artifact-dir was handled in #16307.

For the rationale behind this design, see the discussion [#t-cargo > Build cache and locking design @ 💬](https://rust-lang.zulipchat.com/#narrow/channel/246057-t-cargo/topic/Build.20cache.20and.20locking.20design/near/561677181). A sketch of the resulting per-unit lock lifecycle follows the open questions below.

## Open Questions

- [ ] Do we need rlimit checks and dynamic rlimits? #16155 (comment)
- [ ] Proper handling of the blocking message (#16155 (comment))
  - Update Dec 18 2025: With the updated implementation, we now get the blocking message when taking the initial shared lock, but we get no message when taking the exclusive lock right before compiling.
- [ ] Reduce parallelism when blocking
- [x] How do we want to handle locking on the artifact directory?
  - We could simply continue using coarse grain locking, locking and unlocking when files are uplifted.
    - One downside of locking/unlocking multiple times per invocation is that artifact-dir is touched many times across the compilation process (for example, there is a pre-rustc [clean up step](https://github.com/rust-lang/cargo/blob/master/src/cargo/core/compiler/mod.rs#L402)). We also need to take into account other commands like `cargo doc`.
  - Another option would be to take a lock on the artifact-dir only for commands that we know will uplift files (e.g. `cargo check` would not take a lock on artifact-dir but `cargo build` would). This would mean that two `cargo build` invocations would not run in parallel, because one of them would hold the artifact-dir lock (blocking the other). This might actually be ideal, to avoid two instances fighting over the CPU while recompiling the same crates.
  - Solved by #16307
- [ ] What should our testing strategy for locking be?
  - My testing strategy thus far has been to run cargo on dummy projects to verify the locking.
  - For the max file descriptor testing, I have been using the Zed codebase as a testbed, as it has over 1,500 build units, which is more than the default ulimit on my Linux system. (I am happy to test this on other large codebases that we think would be good to verify against.)
  - It's not immediately obvious to me how to create repeatable unit tests for this, or what those tests should be testing for.
  - For performance testing, I have been using hyperfine to benchmark builds with and without `-Zbuild-dir-new-layout`. With the current implementation I am not seeing any perf regression on Linux, but I have yet to test on Windows/macOS.
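To make the lifecycle concrete, here is a minimal sketch of the shared → exclusive → shared dance for a single dirty unit. It is written against std's file-locking API (stabilized in Rust 1.89) rather than cargo's internal `FileLock` wrapper; `build_one_unit` and `is_dirty` are hypothetical names, and std documents re-locking an already-held lock as platform-dependent, so the in-place downgrade is illustrative only:

```rust
use std::fs::OpenOptions;
use std::io;
use std::path::Path;

/// Hypothetical driver for one build unit's lock lifecycle.
fn build_one_unit(lock_path: &Path, is_dirty: bool) -> io::Result<()> {
    let lock = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .open(lock_path)?;

    // 1. Shared lock before checking fingerprint freshness: other cargo
    //    processes may read this unit concurrently, but none may write it.
    lock.lock_shared()?;

    if is_dirty {
        // 2. Drop the shared lock *before* requesting the exclusive one.
        //    If two processes each held the shared lock and tried to upgrade
        //    in place, each would block forever waiting for the other to
        //    release its shared lock: the classic upgrade deadlock.
        lock.unlock()?;
        lock.lock()?; // exclusive: we are the only writer now
        // ... run rustc, write artifacts and the fingerprint ...

        // 3. Downgrade back to shared once compilation finishes, so other
        //    processes can read the unit while we keep using it.
        lock.lock_shared()?;
    }

    // 4. The shared lock is held for the remainder of the build and released
    //    when `lock` is dropped.
    Ok(())
}
```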
---

<details><summary>Original Design</summary>

- Using build unit level locking instead of a temporary working directory.
  - After experimenting with multiple approaches, I am currently leaning towards build unit level locking.
  - The working directory approach introduces a fair bit of uplifting complexity, and the further along I pushed my prototype, the more I ran into unexpected issues:
    - mtime changes in fingerprints due to uplifting/downlifting order
    - tests/benches need to be run before being uplifted, OR uplifted and locked during execution, which leads to more locking design being needed. (Running pre-uplift also introduces other potential side effects, like the path displayed to the user being deleted, since it is temporary.)
  - The trade-off here is that with build unit level locks, we need a more advanced locking mechanism and we will have more open locks at once.
    - The reason I think this is a worthwhile trade-off is that the locking complexity can largely be contained to a single module, whereas the uplifting complexity would be spread throughout the cargo codebase anywhere we do uplifting. The increased lock count, while unavoidable, can be mitigated (see below for more details).
- Risk of too many locks (file descriptors)
  - On Linux, 1024 is a fairly common default soft limit. Windows is even lower at 256.
  - Having 2 locks per build unit makes it possible to hit these limits with a moderate number of dependencies.
  - There are a few mitigations I could think of for this problem (that are included in this PR):
    - Increasing the file descriptor limits based on the number of build units (if the hard limit is high enough); a sketch of this check follows the collapsed section below.
    - Sharing file descriptors for shared locks across jobs (within a single process) using a virtual lock
      - This could be implemented using reference counting.
    - Falling back to coarse grain locking if some heuristic is not met

### Implementation details

- We have a stateful lock per build unit made up of multiple file locks, `primary.lock` and `secondary.lock` (see the `locking.rs` module docs for more details on the states).
  - This is needed to enable pipelined builds.
- We fall back to coarse grain locking if fine grain locking is determined to be unsafe (see `determine_locking_mode()`).
  - Fine grain locking continues to take the existing `.cargo-lock` lock as RO shared, to keep working with older cargo versions while allowing multiple newer cargo instances to run in parallel.
  - Locking is disabled on network filesystems (keeping existing behavior from #2623).
- `cargo clean` continues to use coarse grain locking for simplicity.
- File descriptors
  - I added functionality to increase the file descriptor limit if cargo detects that there will not be enough based on the number of build units in the `UnitGraph`.
    - If we aren't able to increase to a threshold (currently `number of build units * 10`), we automatically fall back to coarse grain locking and display a warning to the user.
    - I picked 10 times the number of build units as a conservative estimate for now. I think lowering this number may be reasonable.
    - While testing, I was seeing a peak of ~3,200 open file descriptors while compiling Zed. This is approximately 2× the number of build units.
      - Without the `RcFileLock` I was seeing peaks of ~12,000 open fds, which I felt was quite high even for a large project like Zed.
  - We use a global `FileLockInterner` that holds on to the file descriptors (`RcFileLock`) until the end of the process. (We could potentially add it to `JobState` if preferred; it would just be a bit more plumbing.)

See #16155 (comment) for the proposal to transition away from this to the current scheme.

</details>
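To illustrate the fd-limit mitigation described in the collapsed design above: a minimal sketch, assuming Unix and the `libc` crate. `ensure_fd_limit` is a hypothetical helper (not the PR's actual code); it returns `Ok(false)` when cargo should fall back to coarse grain locking, and the `* 10` threshold mirrors the heuristic described above.

```rust
#[cfg(unix)]
fn ensure_fd_limit(unit_count: u64) -> std::io::Result<bool> {
    // Conservative heuristic from the description: 10 fds per build unit.
    let needed = unit_count.saturating_mul(10);
    let mut lim = libc::rlimit {
        rlim_cur: 0,
        rlim_max: 0,
    };
    // SAFETY: `lim` is a valid, writable rlimit struct.
    if unsafe { libc::getrlimit(libc::RLIMIT_NOFILE, &mut lim) } != 0 {
        return Err(std::io::Error::last_os_error());
    }
    if lim.rlim_cur as u64 >= needed {
        return Ok(true); // soft limit already high enough
    }
    if (lim.rlim_max as u64) < needed {
        return Ok(false); // hard limit too low: fall back to coarse grain locking
    }
    // Raise the soft limit to the needed value (validated to be <= hard limit).
    lim.rlim_cur = needed as libc::rlim_t;
    // SAFETY: `lim` holds the validated new limits.
    if unsafe { libc::setrlimit(libc::RLIMIT_NOFILE, &lim) } != 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(true)
}
```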
2 parents 577ab6f + 4235aa2 commit abaa830

13 files changed

Lines changed: 320 additions & 75 deletions


src/cargo/core/compiler/build_runner/compilation_files.rs

Lines changed: 17 additions & 4 deletions
```diff
@@ -250,12 +250,16 @@ impl<'a, 'gctx: 'a> CompilationFiles<'a, 'gctx> {
             false => "-",
         };
         let name = unit.pkg.package_id().name();
-        let meta = self.metas[unit];
-        let hash = meta
+        let hash = self.unit_hash(unit);
+        format!("{name}{separator}{hash}")
+    }
+
+    /// The directory hash to use for a given unit
+    pub fn unit_hash(&self, unit: &Unit) -> String {
+        self.metas[unit]
             .pkg_dir()
             .map(|h| h.to_string())
-            .unwrap_or_else(|| self.target_short_hash(unit));
-        format!("{name}{separator}{hash}")
+            .unwrap_or_else(|| self.target_short_hash(unit))
     }

     /// Returns the final artifact path for the host (`/…/target/debug`)
@@ -296,6 +300,15 @@ impl<'a, 'gctx: 'a> CompilationFiles<'a, 'gctx> {
         self.layout(unit.kind).build_dir().fingerprint(&dir)
     }

+    /// The lock location for a given build unit.
+    pub fn build_unit_lock(&self, unit: &Unit) -> PathBuf {
+        let dir = self.pkg_dir(unit);
+        self.layout(unit.kind)
+            .build_dir()
+            .build_unit(&dir)
+            .join(".lock")
+    }
+
     /// Directory where incremental output for the given unit should go.
     pub fn incremental_dir(&self, unit: &Unit) -> &Path {
         self.layout(unit.kind).build_dir().incremental()
```

src/cargo/core/compiler/build_runner/mod.rs

Lines changed: 8 additions & 1 deletion
```diff
@@ -6,6 +6,7 @@ use std::sync::{Arc, Mutex};

 use crate::core::PackageId;
 use crate::core::compiler::compilation::{self, UnitOutput};
+use crate::core::compiler::locking::LockManager;
 use crate::core::compiler::{self, Unit, UserIntent, artifact};
 use crate::util::cache_lock::CacheLockMode;
 use crate::util::errors::CargoResult;
@@ -87,6 +88,9 @@
     /// because the target has a type error. This is in an Arc<Mutex<..>>
     /// because it is continuously updated as the job progresses.
     pub failed_scrape_units: Arc<Mutex<HashSet<UnitHash>>>,
+
+    /// Manages locks for build units when fine grain locking is enabled.
+    pub lock_manager: Arc<LockManager>,
 }

 impl<'a, 'gctx> BuildRunner<'a, 'gctx> {
@@ -126,6 +130,7 @@ impl<'a, 'gctx> BuildRunner<'a, 'gctx> {
             lto: HashMap::new(),
             metadata_for_doc_units: HashMap::new(),
             failed_scrape_units: Arc::new(Mutex::new(HashSet::new())),
+            lock_manager: Arc::new(LockManager::new()),
         })
     }

@@ -417,7 +422,8 @@
             | UserIntent::Doctest
             | UserIntent::Bench => true,
         };
-        let host_layout = Layout::new(self.bcx.ws, None, &dest, must_take_artifact_dir_lock)?;
+        let host_layout =
+            Layout::new(self.bcx.ws, None, &dest, must_take_artifact_dir_lock, false)?;
         let mut targets = HashMap::new();
         for kind in self.bcx.all_kinds.iter() {
             if let CompileKind::Target(target) = *kind {
@@ -426,6 +432,7 @@
                 Some(target),
                 &dest,
                 must_take_artifact_dir_lock,
+                false,
             )?;
             targets.insert(target, layout);
         }
```

src/cargo/core/compiler/job_queue/job.rs

Lines changed: 6 additions & 0 deletions
```diff
@@ -85,6 +85,12 @@
         let prev = mem::replace(&mut self.work, Work::noop());
         self.work = next.then(prev);
     }
+
+    /// Chains the given work by putting it after our own unit of work.
+    pub fn after(&mut self, next: Work) {
+        let prev = mem::replace(&mut self.work, Work::noop());
+        self.work = prev.then(next);
+    }
 }

 impl fmt::Debug for Job {
```

src/cargo/core/compiler/job_queue/job_state.rs

Lines changed: 15 additions & 1 deletion
```diff
@@ -4,10 +4,11 @@ use std::{cell::Cell, marker, sync::Arc};

 use cargo_util::ProcessBuilder;

-use crate::CargoResult;
 use crate::core::compiler::future_incompat::FutureBreakageItem;
+use crate::core::compiler::locking::LockKey;
 use crate::core::compiler::timings::SectionTiming;
 use crate::util::Queue;
+use crate::{CargoResult, core::compiler::locking::LockManager};

 use super::{Artifact, DiagDedupe, Job, JobId, Message};

@@ -47,6 +48,9 @@
     /// sending a double message later on.
     rmeta_required: Cell<bool>,

+    /// Manages locks for build units when fine grain locking is enabled.
+    lock_manager: Arc<LockManager>,
+
     // Historical versions of Cargo made use of the `'a` argument here, so to
     // leave the door open to future refactorings keep it here.
     _marker: marker::PhantomData<&'a ()>,
@@ -58,12 +62,14 @@
         messages: Arc<Queue<Message>>,
         output: Option<&'a DiagDedupe<'gctx>>,
         rmeta_required: bool,
+        lock_manager: Arc<LockManager>,
     ) -> Self {
         Self {
             id,
             messages,
             output,
             rmeta_required: Cell::new(rmeta_required),
+            lock_manager,
             _marker: marker::PhantomData,
         }
     }
@@ -141,6 +147,14 @@
             .push(Message::Finish(self.id, Artifact::Metadata, Ok(())));
     }

+    pub fn lock_exclusive(&self, lock: &LockKey) -> CargoResult<()> {
+        self.lock_manager.lock(lock)
+    }
+
+    pub fn downgrade_to_shared(&self, lock: &LockKey) -> CargoResult<()> {
+        self.lock_manager.downgrade_to_shared(lock)
+    }
+
     pub fn on_section_timing_emitted(&self, section: SectionTiming) {
         self.messages.push(Message::SectionTiming(self.id, section));
     }
```

src/cargo/core/compiler/job_queue/mod.rs

Lines changed: 2 additions & 1 deletion
```diff
@@ -951,9 +951,10 @@
         let messages = self.messages.clone();
         let is_fresh = job.freshness().is_fresh();
         let rmeta_required = build_runner.rmeta_required(unit);
+        let lock_manager = build_runner.lock_manager.clone();

         let doit = move |diag_dedupe| {
-            let state = JobState::new(id, messages, diag_dedupe, rmeta_required);
+            let state = JobState::new(id, messages, diag_dedupe, rmeta_required, lock_manager);
             state.run_to_finish(job);
         };

```

src/cargo/core/compiler/layout.rs

Lines changed: 15 additions & 5 deletions
```diff
@@ -128,6 +128,7 @@
         target: Option<CompileTarget>,
         dest: &str,
         must_take_artifact_dir_lock: bool,
+        must_take_build_dir_lock_exclusively: bool,
     ) -> CargoResult<Layout> {
         let is_new_layout = ws.gctx().cli_unstable().build_dir_new_layout;
         let mut root = ws.target_dir();
@@ -160,11 +161,20 @@
         {
             None
         } else {
-            Some(build_dest.open_rw_exclusive_create(
-                ".cargo-lock",
-                ws.gctx(),
-                "build directory",
-            )?)
+            if ws.gctx().cli_unstable().fine_grain_locking && !must_take_build_dir_lock_exclusively
+            {
+                Some(build_dest.open_ro_shared_create(
+                    ".cargo-lock",
+                    ws.gctx(),
+                    "build directory",
+                )?)
+            } else {
+                Some(build_dest.open_rw_exclusive_create(
+                    ".cargo-lock",
+                    ws.gctx(),
+                    "build directory",
+                )?)
+            }
         };
         let build_root = build_root.into_path_unlocked();
         let build_dest = build_dest.as_path_unlocked();
```

src/cargo/core/compiler/locking.rs

Lines changed: 108 additions & 0 deletions
```rust
//! This module handles the locking logic during compilation.

use crate::{
    CargoResult,
    core::compiler::{BuildRunner, Unit},
    util::{FileLock, Filesystem},
};
use anyhow::bail;
use std::{
    collections::HashMap,
    fmt::{Display, Formatter},
    path::PathBuf,
    sync::Mutex,
};
use tracing::instrument;

/// A struct to store the lock handles for build units during compilation.
pub struct LockManager {
    locks: Mutex<HashMap<LockKey, FileLock>>,
}

impl LockManager {
    pub fn new() -> Self {
        Self {
            locks: Mutex::new(HashMap::new()),
        }
    }

    /// Takes a shared lock on a given [`Unit`].
    /// This prevents other Cargo instances from compiling (writing) to
    /// this build unit.
    ///
    /// This function returns a [`LockKey`] which can be used to
    /// upgrade/unlock the lock.
    #[instrument(skip_all, fields(key))]
    pub fn lock_shared(
        &self,
        build_runner: &BuildRunner<'_, '_>,
        unit: &Unit,
    ) -> CargoResult<LockKey> {
        let key = LockKey::from_unit(build_runner, unit);
        tracing::Span::current().record("key", key.0.to_str());

        let mut locks = self.locks.lock().unwrap();
        if let Some(lock) = locks.get_mut(&key) {
            lock.file().lock_shared()?;
        } else {
            let fs = Filesystem::new(key.0.clone());
            let lock_msg = format!(
                "{} ({})",
                unit.pkg.name(),
                build_runner.files().unit_hash(unit)
            );
            let lock = fs.open_ro_shared_create(&key.0, build_runner.bcx.gctx, &lock_msg)?;
            locks.insert(key.clone(), lock);
        }

        Ok(key)
    }

    #[instrument(skip(self))]
    pub fn lock(&self, key: &LockKey) -> CargoResult<()> {
        let mut locks = self.locks.lock().unwrap();
        if let Some(lock) = locks.get_mut(&key) {
            lock.file().lock()?;
        } else {
            bail!("lock was not found in lock manager: {key}");
        }

        Ok(())
    }

    /// Downgrades an existing exclusive lock into a shared lock.
    #[instrument(skip(self))]
    pub fn downgrade_to_shared(&self, key: &LockKey) -> CargoResult<()> {
        let mut locks = self.locks.lock().unwrap();
        let Some(lock) = locks.get_mut(key) else {
            bail!("lock was not found in lock manager: {key}");
        };
        lock.file().lock_shared()?;
        Ok(())
    }

    #[instrument(skip(self))]
    pub fn unlock(&self, key: &LockKey) -> CargoResult<()> {
        let mut locks = self.locks.lock().unwrap();
        if let Some(lock) = locks.get_mut(key) {
            lock.file().unlock()?;
        };

        Ok(())
    }
}

#[derive(Debug, Clone, Hash, Eq, PartialEq)]
pub struct LockKey(PathBuf);

impl LockKey {
    fn from_unit(build_runner: &BuildRunner<'_, '_>, unit: &Unit) -> Self {
        Self(build_runner.files().build_unit_lock(unit))
    }
}

impl Display for LockKey {
    fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
        write!(f, "{}", self.0.display())
    }
}
```
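Read together with the `compile()` changes in `src/cargo/core/compiler/mod.rs` below, the four `LockManager` operations run in a fixed order for each dirty unit. A condensed, hypothetical driver (`lock_dance` and `is_dirty` are illustrative names, not code from this PR):

```rust
fn lock_dance(
    lock_manager: &LockManager,
    build_runner: &BuildRunner<'_, '_>,
    unit: &Unit,
    is_dirty: bool,
) -> CargoResult<()> {
    // Shared lock taken before the fingerprint freshness check.
    let key = lock_manager.lock_shared(build_runner, unit)?;
    if is_dirty {
        // Drop the shared lock, then take the exclusive lock inside the job,
        // avoiding the shared -> exclusive upgrade deadlock discussed in the
        // PR description.
        lock_manager.unlock(&key)?;
        lock_manager.lock(&key)?;
        // ... compile the unit ...
        lock_manager.downgrade_to_shared(&key)?;
    }
    // The shared lock is held until the end of the build; the LockManager
    // keeps the FileLock handles alive until it is dropped.
    Ok(())
}
```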

src/cargo/core/compiler/mod.rs

Lines changed: 39 additions & 0 deletions
```diff
@@ -41,6 +41,7 @@
 pub(crate) mod job_queue;
 pub(crate) mod layout;
 mod links;
+mod locking;
 mod lto;
 mod output_depinfo;
 mod output_sbom;
@@ -94,6 +95,7 @@
 use self::output_sbom::build_sbom;
 use self::unit_graph::UnitDep;

 use crate::core::compiler::future_incompat::FutureIncompatReport;
+use crate::core::compiler::locking::LockKey;
 use crate::core::compiler::timings::SectionTiming;
 pub use crate::core::compiler::unit::{Unit, UnitInterner};
 use crate::core::manifest::TargetSourcePath;
@@ -187,6 +189,12 @@
         return Ok(());
     }

+    let lock = if build_runner.bcx.gctx.cli_unstable().fine_grain_locking {
+        Some(build_runner.lock_manager.lock_shared(build_runner, unit)?)
+    } else {
+        None
+    };
+
     // If we are in `--compile-time-deps` and the given unit is not a compile time
     // dependency, skip compiling the unit and jumps to dependencies, which still
     // have chances to be compile time dependencies
@@ -228,6 +236,23 @@
             work.then(link_targets(build_runner, unit, true)?)
         });

+        // If -Zfine-grain-locking is enabled, we wrap the job with an upgrade to an
+        // exclusive lock before starting, then downgrade to a shared lock after the
+        // job is finished.
+        if build_runner.bcx.gctx.cli_unstable().fine_grain_locking && job.freshness().is_dirty()
+        {
+            if let Some(lock) = lock {
+                // Here we unlock the current shared lock to avoid deadlocking with other
+                // cargo processes. Then we configure our compile job to take an exclusive
+                // lock before starting. Once we are done compiling (including both rmeta
+                // and rlib) we downgrade to a shared lock to allow other cargo processes
+                // to read the build unit. We will hold this shared lock for the remainder
+                // of compilation to prevent other cargo processes from re-compiling while
+                // we are still using the unit.
+                build_runner.lock_manager.unlock(&lock)?;
+                job.before(prebuild_lock_exclusive(lock.clone()));
+                job.after(downgrade_lock_to_shared(lock));
+            }
+        }
+
         job
     };
     jobs.enqueue(build_runner, unit, job)?;
@@ -589,6 +614,20 @@
     }
 }

+fn prebuild_lock_exclusive(lock: LockKey) -> Work {
+    Work::new(move |state| {
+        state.lock_exclusive(&lock)?;
+        Ok(())
+    })
+}
+
+fn downgrade_lock_to_shared(lock: LockKey) -> Work {
+    Work::new(move |state| {
+        state.downgrade_to_shared(&lock)?;
+        Ok(())
+    })
+}
+
 /// Link the compiled target (often of form `foo-{metadata_hash}`) to the
 /// final target. This must happen during both "Fresh" and "Compile".
 fn link_targets(
```