Multicore safety: protect shared mutable state with mutexes and atomics #2397
An analysis of the remaining tasks from @lyrm's list, by Claude, gives the following text:
Notes on lyrm's review checklist (PR #2149)
Items investigated but not changed
Items deferred to other PRs
Replace mutable suffix/prefix fields with Atomic.t to prevent use-after-close races during GC swap when concurrent Eio fibers read these fields via the dispatcher. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add an Eio.Mutex to Dict.t and guard find, index and refill with it. This prevents concurrent fiber access to the stdlib Hashtbl during refill (which yields on I/O before mutating the tables). A custom with_lock helper is used instead of Eio.Mutex.use_rw to avoid poisoning the mutex when append_exn raises RO_not_allowed (the exception occurs before any hashtable mutation). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
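The lock helper described in the commit above could look roughly like the sketch below. It is only an illustration: `Stdlib.Mutex` (OCaml 5) stands in for `Eio.Mutex`, and `with_lock`, `m`, `failed` and `after` are hypothetical names rather than the PR's actual code. The point it shows is that the lock is released when the guarded function raises, and the mutex remains usable afterwards.

```ocaml
(* Stand-in sketch: Stdlib.Mutex (OCaml 5) replaces Eio.Mutex here. *)
exception RO_not_allowed

(* Hypothetical helper: run [f] with [m] held, always releasing [m]. *)
let with_lock m f =
  Mutex.lock m;
  (* The lock is released whether [f] returns or raises. *)
  Fun.protect ~finally:(fun () -> Mutex.unlock m) f

let m = Mutex.create ()

(* The exception escapes to the caller... *)
let failed =
  match with_lock m (fun () -> raise RO_not_allowed) with
  | () -> false
  | exception RO_not_allowed -> true

(* ...and the same mutex can still be taken afterwards. *)
let after = with_lock m (fun () -> 42)
```

This is only safe when the exception is raised before any protected state is modified, which is the situation the commit message describes.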
Add an Eio.Mutex to rw_perm and guard append_exn and flush with it. This prevents data loss when a concurrent fiber appends to the buffer while flush is performing I/O (yield point between Buffer.contents and Buffer.truncate). A flush_locked internal variant avoids deadlock when append_exn triggers an auto-flush while already holding the lock. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
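The locked/unlocked split mentioned in the commit above can be sketched as follows. This is a simplified model under stated assumptions, not the `append_only_file.ml` code: `Stdlib.Mutex` stands in for `Eio.Mutex`, the `flushed` list stands in for data written to disk, and the record type, `create` and `auto_flush_threshold` are hypothetical.

```ocaml
(* Hypothetical append-only buffer; Stdlib.Mutex stands in for Eio.Mutex. *)
type t = {
  lock : Mutex.t;
  buf : Buffer.t;
  auto_flush_threshold : int;
  mutable flushed : string list; (* stands in for data written to disk *)
}

let create threshold =
  { lock = Mutex.create ();
    buf = Buffer.create 16;
    auto_flush_threshold = threshold;
    flushed = [] }

let with_lock t f =
  Mutex.lock t.lock;
  Fun.protect ~finally:(fun () -> Mutex.unlock t.lock) f

(* Internal variant: the caller must already hold [t.lock]. *)
let flush_locked t =
  if Buffer.length t.buf > 0 then begin
    t.flushed <- Buffer.contents t.buf :: t.flushed;
    Buffer.clear t.buf
  end

(* Public entry point: takes the lock, then delegates. *)
let flush t = with_lock t (fun () -> flush_locked t)

let append_exn t s =
  with_lock t (fun () ->
      Buffer.add_string t.buf s;
      (* Calling [flush] here would try to take the lock a second time. *)
      if Buffer.length t.buf >= t.auto_flush_threshold then flush_locked t)
```

Because the mutex is not re-entrant, the auto-flush path must call `flush_locked` rather than `flush`.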
Replace mutable chunks array with Atomic.t so that readers (find, fold, length) always see a consistent snapshot of the array, even when add_new_appendable extends it concurrently. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
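The snapshot idea behind the commit above can be illustrated with a small sketch. It is not the `chunked_suffix.ml` code: the `chunked` type, `add_chunk` and `total_length` are hypothetical names, chunks are modelled as strings, and a single appender is assumed.

```ocaml
(* Hypothetical stand-in for a chunked suffix: the array is never mutated
   in place; an extended copy is published atomically instead. *)
type chunked = { chunks : string array Atomic.t }

let create () = { chunks = Atomic.make [||] }

(* Assumes a single appender: build the extended copy, then publish it. *)
let add_chunk t c =
  let old = Atomic.get t.chunks in
  Atomic.set t.chunks (Array.append old [| c |])

(* Readers take one snapshot and only use that snapshot afterwards, so
   they never observe a partially extended array. *)
let total_length t =
  let snapshot = Atomic.get t.chunks in
  Array.fold_left (fun acc c -> acc + String.length c) 0 snapshot
```

A reader that obtained its snapshot before `add_chunk` ran keeps working on the old array, which stays valid and unchanged.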
Move the Atomic.get checks for during_batch and running_gc inside the Repo.lock critical section in start and split, so that the condition cannot change between the check and the lock acquisition. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add early during_batch checks before acquiring Repo.lock in start, split and cancel. This turns a potential deadlock (when called from a batch ~lock:true callback) into an explicit error. The checks are also kept under the lock to close the TOCTOU window. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
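The combination of an early check and a re-check under the lock, as described in the two commits above, might look like this sketch. The `repo` record, `start_gc` and the error variants are hypothetical, and `Stdlib.Mutex` stands in for `Repo.lock`; the real `start`, `split` and `cancel` functions do considerably more.

```ocaml
(* Hypothetical repo state; Stdlib.Mutex stands in for the PR's Repo.lock. *)
type repo = {
  lock : Mutex.t;
  during_batch : bool Atomic.t;
  running_gc : bool Atomic.t;
}

let create () =
  { lock = Mutex.create ();
    during_batch = Atomic.make false;
    running_gc = Atomic.make false }

let start_gc repo =
  (* Early check: fail fast instead of blocking on a lock that the
     surrounding batch may already hold. *)
  if Atomic.get repo.during_batch then Error `Batch_in_progress
  else begin
    Mutex.lock repo.lock;
    Fun.protect
      ~finally:(fun () -> Mutex.unlock repo.lock)
      (fun () ->
        (* Re-check under the lock so the state cannot change between
           the check and the action. *)
        if Atomic.get repo.during_batch then Error `Batch_in_progress
        else if Atomic.get repo.running_gc then Error `Gc_already_running
        else begin
          Atomic.set repo.running_gc true;
          Ok ()
        end)
  end
```

The early check alone would leave the TOCTOU window open, and the locked check alone would leave the deadlock; the sketch keeps both for that reason.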
listen_dir_hook, watch_switch and workers_r were global refs accessed without synchronisation. Replace them with Atomic.t to make them safe under multi-domain access. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add a mutex around the lock_file find-or-create pattern in IO_mem to prevent concurrent fibers from creating duplicate locks. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add a mutex around the find-or-create pattern on the global cache hashtable to prevent concurrent fibers from creating duplicate store instances. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
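The find-or-create pattern guarded in the two commits above can be sketched as below. The cache contents, `find_or_create` and the `created` counter are hypothetical, and `Stdlib.Mutex` stands in for the Eio mutex; the sketch only shows that lookup and insertion happen in one critical section.

```ocaml
(* Hypothetical global cache; values are modelled as [int ref]. *)
let cache : (string, int ref) Hashtbl.t = Hashtbl.create 8
let cache_lock = Mutex.create ()

(* Counts how many instances were actually created. *)
let created = ref 0

let find_or_create name =
  Mutex.lock cache_lock;
  Fun.protect
    ~finally:(fun () -> Mutex.unlock cache_lock)
    (fun () ->
      (* Lookup and insertion share one critical section, so two callers
         cannot both miss and both create an instance. *)
      match Hashtbl.find_opt cache name with
      | Some v -> v
      | None ->
          incr created;
          let v = ref 0 in
          Hashtbl.add cache name v;
          v)
```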
The Mutate variant applied an in-place mutation which is incompatible
with the Atomic.compare_and_set retry loop used by update. All callers
already use Replace, so simplify the API to take a plain ('a -> 'a)
function.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
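To see why a plain `'a -> 'a` function fits a compare-and-set retry loop, consider the sketch below. The `update` function here is a generic illustration written for this note, not the `Metrics` implementation, and it needs OCaml 5 for `Domain`.

```ocaml
(* Generic CAS retry loop: [f] may run several times for one update, so
   it must compute a new value rather than mutate the old one. *)
let rec update (cell : 'a Atomic.t) (f : 'a -> 'a) : unit =
  let old = Atomic.get cell in
  let next = f old in
  (* If another domain changed [cell] in the meantime, try again. *)
  if not (Atomic.compare_and_set cell old next) then update cell f

let counter = Atomic.make 0

(* Two domains increment the same counter concurrently. *)
let () =
  let worker () =
    for _ = 1 to 1_000 do
      update counter (fun n -> n + 1)
    done
  in
  let d1 = Domain.spawn worker in
  let d2 = Domain.spawn worker in
  Domain.join d1;
  Domain.join d2
```

An in-place mutation would be applied again on every retry, and it would change the value other domains are reading, which is why it does not combine with this loop.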
These catch-all branches in try/with blocks simply re-raise the exception and have no effect. Removing them for clarity. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
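A small sketch of the pattern removed by the commit above, with hypothetical function names, shows that both versions return the same values and propagate the same exceptions:

```ocaml
(* With the redundant catch-all branch. *)
let with_reraise f = try f () with Not_found -> -1 | e -> raise e

(* Without it: unmatched exceptions propagate on their own. *)
let without_reraise f = try f () with Not_found -> -1

(* Helper to observe either the result or the escaping exception. *)
let outcome handler f =
  match handler f with v -> Ok v | exception e -> Error e
```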
Replace Lwt.return in the index function documentation example with a direct-style call to match the Eio migration. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace "Lwt monad" with "direct-style" in the recursive nodes documentation to reflect the Eio migration. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Migrate the custom_merge and sync code examples to direct-style (remove open Lwt.Infix, >>=, >|=, Lwt_main.run, Lwt_list). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace "Not sure this is right" with a description of what stream_iter does and its limitation (None is ignored). Replace "Big Yikes" failwith with an actionable error message. Replace "terrible hack" doc with an explanation of why the global switch exists and what the proper fix would be. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
This PR adds multicore safety to shared mutable state using mutexes and atomics. This fixes at least one multicore test failure (`test_concurrent_low` crashes when `append_exn` is not protected by a mutex).
Thread-safety fixes
- Protect `Buffer.t` in `append_only_file.ml` with an `Eio.Mutex`: the race between `append_exn` and `flush` is triggered by `test_concurrent_low` without the fix
- Replace `mutable suffix/prefix` fields in `file_manager.ml` with `Atomic.t` to prevent use-after-close during GC swap
- Protect `Hashtbl.t` in `dict.ml` with an `Eio.Mutex` (guards `refill`, which yields on I/O before mutating)
- Replace `mutable chunks` array in `chunked_suffix.ml` with `Atomic.t` for consistent snapshots during concurrent reads
- Close the TOCTOU window on the `during_batch`/`running_gc` checks in `store.ml` by moving them inside `Repo.lock`
- Report an explicit error instead of deadlocking when called from a `batch(~lock:true)` callback
- Replace global `ref` with `Atomic.t` in `watch.ml`
- Protect global `Hashtbl.t` with `Eio.Mutex` in `irmin_fs.ml` and `irmin_mem.ml`
Cleanup (from lyrm review checklist on PR #2149)
- Remove `Mutate` constructor from `Metrics.update_mode`: incompatible with the `Atomic.compare_and_set` retry loop, and never instantiated by any caller
- Remove `| e -> raise e` exception re-raises in `irmin_fs_unix.ml`, `inode.ml`, `tree.ml`
- Update `indexable_intf.ml` and `node_intf.ml` documentation to direct-style
- Migrate `irmin.mli` code examples (custom_merge, sync) from Lwt to direct-style
- `watch.ml`/`watch_intf.ml`: replace "Big Yikes" with actionable error, document `stream_iter` limitation, explain `set_watch_switch` workaround
- `Atomic.set` vs `compare_and_set` usage: no correctness issues found
Not addressed in this PR
- `f () |> function` cleanup (~33 occurrences): mechanical, better as a dedicated PR
- `eio_pool.ml` vs `Eio.Pool`: not redundant (provides post-failure check and pool clearing)
- `index`, `irmin-watcher`: tracked in separate PRs
Performance
Benchmarked with `bench_inline` (50k contents, 5k reads), stabilised with `taskset -c 0` and pre-built binaries: no measurable regression (write +1.6%, read p50 +1.0%, within noise).
Test plan
- `dune runtest test/irmin-pack/`: 252 tests pass
- Removing the `append_only_file` mutex causes `test_concurrent_low` to crash, confirming the race is real
- `dune build src/irmin/ src/irmin-pack/ src/irmin-fs/` compiles cleanly
Please read each commit individually.