Skip to content

Multicore safety protect shared mutable state with mutexes and atomics#2397

Open
balat wants to merge 17 commits intoeiofrom
multicore-safety
Open

Multicore safety protect shared mutable state with mutexes and atomics#2397
balat wants to merge 17 commits intoeiofrom
multicore-safety

Conversation

@balat
Copy link
Copy Markdown
Member

@balat balat commented Mar 20, 2026

Summary

This PR adds multicore safety to shared mutable state using mutexes and atomics. This fixes at least one multicore test failure (test_concurrent_low crashes when append_exn is not protected by a mutex).

Thread-safety fixes

  • Protect Buffer.t in append_only_file.ml with an Eio.Mutex — the race between append_exn and flush is triggered by test_concurrent_low without the fix
  • Replace mutable suffix/prefix fields in file_manager.ml with Atomic.t to prevent use-after-close during GC swap
  • Protect Hashtbl.t in dict.ml with an Eio.Mutex (guards refill which yields on I/O before mutating)
  • Replace mutable chunks array in chunked_suffix.ml with Atomic.t for consistent snapshots during concurrent reads
  • Fix TOCTOU on during_batch/running_gc checks in store.ml by moving them inside Repo.lock
  • Prevent deadlock when calling GC or split from a batch(~lock:true) callback
  • Replace global ref with Atomic.t in watch.ml
  • Protect global Hashtbl.t with Eio.Mutex in irmin_fs.ml and irmin_mem.ml

Cleanup (from lyrm review checklist on PR #2149)

  • Remove unused Mutate constructor from Metrics.update_mode — incompatible with the Atomic.compare_and_set retry loop, and never instantiated by any caller
  • Remove redundant | e -> raise e exception re-raises in irmin_fs_unix.ml, inode.ml, tree.ml
  • Update outdated Lwt references in indexable_intf.ml and node_intf.ml documentation
  • Migrate irmin.mli code examples (custom_merge, sync) from Lwt to direct-style
  • Improve frightening comments in watch.ml / watch_intf.ml: replace "Big Yikes" with actionable error, document stream_iter limitation, explain set_watch_switch workaround
  • Review all Atomic.set vs compare_and_set usage: no correctness issues found

Not addressed in this PR

  • f () |> function cleanup (~33 occurrences) — mechanical, better as a dedicated PR
  • eio_pool.ml vs Eio.Pool — not redundant (provides post-failure check and pool clearing)
  • Pin-depends removal (index, irmin-watcher) — tracked in separate PRs

Performance

Benchmarked with bench_inline (50k contents, 5k reads), stabilised with taskset -c 0 and pre-built binaries — no measurable regression (write +1.6%, read p50 +1.0%, within noise).

Test plan

  • dune runtest test/irmin-pack/ — 252 tests pass
  • Removing the append_only_file mutex causes test_concurrent_low to crash, confirming the race is real
  • dune build src/irmin/ src/irmin-pack/ src/irmin-fs/ compiles cleanly

Please read each commit individually.

@balat balat changed the title Multicore safety Multicore safety protect shared mutable state with mutexes and atomics Mar 20, 2026
@balat
Copy link
Copy Markdown
Member Author

balat commented Apr 3, 2026

An anlysis of the remaining tasks from @lyrm's list, by Claude, gives the following text:

Notes on lyrm's review checklist (PR #2149)

Items investigated but not changed

  • progress dependency in src/irmin-pack/unix/dune: not redundant. The progress library is actively used in Utils (progress bars), traverse_pack_file.ml (Progress.Line.Using_int63), and checks.ml (Object_counter). No action needed.

  • ?lock parameter in batch: actively used. Only irmin-pack/io/store.ml implements the locking behaviour (Eio.Mutex.use_rw when ~lock:true), and it is called with ~lock:true from Commit.v in store.ml:180. Other backends ignore it (?lock:_), which is fine — they don't need it. Not dead code.

  • eio_pool.ml vs Eio.Pool: not redundant. The custom pool provides two features that Eio.Pool lacks: (1) a check callback for asynchronous validation after a use fails, and (2) a clear function that tracks in-use resources. Used for mkdir_pool (size 1, serialising directory creation) and openfile_pool (size 200, fd reuse) in irmin_fs_unix.ml.

  • Atomic.set vs compare_and_set review: all uses of Atomic.set in src/ are either inside a mutex, unconditional overwrites on caches, or one-shot initialisations. No correctness issue found. pack_key.ml and metrics.ml already correctly use compare_and_set with retry loops where needed.

  • test/irmin-pack/data/version_1_large/README.md: contains Lwt-style code but documents the program that generated the test data with irmin-pack.2.10.2. It is a historical record and should not be updated.

Items deferred to other PRs

  • f () |> functionmatch f () with (~33 occurrences): mechanical cleanup, better as a dedicated PR
  • Pin-depends removal (index, irmin-watcher): tracked separately (see PR Integrate eio watcher #2336 for irmin-watcher)

balat and others added 17 commits April 15, 2026 12:06
Replace mutable suffix/prefix fields with Atomic.t to prevent
use-after-close races during GC swap when concurrent Eio fibers
read these fields via the dispatcher.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add an Eio.Mutex to Dict.t and guard find, index and refill with it.
This prevents concurrent fiber access to the stdlib Hashtbl during
refill (which yields on I/O before mutating the tables).

A custom with_lock helper is used instead of Eio.Mutex.use_rw to
avoid poisoning the mutex when append_exn raises RO_not_allowed
(the exception occurs before any hashtable mutation).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add an Eio.Mutex to rw_perm and guard append_exn and flush with it.
This prevents data loss when a concurrent fiber appends to the buffer
while flush is performing I/O (yield point between Buffer.contents
and Buffer.truncate).

A flush_locked internal variant avoids deadlock when append_exn
triggers an auto-flush while already holding the lock.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace mutable chunks array with Atomic.t so that readers
(find, fold, length) always see a consistent snapshot of the
array, even when add_new_appendable extends it concurrently.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move the Atomic.get checks for during_batch and running_gc inside
the Repo.lock critical section in start and split, so that the
condition cannot change between the check and the lock acquisition.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add early during_batch checks before acquiring Repo.lock in start,
split and cancel. This turns a potential deadlock (when called from
a batch ~lock:true callback) into an explicit error. The checks are
also kept under the lock to close the TOCTOU window.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
listen_dir_hook, watch_switch and workers_r were global refs
accessed without synchronisation. Replace them with Atomic.t
to make them safe under multi-domain access.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add a mutex around the lock_file find-or-create pattern in IO_mem
to prevent concurrent fibers from creating duplicate locks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add a mutex around the find-or-create pattern on the global
cache hashtable to prevent concurrent fibers from creating
duplicate store instances.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Mutate variant applied an in-place mutation which is incompatible
with the Atomic.compare_and_set retry loop used by update. All callers
already use Replace, so simplify the API to take a plain ('a -> 'a)
function.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
These catch-all branches in try/with blocks simply re-raise the
exception and have no effect. Removing them for clarity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace Lwt.return in the index function documentation example
with a direct-style call to match the Eio migration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace "Lwt monad" with "direct-style" in the recursive nodes
documentation to reflect the Eio migration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Migrate the custom_merge and sync code examples to direct-style
(remove open Lwt.Infix, >>=, >|=, Lwt_main.run, Lwt_list).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace "Not sure this is right" with a description of what
stream_iter does and its limitation (None is ignored).
Replace "Big Yikes" failwith with an actionable error message.
Replace "terrible hack" doc with an explanation of why the global
switch exists and what the proper fix would be.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant