Skip to content

Add TSan stress suite (PR B for TSan CI)#2403

Open
cuihtlauac wants to merge 4 commits intomirage:eiofrom
cuihtlauac:tsan-stress
Open

Add TSan stress suite (PR B for TSan CI)#2403
cuihtlauac wants to merge 4 commits intomirage:eiofrom
cuihtlauac:tsan-stress

Conversation

@cuihtlauac
Copy link
Copy Markdown

@cuihtlauac cuihtlauac commented Apr 24, 2026

Summary

Fills in the @tsan-stress alias shipped empty by #2402 (PR A). Five per-hotspot race-hunting scenarios land under test/irmin-pack/test_tsan_stress/, each targeting mutable state that the existing multicore + QCheck-STM suites do not exercise across domains.

Stacked on #2402. Target is eio; please merge #2402 first. This branch contains #2402's commit plus two more.

Scenarios and what they produce under TSan

Scenario Hotspot Expected TSan output
stress_mem_cache src/irmin/mem/irmin_mem.ml:44,51 Clean data-race warning (WARNING: ThreadSanitizer) pointing at Hashtbl.add in the global cache
stress_watch src/irmin/watch.ml:33 Clean data-race warning pointing at the listen_dir_hook := assignment
stress_ao_buf src/irmin-pack/io/append_only_file.ml (Buffer.t in rw_perm) SEGV (TSan signal handler) — the Buffer corruption is fast enough that TSan aborts before the race warning lands. The crash itself is evidence of the race.
stress_dict src/irmin-pack/io/dict.ml:25-31 SEGV on the same pattern (two unguarded Hashtbl.t + append path).
stress_fs_pool src/irmin-fs/unix/irmin_fs_unix.ml:75,87 (Eio_pool) "Nested bug, aborting" — TSan unable to unwind through the Eio scheduler/pool interaction. Demonstrates the problem without pinpointing it.

The @tsan-stress alias runs each in its own process (main.exe <name> || true chained), so one scenario's crash does not suppress the others.

Correspondence with #2397

After locally merging #2397 on top of eio + this PR, re-running the suite gave the following result (reported with IRMIN_TSAN_STRESS_ITER up to 500):

Scenario #2397 fix commit Outcome with #2397 applied
stress_watch d050ed73e0 Replace global refs with Atomic.t in watch.ml Fixed — zero TSan findings.
stress_mem_cache 289e6272ff Protect in-memory store cache with a mutex in irmin_mem Partially fixed. The Eio.Mutex around the global cache Hashtbl eliminates races at irmin_mem.ml:51, but the per-instance mutable t : KMap.t field is still unsynchronised. Warnings now appear at irmin_mem.ml:72 (Read_only.find), :86 (Append_only.add), :136:142 (Atomic_write.test_and_set). Downstream symptom: concurrent Store.set_exn still produces Irmin.Tree.find_tree.findv: encountered dangling hash and set_exn: Failure after 13 attempts to retry.
stress_ao_buf b690c5b2b1 Protect append-only file buffer with a mutex Inconclusive. The mutex is in place and the hot path is serialised, but TSan still exits via SEGV on every run. Could be (a) a residual race outside the lock's scope, or (b) a TSan-on-OCaml-5 shadow-memory issue. Needs a narrower reproducer to distinguish.
stress_dict 788ad16a7b Protect dict hashtables with a mutex Inconclusive for the same reason as stress_ao_buf.
stress_fs_pool ec0c10e619 protects Irmin_fs.IO_mem but not Irmin_fs_unix's mkdir_pool / openfile_pool Unchanged — "nested bug, aborting" still fires. #2397's scope does not include irmin_fs_unix.

Takeaway: #2397 is necessary but not sufficient. Two concrete gaps remain (irmin_mem KMap, irmin_fs_unix pools) plus two cases where the fix is applied but TSan output remains unclear. Tracking the remainder in a follow-up issue (#2404).

Workflow counts SEGV-on-race as a finding

.github/workflows/tsan.yml summary step now tallies both WARNING: ThreadSanitizer and ERROR: ThreadSanitizer occurrences. Reports from both cwd/tsan-report.* and the dune build dir (_build/default/test/irmin-pack/test_tsan_stress/tsan-report.*) are uploaded to the workflow artifact.

Development recipe

Reproducible locally via the .devcontainer/devcontainer.json added in the first commit:

docker run --privileged --rm -v $PWD:/workspace -w /workspace \
  ghcr.io/tarides/ocaml-devcontainer-tsan:latest bash
# inside:
opam switch ocaml+tsan && eval $(opam env)
opam install -y --deps-only --no-depexts irmin irmin-pack irmin-fs
opam install -y eio_main
dune build test/irmin-pack/test_tsan_stress/main.exe
EIO_BACKEND=posix TSAN_OPTIONS='halt_on_error=0 history_size=7 log_path=tsan-report' \
  dune build @test/irmin-pack/test_tsan_stress/tsan-stress --force

Or VS Code → "Reopen in Container" with the committed devcontainer config.

What this does not do

Context: #2401 Phase 6.

Test plan

  • Add tsan label to this PR → workflow runs → five scenarios each produce a tsan-report.<pid> artifact.
  • Verify the step summary reports data-race warnings: N (≥2 — mem and watch) and signal/SEGV-on-race errors: M (≥2 — ao, dict).
  • Open the tsan-reports-<run> artifact; confirm mem and watch reports cite irmin_mem.ml:51 and watch.ml:33 respectively in the stack trace.
  • Merge Add TSan nightly CI workflow #2402 first, then this, then confirm the next nightly run still green-path succeeds (builds and uploads, regardless of finding count).
  • After Multicore safety protect shared mutable state with mutexes and atomics #2397 merges: confirm watch disappears from the nightly report, and follow-up issue tracks the remaining gaps.

🤖 Generated with Claude Code

cuihtlauac and others added 4 commits April 24, 2026 08:47
New .github/workflows/tsan.yml runs the test suite under
ThreadSanitizer on a nightly cron (plus manual and opt-in `tsan`
label on PRs). Runs the default suite alongside a new @tsan-stress
alias, with halt_on_error=0 so a single run surfaces every race
TSan observes. Reports are uploaded as a workflow artifact.

The existing multicore and QCheck-STM tests become scalable via
env vars: IRMIN_STM_ITER, IRMIN_STM_PACK_ITER, IRMIN_MULTICORE_DOMAINS,
IRMIN_MULTICORE_ITER. Defaults match prior behaviour, so normal
`dune runtest` is unchanged.

The @tsan-stress alias (test/irmin-pack/test_tsan_stress/) ships as
an empty dispatcher; per-hotspot scenarios (dict refill, irmin_mem
cache, watch globals, fs pool, append_only_file buffer) land in a
follow-up PR.

This adds detection only; no src/ changes. Expect the first nightly
run to surface several known races from mirage#2397 — that output is the
baseline for follow-up fixes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dune @fmt / ocaml-ci's lint-fmt rejected two short match/if
bindings that fit on a single line. No semantic change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pins reproduction recipe for TSan race-hunting to the tarides
ocaml-devcontainer-tsan image (OCaml 5.4 with a pre-built +tsan
switch). VS Code "Reopen in Container" or `devcontainer up` then
`opam install --deps-only .` is enough to iterate on the
@tsan-stress suite locally, instead of building a +tsan switch
from scratch.

The CI workflow (.github/workflows/tsan.yml) is authoritative and
does not use this config — the devcontainer is dev convenience
only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five new scenarios under test/irmin-pack/test_tsan_stress/ target
mutable state hotspots that the existing multicore + QCheck-STM
tests do not exercise across domains:

- stress_mem_cache: races the global Hashtbl.t cache captured by
  Irmin_mem.Read_only.v (irmin_mem.ml:44) and the KMap mutable in
  the Read_only instance.
- stress_watch:     races the listen_dir_hook ref in watch.ml:28-29.
- stress_ao_buf:    races the unguarded Buffer.t in the rw_perm of
  Append_only_file (append_only_file.ml).
- stress_dict:      races the two unguarded Hashtbl.t caches plus
  last_refill_offset in dict.ml.
- stress_fs_pool:   races the shared Eio_pool instances in
  irmin_fs_unix.ml (mkdir_pool, openfile_pool).

mem and watch produce clean TSan data-race warnings pointing at the
exact hotspot; ao, dict and fs cause memory corruption fast enough
that TSan fires via SEGV before it can write a detailed report —
the SEGV itself is the race signal.

The @tsan-stress dune alias runs each scenario in its own process
via `|| true` so that a race-induced crash in one does not suppress
the others. The tsan.yml workflow counts both "WARNING:
ThreadSanitizer" and "ERROR: ThreadSanitizer" as findings, and
includes the @tsan-stress build-dir reports in the uploaded
artifact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant