Async proactor fixes: TSAN bridge and progress callback starvation #23699

Merged
benvanik merged 2 commits into main from users/benvanik/async-proactor-fixes
Mar 9, 2026

benvanik (Collaborator) commented Mar 9, 2026

Two fixes for the async proactor infrastructure, both motivated by SHM carrier integration work.

TSAN bridge for kernel-mediated completion ordering

Proactor backends that use kernel-managed shared memory rings (io_uring SQ/CQ, IOCP completion ports) have an ordering gap invisible to TSAN: the submitter thread writes operation fields, the kernel processes the I/O, and the poll thread reads those fields from a completion event — but TSAN sees no userspace synchronization between the write and the read.

This adds a C11 atomic bridge (iree_async_operation_t::tsan_bridge) that makes the kernel-provided ordering visible to TSAN. The submitter stores to it with release ordering after filling the operation; the completer loads from it with acquire ordering before reading operation fields. The field and the macros that wrap the store and the load compile to nothing outside IREE_SANITIZER_THREAD builds.
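
A minimal sketch of the release/acquire pairing described above. Only the `tsan_bridge` field name comes from this PR; the `demo_*` types, the `DEMO_*` macros, and the TSAN detection logic are hypothetical stand-ins and are not the IREE definitions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

// Detect ThreadSanitizer builds (GCC defines __SANITIZE_THREAD__, Clang
// exposes __has_feature). IREE uses IREE_SANITIZER_THREAD for this.
#if defined(__SANITIZE_THREAD__)
#define DEMO_SANITIZER_THREAD 1
#elif defined(__has_feature)
#if __has_feature(thread_sanitizer)
#define DEMO_SANITIZER_THREAD 1
#endif
#endif

// Hypothetical stand-in for the base operation struct.
typedef struct demo_operation_t {
  int type;
  void* user_data;
#if defined(DEMO_SANITIZER_THREAD)
  atomic_int tsan_bridge;  // exists only in TSAN builds
#endif
} demo_operation_t;

#if defined(DEMO_SANITIZER_THREAD)
#define DEMO_TSAN_BRIDGE_RELEASE(op) \
  atomic_store_explicit(&(op)->tsan_bridge, 1, memory_order_release)
#define DEMO_TSAN_BRIDGE_ACQUIRE(op) \
  ((void)atomic_load_explicit(&(op)->tsan_bridge, memory_order_acquire))
#else
#define DEMO_TSAN_BRIDGE_RELEASE(op) ((void)(op))
#define DEMO_TSAN_BRIDGE_ACQUIRE(op) ((void)(op))
#endif

// Submitter side: fill the operation, then release before handing it to the
// kernel ring (the ring submission itself is omitted here).
static void demo_submit(demo_operation_t* op, int type, void* user_data) {
  assert(op != NULL);
  op->type = type;
  op->user_data = user_data;
  DEMO_TSAN_BRIDGE_RELEASE(op);
}

// Completer side: acquire before touching any operation field.
static int demo_complete(demo_operation_t* op) {
  assert(op != NULL);
  DEMO_TSAN_BRIDGE_ACQUIRE(op);
  return op->type;
}
```

In a non-TSAN build both macros reduce to `(void)(op)` and the struct has no extra field, which mirrors the "compile to nothing" property stated above.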

Supporting changes:

  • iree_async_operation_zero(): Replaces raw memset for operation struct reuse. C11 forbids mixing atomic and non-atomic accesses to the same object; memset over atomic fields is a non-atomic write that TSAN correctly flags. The new helper zeroes only the subtype-specific tail, leaving the base struct's atomic fields to be set properly by iree_async_operation_initialize().

  • software_op_mask bitmap in io_uring submit: After Phase 2 submits kernel SQEs, another thread's flush can make them visible to the kernel. A fast-completing kernel op (NOP, expired timer) could then have its operation struct reused before Phase 3 (post-submit software dispatch) iterates past it. Previously Phase 3 re-read operation->type to decide whether to skip kernel ops, which is a data race with the new owner's writes. Now Phase 1 (SQE fill) records which batch positions are software ops in a bitmap, and Phase 3 uses the bitmap instead of re-reading types.

  • CTS stress tests: Three test cases exercise the bridge under TSAN — single-thread rapid reuse (200 cycles), multi-thread contention (4 threads, 8-slot pool), and high-frequency reuse (2 threads, 2-slot pool forcing maximum reuse rate).
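
The tail-only zeroing described in the first bullet might look roughly like the sketch below. The real signature of iree_async_operation_zero() is not shown in this PR, so `demo_operation_zero`, its `total_size` parameter, and the `demo_*` structs are assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

// Hypothetical base struct containing an atomic field.
typedef struct demo_operation_t {
  int type;
  atomic_int tsan_bridge;
} demo_operation_t;

// Hypothetical subtype; the base must be the first member so that the
// subtype-specific tail starts at sizeof(demo_operation_t).
typedef struct demo_timer_operation_t {
  demo_operation_t base;
  long long deadline_ns;
  int flags;
} demo_timer_operation_t;

// Zeroes only the bytes after the base struct. The atomic field in the base
// is never touched by memset, so no non-atomic write lands on it.
static void demo_operation_zero(demo_operation_t* operation,
                                size_t total_size) {
  assert(total_size >= sizeof(demo_operation_t));
  memset((char*)operation + sizeof(demo_operation_t), 0,
         total_size - sizeof(demo_operation_t));
}
```

A caller would pass the size of the full subtype, for example `demo_operation_zero(&timer.base, sizeof(timer))`, and then re-run the normal initializer to set the base fields.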

Progress callback starvation fix

All three proactor backends (io_uring, IOCP, POSIX) previously only forced non-blocking poll when progress callbacks returned > 0 completions. A newly registered callback that hadn't yet observed progress would not prevent the proactor from blocking in its platform wait (io_uring_enter, GetQueuedCompletionStatusEx, event_set_wait), starving the callback until an unrelated I/O event broke the wait.

Fix: force non-blocking poll whenever any progress callback is registered (progress_list != NULL), not just when one returns progress. The carrier's idle spin threshold naturally removes the callback when the ring goes quiet, bounding the busy-loop duration.
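
The change in the blocking decision can be sketched as follows. Only the `progress_list != NULL` condition comes from this PR; the `demo_*` types, the function names, and the convention that a timeout of 0 means "do not block" are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical intrusive list of registered progress callbacks.
typedef struct demo_progress_entry_t {
  struct demo_progress_entry_t* next;
} demo_progress_entry_t;

typedef struct demo_proactor_t {
  demo_progress_entry_t* progress_list;  // NULL when nothing is registered
} demo_proactor_t;

// Old rule: only skip the blocking wait if callbacks reported completions.
static int64_t demo_wait_timeout_old(int progress_count,
                                     int64_t requested_timeout_ns) {
  return progress_count > 0 ? 0 : requested_timeout_ns;
}

// New rule: skip the blocking wait whenever any callback is registered,
// even if it has not reported progress yet.
static int64_t demo_wait_timeout_new(const demo_proactor_t* proactor,
                                     int64_t requested_timeout_ns) {
  assert(proactor != NULL);
  return proactor->progress_list != NULL ? 0 : requested_timeout_ns;
}
```

With a freshly registered callback that has reported zero completions, the old rule returns the caller's timeout (and may block), while the new rule returns 0.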

The POSIX backend also adds a pending queue drain after progress callbacks, ensuring that operations submitted by callbacks (e.g., notification re-posts) are registered with the event set before the blocking wait. Without this, their fds wouldn't be monitored and the poll would miss wakeups.
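
A toy model of the ordering described above: callbacks run, the pending queue is drained into the monitored set, and only then would the backend block. Every name here is hypothetical, and plain arrays stand in for the real pending queue and event set.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_CAPACITY 8

typedef struct demo_proactor_t {
  int pending_fds[DEMO_CAPACITY];    // submitted, not yet monitored
  int pending_count;
  int monitored_fds[DEMO_CAPACITY];  // registered with the event set
  int monitored_count;
} demo_proactor_t;

typedef void (*demo_progress_fn_t)(demo_proactor_t* proactor);

// Queues an operation's fd; it is not monitored until the next drain.
static void demo_submit(demo_proactor_t* proactor, int fd) {
  assert(proactor->pending_count < DEMO_CAPACITY);
  proactor->pending_fds[proactor->pending_count++] = fd;
}

// Moves every pending fd into the monitored set.
static void demo_drain_pending(demo_proactor_t* proactor) {
  for (int i = 0; i < proactor->pending_count; ++i) {
    assert(proactor->monitored_count < DEMO_CAPACITY);
    proactor->monitored_fds[proactor->monitored_count++] =
        proactor->pending_fds[i];
  }
  proactor->pending_count = 0;
}

// One poll iteration up to (but not including) the blocking wait.
static void demo_poll_prepare(demo_proactor_t* proactor,
                              demo_progress_fn_t callback) {
  demo_drain_pending(proactor);  // submissions made before this poll
  if (callback != NULL) callback(proactor);  // may submit new operations
  demo_drain_pending(proactor);  // second drain: picks up callback submissions
  // A blocking wait on monitored_fds would follow here.
}

// Example callback that re-posts a notification on a made-up fd.
static void demo_repost_callback(demo_proactor_t* proactor) {
  demo_submit(proactor, 42);
}
```

Without the second drain, fd 42 would still sit in `pending_fds` when the wait begins, which is the missed-wakeup case the paragraph describes.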

Notification signal_primitive

Adds a signal_primitive field to the io_uring notification struct, separate from the polled primitive. For local notifications both are the same eventfd. For shared/proxy notifications (SHM carrier peer wake), signal_primitive is the peer's eventfd while primitive may be NONE (not polled locally). This enables asymmetric notification patterns where the signaling end differs from the monitoring end.
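
The local versus shared/proxy split can be illustrated with a small sketch. The field names `primitive` and `signal_primitive` come from this PR; the `demo_*` types, the constructors, and the use of -1 for "no fd" are assumptions, and no real eventfd is opened.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum demo_primitive_type_t {
  DEMO_PRIMITIVE_NONE = 0,
  DEMO_PRIMITIVE_EVENTFD = 1,
} demo_primitive_type_t;

typedef struct demo_primitive_t {
  demo_primitive_type_t type;
  int fd;  // -1 when type is NONE
} demo_primitive_t;

typedef struct demo_notification_t {
  demo_primitive_t primitive;         // monitored by the local poll loop
  demo_primitive_t signal_primitive;  // written to when signaling
} demo_notification_t;

// Local notification: the same eventfd is both polled and signaled.
static demo_notification_t demo_notification_local(int eventfd) {
  demo_primitive_t p = {DEMO_PRIMITIVE_EVENTFD, eventfd};
  demo_notification_t n = {p, p};
  return n;
}

// Shared/proxy notification: signal the peer's eventfd, poll nothing locally.
static demo_notification_t demo_notification_proxy(int peer_eventfd) {
  demo_primitive_t none = {DEMO_PRIMITIVE_NONE, -1};
  demo_primitive_t peer = {DEMO_PRIMITIVE_EVENTFD, peer_eventfd};
  demo_notification_t n = {none, peer};
  return n;
}

static bool demo_notification_is_polled(const demo_notification_t* n) {
  return n->primitive.type != DEMO_PRIMITIVE_NONE;
}

// Returns the fd a signal should be written to.
static int demo_notification_signal_fd(const demo_notification_t* n) {
  assert(n->signal_primitive.type != DEMO_PRIMITIVE_NONE);
  return n->signal_primitive.fd;
}
```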

benvanik and others added 2 commits March 8, 2026 19:17
Proactor backends using kernel-mediated completion (io_uring SQ/CQ rings)
have a submit→complete ordering gap invisible to TSAN. The submitter writes
operation fields, the kernel processes I/O, and the poll thread reads those
fields from a CQE — but TSAN sees no userspace synchronization.

Add a C11 atomic bridge (iree_async_operation_t::tsan_bridge) with
release on submit and acquire on completion. TSAN intercepts atomics
through compiler instrumentation, making the kernel-provided ordering
visible.

Key implementation details:
- iree_async_operation_zero() replaces raw memset for operation reuse,
  avoiding non-atomic writes to atomic fields (C11 violation TSAN flags).
- Phase 1 (SQE fill) records a software_op_mask bitmap of which batch
  positions are software ops vs kernel ops.
- Phase 3 (post-submit software dispatch) uses the bitmap to skip kernel
  ops without re-reading operation->type after the TSAN release — the
  operation is logically kernel-owned after submit.
- CQE processing acquires the bridge before reading any operation fields.
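
The Phase 1 / Phase 3 split in the list above can be sketched as below. The `software_op_mask` name is from the commit; the 64-entry batch limit, the `demo_*` names, and the output array are assumptions made to keep the example small.

```c
#include <assert.h>
#include <stdint.h>

enum { DEMO_OP_KERNEL = 0, DEMO_OP_SOFTWARE = 1 };

typedef struct demo_operation_t {
  int type;
} demo_operation_t;

// Phase 1 (SQE fill): the submitter still owns every operation, so reading
// `type` is safe. Record which batch positions are software ops.
static uint64_t demo_phase1_record(demo_operation_t* const* batch,
                                   int count) {
  assert(count >= 0 && count <= 64);
  uint64_t software_op_mask = 0;
  for (int i = 0; i < count; ++i) {
    if (batch[i]->type == DEMO_OP_SOFTWARE) {
      software_op_mask |= (uint64_t)1 << i;
    }
    // A real backend would fill an SQE here for kernel ops.
  }
  return software_op_mask;
}

// Phase 3 (post-submit software dispatch): decide from the mask alone.
// Kernel-op positions are skipped without dereferencing the operation,
// which may already belong to a new owner. Returns the number of software
// positions written to `out_positions`.
static int demo_phase3_select(uint64_t software_op_mask, int count,
                              int* out_positions) {
  int selected = 0;
  for (int i = 0; i < count; ++i) {
    if ((software_op_mask & ((uint64_t)1 << i)) == 0) continue;
    out_positions[selected++] = i;
  }
  return selected;
}
```

Because Phase 3 never reads `batch[i]->type`, a kernel op that completes and is reused between the two phases cannot change which positions get dispatched in software.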

Includes CTS stress tests exercising single-thread reuse, multi-thread
contention (4 threads, 8 slots), and high-frequency reuse (2 threads,
2 slots) to validate the bridge under TSAN.

Co-Authored-By: Claude <noreply@anthropic.com>

…primitive

Two related changes for SHM carrier polling support:

Progress callback starvation fix: Previously, proactor backends only
forced non-blocking poll when progress callbacks returned > 0 completions.
This meant a newly registered callback could be starved until an unrelated
I/O event broke the blocking wait. Now backends force non-blocking poll
whenever any progress callback is registered (progress_list != NULL). The
carrier's idle spin threshold naturally removes the callback when the ring
goes quiet, bounding busy-loop duration.

Applies to all three backends: io_uring (io_uring_enter), IOCP (GQCS),
POSIX (event_set_wait/kqueue/epoll).

POSIX backend also adds a drain of the pending queue after progress
callbacks, ensuring operations submitted by callbacks (e.g., notification
re-posts) are registered with the event set before the blocking wait.

Notification signal_primitive: Adds a separate signal_primitive field to
notifications, distinct from the polled primitive. For local notifications,
both are the same eventfd. For shared/proxy notifications (SHM carrier),
signal_primitive is the peer's eventfd while primitive may be NONE (not
polled locally). This enables asymmetric notification patterns where the
signaling end differs from the monitoring end.

Co-Authored-By: Claude <noreply@anthropic.com>

benvanik added the runtime label ("Relating to the IREE runtime library") Mar 9, 2026
benvanik marked this pull request as ready for review March 9, 2026 05:38
benvanik added the post-merge-review label ("Ben's special place. People can pick these up and review them for forward fixes if interested.") Mar 9, 2026
benvanik requested a review from stellaraccident March 9, 2026 05:38
benvanik merged commit 2604130 into main Mar 9, 2026
59 of 61 checks passed
benvanik deleted the users/benvanik/async-proactor-fixes branch March 9, 2026 06:05