Async proactor fixes: TSAN bridge and progress callback starvation #23699
Merged
Proactor backends using kernel-mediated completion (io_uring SQ/CQ rings) have a submit→complete ordering gap invisible to TSAN. The submitter writes operation fields, the kernel processes I/O, and the poll thread reads those fields from a CQE, but TSAN sees no userspace synchronization.
Add a C11 atomic bridge (iree_async_operation_t::tsan_bridge) with release on submit and acquire on completion. TSAN intercepts atomics through compiler instrumentation, making the kernel-provided ordering visible.
Key implementation details:
- iree_async_operation_zero() replaces raw memset for operation reuse, avoiding non-atomic writes to atomic fields (a C11 violation that TSAN flags).
- Phase 1 (SQE fill) records a software_op_mask bitmap of which batch positions are software ops vs kernel ops.
- Phase 3 (post-submit software dispatch) uses the bitmap to skip kernel ops without re-reading operation->type after the TSAN release; the operation is logically kernel-owned after submit.
- CQE processing acquires the bridge before reading any operation fields.
Includes CTS stress tests exercising single-thread reuse, multi-thread contention (4 threads, 8 slots), and high-frequency reuse (2 threads, 2 slots) to validate the bridge under TSAN.
Co-Authored-By: Claude <noreply@anthropic.com>
…primitive
Two related changes for SHM carrier polling support:
Progress callback starvation fix: Previously, proactor backends only forced non-blocking poll when progress callbacks returned > 0 completions. This meant a newly registered callback could be starved until an unrelated I/O event broke the blocking wait. Now backends force non-blocking poll whenever any progress callback is registered (progress_list != NULL). The carrier's idle spin threshold naturally removes the callback when the ring goes quiet, bounding busy-loop duration. Applies to all three backends: io_uring (io_uring_enter), IOCP (GQCS), POSIX (event_set_wait/kqueue/epoll). The POSIX backend also adds a drain of the pending queue after progress callbacks, ensuring operations submitted by callbacks (e.g., notification re-posts) are registered with the event set before the blocking wait.
Notification signal_primitive: Adds a separate signal_primitive field to notifications, distinct from the polled primitive. For local notifications, both are the same eventfd. For shared/proxy notifications (SHM carrier), signal_primitive is the peer's eventfd while primitive may be NONE (not polled locally). This enables asymmetric notification patterns where the signaling end differs from the monitoring end.
Co-Authored-By: Claude <noreply@anthropic.com>
Two fixes for the async proactor infrastructure, both motivated by SHM carrier integration work.
TSAN bridge for kernel-mediated completion ordering
Proactor backends that use kernel-mediated completion queues (io_uring SQ/CQ rings, IOCP completion ports) have an ordering gap invisible to TSAN: the submitter thread writes operation fields, the kernel processes the I/O, and the poll thread reads those fields from a completion event, but TSAN sees no userspace synchronization between the write and the read.
This adds a C11 atomic bridge (`iree_async_operation_t::tsan_bridge`) that makes the kernel-provided ordering visible. The submitter stores with release ordering after filling the operation; the completer loads with acquire ordering before reading operation fields. The field and both macros compile to nothing outside `IREE_SANITIZER_THREAD` builds.
Supporting changes:
- `iree_async_operation_zero()`: Replaces raw `memset` for operation struct reuse. C11 forbids mixing atomic and non-atomic accesses to the same object; `memset` over atomic fields is a non-atomic write that TSAN correctly flags. The new helper zeroes only the subtype-specific tail, leaving the base struct's atomic fields to be set properly by `iree_async_operation_initialize()`.
- `software_op_mask` bitmap in io_uring submit: After Phase 2 submits kernel SQEs, another thread's flush can make them visible to the kernel. A fast-completing kernel op (NOP, expired timer) could have its operation struct reused before Phase 3 iterates past it. Previously Phase 3 re-read `operation->type` to decide whether to skip kernel ops, which is a data race with the new owner's writes. Now Phase 1 records which batch positions are software ops in a bitmap, and Phase 3 uses the bitmap instead of re-reading types.
- CTS stress tests: Three test cases exercise the bridge under TSAN: single-thread rapid reuse (200 cycles), multi-thread contention (4 threads, 8-slot pool), and high-frequency reuse (2 threads, 2-slot pool forcing maximum reuse rate).
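As a rough illustration of the bridge and the bitmap working together, the sketch below models a batch submit and a completion handler. Every `demo_*` name is hypothetical and not taken from the IREE sources, the exact point in Phase 2 where the release happens is an assumption, and the atomic field is kept unconditionally here for simplicity, whereas the PR compiles it out outside `IREE_SANITIZER_THREAD` builds.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical operation types; the real enum is not shown in this PR text.
typedef enum { DEMO_OP_KERNEL = 0, DEMO_OP_SOFTWARE = 1 } demo_op_type_t;

typedef struct demo_operation_t {
  demo_op_type_t type;
  int32_t result;
  // Carries no data; it exists only to create a happens-before edge that
  // TSAN can observe between the submitter and the completer.
  atomic_int tsan_bridge;
} demo_operation_t;

// Submit side: publishes all earlier plain writes to the operation.
static void demo_bridge_release(demo_operation_t* op) {
  atomic_store_explicit(&op->tsan_bridge, 1, memory_order_release);
}

// Completion side: must run before any operation field is read.
static void demo_bridge_acquire(demo_operation_t* op) {
  (void)atomic_load_explicit(&op->tsan_bridge, memory_order_acquire);
}

// Submits up to 64 operations and returns how many software ops Phase 3
// dispatched. Kernel submission itself is elided.
static size_t demo_submit_batch(demo_operation_t** ops, size_t count) {
  assert(count <= 64);
  uint64_t software_op_mask = 0;
  // Phase 1: read types while the submitter still owns every operation.
  for (size_t i = 0; i < count; ++i) {
    if (ops[i]->type == DEMO_OP_SOFTWARE) {
      software_op_mask |= UINT64_C(1) << i;
    }
  }
  // Phase 2: hand kernel ops over. After the release they are logically
  // kernel-owned and may complete and be reused at any time.
  for (size_t i = 0; i < count; ++i) {
    if (!(software_op_mask & (UINT64_C(1) << i))) {
      demo_bridge_release(ops[i]);
    }
  }
  // Phase 3: consult only the mask, so kernel-owned operations are never
  // dereferenced again.
  size_t dispatched = 0;
  for (size_t i = 0; i < count; ++i) {
    if (!(software_op_mask & (UINT64_C(1) << i))) continue;
    ops[i]->result = 0;  // Placeholder for the software completion path.
    ++dispatched;
  }
  return dispatched;
}

// Completion handler: acquire first, then touch operation fields.
static int32_t demo_complete(demo_operation_t* op, int32_t cqe_result) {
  demo_bridge_acquire(op);
  op->result = cqe_result;
  return op->result;
}
```

The design point is that Phase 3 depends only on state captured before the handoff, so a kernel op that completes and is recycled between Phase 2 and Phase 3 cannot be raced against.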
Progress callback starvation fix
All three proactor backends (io_uring, IOCP, POSIX) previously only forced non-blocking poll when progress callbacks returned > 0 completions. A newly registered callback that hadn't yet observed progress would not prevent the proactor from blocking in its platform wait (`io_uring_enter`, `GetQueuedCompletionStatusEx`, `event_set_wait`), starving the callback until an unrelated I/O event broke the wait.
Fix: force non-blocking poll whenever any progress callback is registered (`progress_list != NULL`), not just when one returns progress. The carrier's idle spin threshold naturally removes the callback when the ring goes quiet, bounding the busy-loop duration.
The POSIX backend also adds a pending queue drain after progress callbacks, ensuring that operations submitted by callbacks (e.g., notification re-posts) are registered with the event set before the blocking wait. Without this, their fds wouldn't be monitored and the poll would miss wakeups.
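The difference between the old and new wait policy can be sketched as two small timeout-selection helpers. This is a simplified model under stated assumptions: the `demo_*` types and functions are hypothetical, and a zero timeout stands in for whatever non-blocking mode each backend passes to its platform wait.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical progress callback list node.
typedef struct demo_progress_callback_t {
  struct demo_progress_callback_t* next;
  // Returns the number of completions this callback produced.
  int (*fn)(void* user_data);
  void* user_data;
} demo_progress_callback_t;

// Example callback that is registered but has not observed progress yet.
static int demo_no_progress(void* user_data) {
  (void)user_data;
  return 0;
}

// Runs every registered callback and returns the total completion count.
static int demo_run_progress_callbacks(const demo_progress_callback_t* list) {
  int total = 0;
  for (const demo_progress_callback_t* cb = list; cb != NULL; cb = cb->next) {
    total += cb->fn(cb->user_data);
  }
  return total;
}

// Old policy: poll without blocking only if callbacks reported progress.
static int64_t demo_wait_timeout_before(int progress_count,
                                        int64_t requested_timeout_ns) {
  return progress_count > 0 ? 0 : requested_timeout_ns;
}

// New policy: poll without blocking whenever any callback is registered.
static int64_t demo_wait_timeout_after(
    const demo_progress_callback_t* progress_list,
    int64_t requested_timeout_ns) {
  return progress_list != NULL ? 0 : requested_timeout_ns;
}
```

With a single registered callback that returns 0, the old helper yields the full requested timeout (the starvation case), while the new helper yields 0 until the callback is removed from the list.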
Notification `signal_primitive`
Adds a `signal_primitive` field to the io_uring notification struct, separate from the polled `primitive`. For local notifications both are the same eventfd. For shared/proxy notifications (SHM carrier peer wake), `signal_primitive` is the peer's eventfd while `primitive` may be `NONE` (not polled locally). This enables asymmetric notification patterns where the signaling end differs from the monitoring end.
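A minimal Linux-only sketch of the asymmetric pattern, assuming plain file descriptors stand in for the primitive type. The `demo_*` names and the `DEMO_PRIMITIVE_NONE` sentinel are hypothetical; only `eventfd`, `write`, `read`, and `close` are real APIs.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

// Hypothetical sentinel meaning "nothing is polled locally".
#define DEMO_PRIMITIVE_NONE (-1)

typedef struct demo_notification_t {
  int primitive;         // fd monitored by the local proactor, or NONE.
  int signal_primitive;  // fd written to when the notification is signaled.
} demo_notification_t;

// Local notification: the same eventfd is both signaled and polled.
static demo_notification_t demo_notification_local(int fd) {
  demo_notification_t n = {fd, fd};
  return n;
}

// Proxy notification: signals the peer's eventfd and polls nothing locally.
static demo_notification_t demo_notification_proxy(int peer_fd) {
  demo_notification_t n = {DEMO_PRIMITIVE_NONE, peer_fd};
  return n;
}

// Returns nonzero if the local proactor should monitor this notification.
static int demo_notification_is_polled(const demo_notification_t* n) {
  return n->primitive != DEMO_PRIMITIVE_NONE;
}

// Signals by adding 1 to the eventfd counter. Returns 0 on success.
static int demo_notification_signal(const demo_notification_t* n) {
  uint64_t one = 1;
  ssize_t written = write(n->signal_primitive, &one, sizeof(one));
  return written == (ssize_t)sizeof(one) ? 0 : -1;
}
```

In this model a proxy notification wakes whoever polls the peer's eventfd, while the signaling side never registers anything with its own event set.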