
Async and base infrastructure for cross-process shared memory. #23688

Merged
benvanik merged 20 commits into main from users/benvanik/async-improvements on Mar 7, 2026

Conversation

Collaborator

@benvanik benvanik commented Mar 7, 2026

This is the first half of the remote HAL transport stack, split out to land independently. Everything here is in iree/async/, iree/base/, and iree/testing/ — no net or HAL changes yet.

Shared memory primitives

Adds iree_shm_* for cross-platform shared memory create/open/close/map with handle passing support (memfd on Linux, shm_open on macOS, CreateFileMappingW on Windows). Includes memory sealing via iree_shm_seal() for making regions immutable after population — uses kernel-level F_SEAL_* on Linux, VirtualProtect defense-in-depth on Windows, unavailable on macOS.

On top of that sits a lock-free SPSC queue (iree_spsc_queue_*) designed to operate on caller-provided memory so it can live in a shared memory region. It uses monotonically increasing 64-bit positions with acquire-release ordering, cache-line isolation between producer and consumer fields, and skip markers so no entry is ever split across the wrap boundary. This is the data plane for the SHM carrier.
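As an illustration of the position and ordering scheme described above, here is a minimal sketch of an SPSC ring over caller-provided memory. It is not the iree_spsc_queue_* ABI: the demo_* names, the fixed-size uint32_t slots, and the 64-byte cache line are assumptions made for the example, and the real queue stores variable-length entries.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CACHE_LINE 64  // Assumed cache line size.
#define DEMO_SLOT_COUNT 8   // Must be a power of two.

// Hypothetical layout; producer and consumer fields sit on separate cache
// lines so the two sides never write to the same line.
typedef struct demo_spsc_t {
  _Alignas(DEMO_CACHE_LINE) _Atomic uint64_t write_position;  // Producer-owned.
  _Alignas(DEMO_CACHE_LINE) _Atomic uint64_t read_position;   // Consumer-owned.
  _Alignas(DEMO_CACHE_LINE) uint32_t slots[DEMO_SLOT_COUNT];
} demo_spsc_t;

// Initializes a queue in caller-provided, suitably aligned memory.
static demo_spsc_t* demo_spsc_initialize(void* memory) {
  assert(memory != NULL);
  demo_spsc_t* queue = (demo_spsc_t*)memory;
  atomic_init(&queue->write_position, 0);
  atomic_init(&queue->read_position, 0);
  return queue;
}

// Producer side. Positions only ever increase; the slot index is derived by
// masking, so full/empty checks are simple subtractions.
static bool demo_spsc_push(demo_spsc_t* queue, uint32_t value) {
  uint64_t write =
      atomic_load_explicit(&queue->write_position, memory_order_relaxed);
  uint64_t read =
      atomic_load_explicit(&queue->read_position, memory_order_acquire);
  if (write - read == DEMO_SLOT_COUNT) return false;  // Full.
  queue->slots[write & (DEMO_SLOT_COUNT - 1)] = value;
  // Release: the slot contents become visible before the new position.
  atomic_store_explicit(&queue->write_position, write + 1,
                        memory_order_release);
  return true;
}

// Consumer side. The acquire load pairs with the producer's release store.
static bool demo_spsc_pop(demo_spsc_t* queue, uint32_t* out_value) {
  uint64_t read =
      atomic_load_explicit(&queue->read_position, memory_order_relaxed);
  uint64_t write =
      atomic_load_explicit(&queue->write_position, memory_order_acquire);
  if (read == write) return false;  // Empty.
  *out_value = queue->slots[read & (DEMO_SLOT_COUNT - 1)];
  // Release: the slot is handed back to the producer only after the read.
  atomic_store_explicit(&queue->read_position, read + 1, memory_order_release);
  return true;
}
```

Because the struct holds no pointers, the same bytes can be mapped at different virtual addresses in two processes, which is the property a shared-memory queue needs.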

Cross-process notification

Extends the notification system to support shared-memory epochs and cross-process wake across all three backends. The core change is an epoch_ptr indirection so notifications can point at either the inline epoch (local, zero behavioral change) or a caller-provided epoch in shared memory.
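The epoch_ptr indirection can be pictured roughly as follows. This is a sketch under assumptions, not the iree_async_notification_t definition: the demo_* struct, field names, and 32-bit epoch width are invented for illustration, and the wake step (futex, eventfd, or SetEvent) is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical notification: epoch_ptr targets either the inline epoch
// (local case) or a caller-provided epoch living in shared memory.
typedef struct demo_notification_t {
  _Atomic uint32_t* epoch_ptr;
  _Atomic uint32_t inline_epoch;
} demo_notification_t;

// Local notification: epoch_ptr points at the inline field, so behavior is
// identical to accessing the field directly.
static void demo_notification_initialize_local(demo_notification_t* n) {
  atomic_init(&n->inline_epoch, 0);
  n->epoch_ptr = &n->inline_epoch;
}

// Shared notification: the caller owns the epoch storage.
static void demo_notification_initialize_shared(
    demo_notification_t* n, _Atomic uint32_t* shared_epoch) {
  assert(shared_epoch != NULL);
  atomic_init(&n->inline_epoch, 0);  // Unused in shared mode.
  n->epoch_ptr = shared_epoch;
}

// All call sites go through epoch_ptr and never touch inline_epoch directly.
static void demo_notification_signal(demo_notification_t* n) {
  atomic_fetch_add_explicit(n->epoch_ptr, 1, memory_order_release);
  // A real implementation would wake waiters here.
}

static uint32_t demo_notification_query_epoch(const demo_notification_t* n) {
  return atomic_load_explicit(n->epoch_ptr, memory_order_acquire);
}
```

Two notifications initialized against the same shared epoch observe each other's signals, while a local notification is unaffected.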

Per-platform wake mechanisms:

  • io_uring: shared futex (FUTEX mode) or caller-provided eventfd (EVENT mode)
  • POSIX/Linux: shared futex + caller eventfd for poll loop wake
  • POSIX/macOS: poll() on wake fd (no futex, no cross-process condvar)
  • IOCP: RegisterWaitForSingleObject bridges caller Event to IOCP; WakeByAddress is per-process (virtual-address keyed), so cross-process wake uses the SetEvent path

Shared buffer pool

Extends iree_async_buffer_pool_t with create_shared/open_shared for cross-process zero-copy. The atomic freelist (64-bit CAS, position-independent indices) lives directly in shared memory so both processes can independently acquire and release buffers. Header magic is written last as a commit step so openers never see a valid header with uninitialized freelist state.
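One way to build a freelist with 64-bit CAS and position-independent indices is an index-linked stack whose head word packs a generation counter next to the head index. The sketch below assumes that packing, the demo_* names, and a fixed buffer count; the source does not spell out the real layout.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_INDEX_NONE UINT32_MAX
#define DEMO_BUFFER_COUNT 4

// Hypothetical shared state. Links are buffer indices rather than pointers,
// so the structure is valid at any mapping address.
typedef struct demo_freelist_t {
  _Atomic uint64_t packed_head;              // (generation << 32) | head index.
  _Atomic uint32_t next[DEMO_BUFFER_COUNT];  // Per-buffer next-free index.
} demo_freelist_t;

static uint64_t demo_pack(uint32_t generation, uint32_t index) {
  return ((uint64_t)generation << 32) | index;
}

// Links every buffer into the freelist: 0 -> 1 -> ... -> NONE.
static void demo_freelist_initialize(demo_freelist_t* list) {
  for (uint32_t i = 0; i < DEMO_BUFFER_COUNT; ++i) {
    uint32_t next = (i + 1 < DEMO_BUFFER_COUNT) ? i + 1 : DEMO_INDEX_NONE;
    atomic_init(&list->next[i], next);
  }
  atomic_init(&list->packed_head, demo_pack(0, 0));
}

// Pops a buffer index. The generation changes on every update, so a stale
// head value fails the CAS even if the same index is back on top (ABA).
static bool demo_freelist_acquire(demo_freelist_t* list, uint32_t* out_index) {
  uint64_t expected =
      atomic_load_explicit(&list->packed_head, memory_order_acquire);
  for (;;) {
    uint32_t index = (uint32_t)expected;
    if (index == DEMO_INDEX_NONE) return false;  // Pool exhausted.
    uint32_t next =
        atomic_load_explicit(&list->next[index], memory_order_relaxed);
    uint64_t desired = demo_pack((uint32_t)(expected >> 32) + 1, next);
    if (atomic_compare_exchange_weak_explicit(
            &list->packed_head, &expected, desired, memory_order_acq_rel,
            memory_order_acquire)) {
      *out_index = index;
      return true;
    }
  }
}

// Pushes a buffer index back onto the freelist.
static void demo_freelist_release(demo_freelist_t* list, uint32_t index) {
  assert(index < DEMO_BUFFER_COUNT);
  uint64_t expected =
      atomic_load_explicit(&list->packed_head, memory_order_relaxed);
  for (;;) {
    atomic_store_explicit(&list->next[index], (uint32_t)expected,
                          memory_order_relaxed);
    uint64_t desired = demo_pack((uint32_t)(expected >> 32) + 1, index);
    if (atomic_compare_exchange_weak_explicit(
            &list->packed_head, &expected, desired, memory_order_release,
            memory_order_relaxed)) {
      return;
    }
  }
}
```

A buffer's address in each process is then its mapping base plus index times buffer size, so the two sides never exchange raw pointers.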

Proactor performance improvements

  • IOCP event waits: Replace polling-based event wait with NtAssociateWaitCompletionPacket for zero-overhead kernel-level event-to-IOCP association.
  • IOCP carrier freelist: Eliminates per-I/O heap allocation in steady state by recycling carrier structs through an atomic slist freelist. Pool grows on demand, drained at destroy.
  • IOCP active_carriers: Doubly-linked list for O(1) carrier removal (was O(n) list walk on every event wait completion).
  • Timer insertion fast-path: Tail comparison before list walk turns the common case (monotonically-increasing deadlines) from O(n) to O(1).
  • Inline progress callbacks: Proactors can register progress callbacks that run inline during the event loop, enabling adaptive polling strategies without full wakeup overhead.
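The timer insertion fast-path in the list above amounts to checking the tail before walking. A minimal sketch on a sorted singly-linked list, with hypothetical demo_* types that do not mirror the proactor's actual timer structures:

```c
#include <stddef.h>
#include <stdint.h>

typedef struct demo_timer_t {
  uint64_t deadline_ns;
  struct demo_timer_t* next;
} demo_timer_t;

// List kept sorted by ascending deadline, with a cached tail pointer.
typedef struct demo_timer_list_t {
  demo_timer_t* head;
  demo_timer_t* tail;
} demo_timer_list_t;

static void demo_timer_list_insert(demo_timer_list_t* list,
                                   demo_timer_t* timer) {
  timer->next = NULL;
  if (!list->head) {
    list->head = list->tail = timer;
    return;
  }
  // Fast path: deadlines that are not earlier than the tail append in O(1).
  if (timer->deadline_ns >= list->tail->deadline_ns) {
    list->tail->next = timer;
    list->tail = timer;
    return;
  }
  // Slow path: the deadline is earlier than the tail's, so the walk always
  // stops at some node before the end and needs no end-of-list branch.
  demo_timer_t** link = &list->head;
  while ((*link)->deadline_ns <= timer->deadline_ns) {
    link = &(*link)->next;
  }
  timer->next = *link;
  *link = timer;
}
```

Inserting deadlines 10, 20, 30 takes the fast path each time; inserting 15 afterwards walks from the head and lands between 10 and 20.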

Bug fixes

  • Send data lifetime in io_uring: Data referenced by send SQEs submitted from recv callbacks could be read after the submitting function's stack frame had unwound (a use-after-return) if the kernel had not yet consumed it. Fixed by copying send data inline when immediate submission is possible. New data_lifetime_test in the socket CTS.
  • Axis failure propagation: When an axis operation failed, the failure status was silently dropped instead of propagating to semaphores waiting on that axis value.

Testing infrastructure

Adds a coordinated multi-process test harness (iree/testing/). A single test binary re-executes itself in different roles, with the launcher orchestrating spawn order, readiness synchronization, and result collection. Supports Linux, macOS, and Windows. First consumer is the SHM carrier cross-process tests on the remote-hal branch.
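The harness commit message later in this PR names a --iree_test_role=<name> flag for role-based dispatch. A sketch of how a re-executed binary might find its role is below; the helper name and the convention that a missing flag means "act as the launcher" are assumptions, not the harness's documented behavior.

```c
#include <stddef.h>
#include <string.h>

// Returns the role name following "--iree_test_role=", or NULL when the flag
// is absent (assumed here to mean the process is the launcher).
static const char* demo_find_test_role(int argc, char** argv) {
  static const char kPrefix[] = "--iree_test_role=";
  const size_t prefix_length = sizeof(kPrefix) - 1;
  for (int i = 1; i < argc; ++i) {
    if (strncmp(argv[i], kPrefix, prefix_length) == 0) {
      return argv[i] + prefix_length;
    }
  }
  return NULL;
}
```

A main() built on this would branch on the returned string, for example running the server body for "server" and the client body for "client", and fall through to spawning children when it returns NULL.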

Also adds iree_async_primitive_dup()/close() for cross-process handle transfer: the async primitive layer's equivalents of dup()/close() (POSIX), DuplicateHandle/CloseHandle (Windows), and mach_port_mod_refs/deallocate (macOS).

@benvanik benvanik added the runtime Relating to the IREE runtime library label Mar 7, 2026
benvanik and others added 19 commits March 7, 2026 02:09
Add iree_shm_* API for cross-platform shared memory create/open/close
with handle passing support. This is the foundation for the SHM carrier
stack (SPSC ring, cross-process notification, SHM carrier).

Platform implementations:
- Linux: memfd_create + mmap, with F_ADD_SEALS to prevent peer truncation
- macOS: shm_open + immediate shm_unlink + mmap, with EEXIST retry loop
- Windows: CreateFileMappingW + MapViewOfFile, Local\ namespace for named

Includes fstat size validation on POSIX open paths to fail early on size
mismatches (matching Windows MapViewOfFile behavior, avoiding SIGBUS).

Co-Authored-By: Claude <noreply@anthropic.com>
Adds a lock-free single-producer single-consumer queue operating on
caller-provided memory, designed for cross-process shared memory
regions. The queue uses monotonically increasing 64-bit positions with
acquire-release ordering and cache-line isolation between producer and
consumer fields to eliminate false sharing.

Entry format: 4-byte length prefix + payload + alignment padding.
Skip markers (UINT32_MAX) at the data region tail signal the consumer
to wrap to offset 0, avoiding split reads across the wrap boundary.

The header contains a magic number and ABI version for strict
validation on open — no forward compatibility.

API includes one-shot write/read, zero-copy begin_write/commit_write
with deferred data writes (all mutations happen in commit_write behind
a single release store), and peek/consume for zero-copy reads.

Includes 26 gtest cases (initialization, validation, wrapping, skip
markers, two-phase writes, and 4 multi-threaded stress tests) and
Google Benchmark for throughput/latency measurement.

Co-Authored-By: Claude <noreply@anthropic.com>
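The entry format and skip-marker rule from the commit message above can be sketched as a placement calculation. The 8-byte entry alignment, the power-of-two capacity, and the demo_* names are assumptions; the sketch also omits the free-space check against the consumer position that a real producer needs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_ENTRY_ALIGNMENT 8u  // Assumed alignment of each entry.

typedef struct demo_placement_t {
  bool needs_skip_marker;   // Write UINT32_MAX at the current offset first.
  uint64_t entry_position;  // Monotonic position where the entry begins.
  uint64_t entry_size;      // 4-byte length prefix + payload + padding.
} demo_placement_t;

// Decides where an entry with |payload_length| bytes goes, given the
// producer's monotonic |write_position| and the data region |capacity|.
static demo_placement_t demo_place_entry(uint64_t write_position,
                                         uint64_t capacity,
                                         uint32_t payload_length) {
  assert(capacity != 0 && (capacity & (capacity - 1)) == 0);
  demo_placement_t placement;
  placement.entry_size =
      (sizeof(uint32_t) + (uint64_t)payload_length + DEMO_ENTRY_ALIGNMENT - 1) &
      ~(uint64_t)(DEMO_ENTRY_ALIGNMENT - 1);
  assert(placement.entry_size <= capacity);
  uint64_t offset = write_position & (capacity - 1);
  uint64_t tail_space = capacity - offset;
  // If the entry would straddle the end of the region, the tail is skipped
  // and the entry starts at offset 0 of the next lap.
  placement.needs_skip_marker = placement.entry_size > tail_space;
  placement.entry_position = placement.needs_skip_marker
                                 ? write_position + tail_space
                                 : write_position;
  return placement;
}
```

With a 64-byte region and the producer at position 48, a 20-byte payload needs 24 bytes, does not fit in the 16-byte tail, and is placed at position 64 behind a skip marker; a 10-byte payload needs 16 bytes and fits in place.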
Extend the notification system to support shared-memory epochs and
cross-process wake mechanisms across all three backends (io_uring,
POSIX, IOCP). This is a prerequisite for the SHM carrier, which will
map a shared memory region containing an epoch counter and use the
notification signal/wait API for cross-process wakeup.

Core change: add epoch_ptr indirection so notifications can point at
either the inline epoch (local, zero behavioral change) or a
caller-provided epoch in shared memory. All ~30 call sites mechanically
change from &notification->epoch to notification->epoch_ptr.

Add IREE_ASYNC_NOTIFICATION_FLAG_SHARED flag that controls three
behavioral branches: destroy skips closing caller-owned primitives,
Linux futex calls omit FUTEX_PRIVATE_FLAG for physical-page hashing,
and macOS condvar initialization is skipped (process-local, useless
cross-process).

Per-platform details:
- io_uring FUTEX mode: futex on shared address suffices, no extra fd.
- io_uring EVENT mode: uses caller-provided eventfd for POLL_ADD+READ.
- POSIX Linux: shared futex + caller eventfd/pipe for poll loop wake.
- POSIX macOS: poll() on wake fd for sync wait (no futex, no condvar).
- IOCP: RegisterWaitForSingleObject bridges caller Event to IOCP;
  WaitOnAddress/WakeByAddress are per-process (virtual-address keyed),
  so cross-process wake uses the SetEvent path.

New API: iree_async_notification_create_shared() with options struct
specifying the epoch address, wake primitive, and signal primitive.

Includes shared futex variants (iree_futex_wait_shared/wake_shared)
that omit FUTEX_PRIVATE_FLAG on Linux for cross-process operation.

CTS: 8 new test cases exercising shared epoch signal/query, sync/async
wait, destroy-doesn't-close-primitives, multiple cycles, two
notifications on one epoch, cross-notification sync wait, and timeout.
Tests run on all backends including a new io_uring_no_futex variant
that exercises the EVENT mode shared path.

Co-Authored-By: Claude <noreply@anthropic.com>
Use epoch_ptr indirection instead of direct epoch field access for
shared notifications whose epoch lives in shared memory. Fix comment
about WaitOnAddress being per-process (virtual address keyed).

Co-Authored-By: Claude <noreply@anthropic.com>
Most timer registrations use monotonically-increasing deadlines
(connection timeouts, heartbeats, RPC deadlines), so comparing against
the tail before walking turns the common case from O(n) to O(1).

Also removes the now-dead walk-to-end branch since the fast-path
handles all deadline >= tail cases.

Co-Authored-By: Claude <noreply@anthropic.com>
Add a prev pointer to iree_async_iocp_carrier_t so that event wait
carrier removal from the active_carriers list is O(1) instead of O(n).
The completion dispatch path previously walked the entire list to find
the predecessor; at scale (128+ registered event sources) this adds up.

Co-Authored-By: Claude <noreply@anthropic.com>
Adds support for registering inline progress callbacks on proactors,
enabling efficient polling-based progress notification without requiring
full event loop wakeups.


Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Replace polling-based event wait with NT native wait completion packet
API for zero-overhead kernel-level event-to-IOCP association.

Co-Authored-By: Claude <noreply@anthropic.com>
Add iree_shm_seal() and iree_shm_query_seals() to support making shared
memory regions immutable after population (e.g. sealing model weights).

Platform implementations:
- Linux: kernel-level sealing via memfd F_SEAL_* (anonymous regions
  only). SEAL_WRITE uses munmap/seal/remap to avoid ASAN interference
  with mprotect-based approaches. On failure, mapping is either rolled
  back or fully torn down to prevent half-valid state.
- Windows: VirtualProtect(PAGE_READONLY) defense-in-depth for WRITE
  seal, VirtualQuery for querying. SHRINK/GROW/SEAL are inherent no-ops.
- macOS: returns IREE_STATUS_UNAVAILABLE (no kernel sealing support).

Also fixes memfd_create to pass MFD_ALLOW_SEALING (was missing, causing
all fcntl seal operations to silently fail) and makes iree_shm_map_fd
seal-aware so that opening a write-sealed region maps it read-only.

Co-Authored-By: Claude <noreply@anthropic.com>
Eliminates per-I/O heap allocation in steady state by recycling carrier
structs through an atomic slist freelist. Carriers are pushed to the
freelist on completion dispatch instead of being freed, and popped on
the next submit instead of malloc. The pool grows on demand (no
preseeding) and is drained at proactor destroy.

Co-Authored-By: Claude <noreply@anthropic.com>
Extend iree_async_buffer_pool_t with create_shared/open_shared for
cross-process zero-copy via shared memory. The atomic freelist (64-bit
CAS, position-independent indices) is placed directly in caller-provided
shared memory so both processes can independently acquire and release
buffers from a single pool.

Shared memory layout uses IREE_STRUCT_LAYOUT with cache-line-isolated
sections: immutable header (magic/version/geometry), freelist packed
state, and slot array. Header magic is written last as a commit step
so openers never see a valid header with uninitialized freelist state.

CTS test suite covers storage sizing, create/open lifecycle, header
validation (magic, version, buffer_size, buffer_count, alignment,
memory size), cross-handle freelist visibility, buffer data coherence
across mappings, no-duplicate-index guarantees, and concurrent
multi-thread stress.

Co-Authored-By: Claude <noreply@anthropic.com>
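The "magic written last" commit step from the message above can be illustrated with a release store on create paired with an acquire load on open. The header fields, magic value, and demo_* names are hypothetical, and the freelist initialization that would precede the commit is elided.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_POOL_MAGIC 0x4C4F4F50u  // Arbitrary value for the example.
#define DEMO_POOL_VERSION 1u

// Hypothetical header at the start of the shared region. The region is
// assumed to be zero-filled before creation.
typedef struct demo_pool_header_t {
  _Atomic uint32_t magic;
  uint32_t version;
  uint32_t buffer_size;
  uint32_t buffer_count;
} demo_pool_header_t;

// Creator: populate everything, then publish by storing the magic last.
static void demo_pool_create(void* memory, uint32_t buffer_size,
                             uint32_t buffer_count) {
  demo_pool_header_t* header = (demo_pool_header_t*)memory;
  header->version = DEMO_POOL_VERSION;
  header->buffer_size = buffer_size;
  header->buffer_count = buffer_count;
  // (Freelist state would be initialized here, before the commit.)
  atomic_store_explicit(&header->magic, DEMO_POOL_MAGIC, memory_order_release);
}

// Opener: a valid magic observed with acquire implies all earlier writes by
// the creator are visible; geometry is then checked strictly.
static bool demo_pool_open(void* memory, uint32_t expected_buffer_size,
                           uint32_t expected_buffer_count) {
  demo_pool_header_t* header = (demo_pool_header_t*)memory;
  if (atomic_load_explicit(&header->magic, memory_order_acquire) !=
      DEMO_POOL_MAGIC) {
    return false;  // Not created yet, or not a pool.
  }
  return header->version == DEMO_POOL_VERSION &&
         header->buffer_size == expected_buffer_size &&
         header->buffer_count == expected_buffer_count;
}
```

An opener that races with creation simply sees an invalid magic and fails; it never sees a valid header over half-initialized state.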
Add iree_async_primitive_dup() and iree_async_primitive_close() for
cross-process handle transfer. These are the async primitive layer's
equivalents of dup()/close() (POSIX), DuplicateHandle/CloseHandle
(Windows), and mach_port_mod_refs/deallocate (macOS).

Needed by the SHM handshake to duplicate shared_wake signal primitives
for IPC exchange, and by the factory to dup accepted socket primitives
before passing them to the handshake (which takes ownership).

Also adds primitive_test covering construction helpers (none, is_none,
make, from_fd, from_win32_handle, from_mach_port), dup semantics
(NONE fails, produces independent handle, multiple dups independent),
and close semantics (NONE/NULL noop, sets to NONE).

Co-Authored-By: Claude <noreply@anthropic.com>
Multi-process test infrastructure for IREE. A single test binary
re-executes itself in different roles, with the launcher orchestrating
spawn order, readiness synchronization, and result collection.

The harness provides:
- Role-based dispatch via --iree_test_role=<name> flags
- Ready file protocol for ordered startup (server before client)
- Shared temp directory for inter-process data exchange
- Overall timeout with forcible kill on expiry
- Exit code collection and summary reporting
- IREE_TRACE_ZONE instrumentation on all substantial operations

Platform support:
- Linux: /proc/self/exe + posix_spawn
- macOS: _NSGetExecutablePath + posix_spawn
- Windows: GetModuleFileNameA + CreateProcessA

Two usage patterns:
- Gtest integration: link coordinated_test_main instead of gtest_main,
  register config with IREE_COORDINATED_TEST_REGISTER, call
  iree_coordinated_test_run() from TEST bodies
- Standalone: call iree_coordinated_test_main() from main()

First consumer will be SHM carrier cross-process tests.

Co-Authored-By: Claude <noreply@anthropic.com>
Ensures data referenced by send SQEs submitted from recv callbacks is
read before the submitting function's stack frame unwinds. Fixes
io_uring proactor to copy send data inline when immediate submission
is possible, avoiding use-after-return for stack-allocated buffers.

Adds data_lifetime_test to the socket CTS to verify correct behavior.

Co-Authored-By: Claude <noreply@anthropic.com>
Ensures that when an axis operation fails, the failure status is
correctly propagated to all semaphores waiting on that axis value,
rather than silently dropping the error.

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Claude <noreply@anthropic.com>
writev() can return a short write (fewer bytes than requested) when the
kernel send buffer is nearly full. The POSIX proactor was treating any
positive return as a complete send, dropping unsent bytes. This is
particularly visible on macOS with small SO_SNDBUF values where writev()
routinely returns partial byte counts.

Fix execute_send to detect short writes, advance the iovec past bytes
already consumed, and return WOULD_BLOCK so the operation stays in the
chain for POLLOUT-driven retry. bytes_sent accumulates across retries.

Co-Authored-By: Claude <noreply@anthropic.com>
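The "advance the iovec past bytes already consumed" step from the message above might look like the helper below. It is a sketch with a hypothetical name and signature, not the proactor's execute_send code.

```c
#include <stddef.h>
#include <sys/uio.h>

// Advances |iovecs| in place past |bytes_consumed| bytes, as reported by a
// short writev(). Returns the index of the first iovec that still has data
// to send, or |count| when everything was written.
static size_t demo_iovec_advance(struct iovec* iovecs, size_t count,
                                 size_t bytes_consumed) {
  size_t first = 0;
  // Skip iovecs that were written in full.
  while (first < count && bytes_consumed >= iovecs[first].iov_len) {
    bytes_consumed -= iovecs[first].iov_len;
    ++first;
  }
  // Trim the partially written iovec so a retry resumes mid-buffer.
  if (first < count) {
    iovecs[first].iov_base = (char*)iovecs[first].iov_base + bytes_consumed;
    iovecs[first].iov_len -= bytes_consumed;
  }
  return first;
}
```

On retry after POLLOUT, the caller would pass iovecs + first and count - first to writev() and add each return value to its running bytes_sent total.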
- Android: return UNAVAILABLE from iree_shm_create_named and
  iree_shm_open_named since bionic lacks shm_open/shm_unlink.
  Anonymous shared memory (memfd_create) is unaffected.
- Coordinated tests: detect binfmt_misc interpreters (QEMU user-mode,
  Wine, FEX-Emu) by comparing /proc/self/exe against argv[0]. When
  they differ, use argv[0] for child re-execution since the kernel
  transparently invokes the interpreter via binfmt_misc.
- SharedBufferPoolTest: replace memset with C++ value initialization
  to fix GCC -Werror=class-memaccess on types with default initializers.

Co-Authored-By: Claude <noreply@anthropic.com>
@benvanik benvanik force-pushed the users/benvanik/async-improvements branch from ab77dc1 to a91a278 on March 7, 2026 10:41
- shm_test: exclude named SHM tests on Android (bionic lacks
  shm_open/shm_unlink). Anonymous SHM tests still run.
- coordinated_test_test: add noriscv label to exclude under QEMU
  user-mode where the test cannot re-exec children without binfmt_misc.

Co-Authored-By: Claude <noreply@anthropic.com>
@benvanik benvanik marked this pull request as ready for review March 7, 2026 16:32
@benvanik benvanik merged commit 3a4c991 into main Mar 7, 2026
61 checks passed
@benvanik benvanik deleted the users/benvanik/async-improvements branch March 7, 2026 16:36


3 participants