Fix fd close race condition in xev bridge and io_uring #445

Merged
heartwilltell merged 1 commit into codex/runtime-poller from claude/fix-poller-segfault-wyDje
Apr 26, 2026

@heartwilltell

Summary

This PR fixes a critical race condition in the xev bridge's file descriptor close handling: completion storage could be reset while its CQE was still pending in the io_uring ring, so libxev would later invoke a completion whose op had become .noop.

Key Changes

  • Extended drain loop in run_xev_close_fd: Increased drain count from 32 to 64 iterations to allow more time for cancel completions to retire before resetting completion storage.

  • Added cancel completion state checks: Modified the drain loop to also check read_cancel and write_cancel completion states, not just the main read/write completions, ensuring all pending operations are fully retired.

  • Conditional completion reset: Changed from unconditionally resetting all completion storage to only resetting completions that have actually retired (state != .active). This prevents overwriting storage while CQEs are still pending.

  • Added io_uring CQE safety check: Added a guard in the io_uring backend's CQE processing loop to skip completions whose flags.state is no longer .active (for example, storage that was reset so its op is now .noop), so Completion.invoke() never reaches its `.noop => unreachable` branch.

  • Added CANCELED error handling: Added explicit handling for the CANCELED errno in poll operation results, so a cancelled poll's CQE is converted to an error instead of being reported as an unexpected errno.
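
The first three changes can be sketched as a small model. This is not the code from run_xev_bridge.zig: `Slot`, `Completion`, `drainAndReset`, `resetIfRetired`, and the `tick` callback are hypothetical stand-ins, and only the budget of 64 and the four completion names come from the description above.

```zig
const std = @import("std");

// Hypothetical stand-ins for the libxev completion and the bridge slot.
const State = enum { dead, active };
const Op = enum { noop, poll, cancel };

const Completion = struct {
    state: State = .dead,
    op: Op = .noop,
};

const Slot = struct {
    read: Completion = .{},
    write: Completion = .{},
    read_cancel: Completion = .{},
    write_cancel: Completion = .{},

    // True while any of the four completions may still have a CQE in flight.
    fn anyActive(self: *const Slot) bool {
        return self.read.state == .active or
            self.write.state == .active or
            self.read_cancel.state == .active or
            self.write_cancel.state == .active;
    }
};

const drain_budget: usize = 64;

// Reset a completion only if it has retired; leave active storage untouched.
fn resetIfRetired(c: *Completion) bool {
    if (c.state == .active) return false;
    c.* = .{};
    return true;
}

// `tick` stands in for one non-blocking pass of the event loop (assumption).
// Returns how many of the four completions were reset.
fn drainAndReset(slot: *Slot, tick: *const fn (*Slot) void) usize {
    var i: usize = 0;
    while (i < drain_budget and slot.anyActive()) : (i += 1) {
        tick(slot);
    }
    var n: usize = 0;
    if (resetIfRetired(&slot.read)) n += 1;
    if (resetIfRetired(&slot.write)) n += 1;
    if (resetIfRetired(&slot.read_cancel)) n += 1;
    if (resetIfRetired(&slot.write_cancel)) n += 1;
    return n;
}

test "active cancel completion survives the reset" {
    var slot = Slot{};
    slot.read = .{ .state = .active, .op = .poll };
    slot.read_cancel = .{ .state = .active, .op = .cancel };
    const T = struct {
        // Retires only the poll; the cancel CQE never arrives in this model.
        fn tick(s: *Slot) void {
            s.read.state = .dead;
        }
    };
    const n = drainAndReset(&slot, &T.tick);
    try std.testing.expectEqual(@as(usize, 3), n);
    try std.testing.expect(slot.read_cancel.op == .cancel);
    try std.testing.expect(slot.read.op == .noop);
}
```

In this model the drain budget runs out with read_cancel still active, so its storage keeps a valid op for the CQE that arrives later.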

Implementation Details

The root cause was that cancel completions could still have pending CQEs in the io_uring ring after the drain loop exhausted its budget. When the slot was then reused for a new fd, the old completion storage would be overwritten, causing the eventual CQE to reference invalid operation data. The fix uses a defensive approach: drain more aggressively, check all completion types, conditionally reset only retired completions, and add a safety check in the backend to handle any remaining edge cases gracefully.

https://claude.ai/code/session_01CSYWLMHrkzjcwCbEwHtCLT

The drain loop in run_xev_close_fd waited only for the original read/write
completions to retire, ignoring the cancel completions we submitted to
libxev. When the cancel CQE was still pending in the io_uring ring, the
function would zero slot.read_cancel = .{}, setting its op to .noop. When
the kernel eventually delivered that CQE, libxev's Completion.invoke()
hit `.noop => unreachable`, crashing with SIGSEGV in
test_poller_close_while_waiting.
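
The failure sequence described above can be reproduced in a reduced model. The types here are hypothetical, not libxev's: `Cqe` only carries the pointer given at submit time, and `invoke` returns an error where the real Completion.invoke() has `.noop => unreachable`, so that the sketch can run.

```zig
const std = @import("std");

const Op = enum { noop, poll, cancel };

// Hypothetical completion: a default-initialised value has op == .noop.
const Completion = struct {
    op: Op = .noop,
};

// Hypothetical CQE: hands back the pointer that was stored at submit time.
const Cqe = struct { user_data: *Completion };

const InvokeError = error{NoopInvoked};

// Models Completion.invoke(). The real code has `.noop => unreachable`;
// an error is returned here instead so the model does not crash.
fn invoke(c: *Completion) InvokeError!Op {
    switch (c.op) {
        .noop => return error.NoopInvoked,
        .poll, .cancel => return c.op,
    }
}

test "resetting storage under a pending CQE" {
    var read_cancel = Completion{ .op = .cancel };
    // Submitted to the ring, not yet delivered.
    const pending = Cqe{ .user_data = &read_cancel };

    // The old unconditional reset: same address, op is now .noop.
    read_cancel = .{};

    // Delivery finds the zeroed storage.
    try std.testing.expectError(error.NoopInvoked, invoke(pending.user_data));
}
```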

Three layered fixes:

1. run_xev_bridge.zig: Drain loop now also waits for read_cancel and
   write_cancel completions to reach .dead before resetting their
   storage. Reset is now conditional: any completion that is somehow
   still .active (drain exhausted) is left alone so its eventual CQE
   finds a valid op.

2. libxev io_uring.zig: Mirror the existing kqueue.zig defensive guard —
   skip processing CQEs for completions whose flags.state is no longer
   .active. This prevents the unreachable crash even if a future caller
   resets a completion mid-flight.

3. libxev io_uring.zig: Handle .CANCELED in the .poll case so the
   cancelled poll's CQE no longer prints "unexpected errno: 125" to
   stderr.
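
Fixes 2 and 3 can be sketched as follows. This is a model under assumptions, not the libxev source: `processCqes`, `pollResult`, `PollError`, and the `Cqe` and `Completion` types are hypothetical names. The only value taken from outside the sketch is 125, which is ECANCELED on Linux and matches the "unexpected errno: 125" message quoted above.

```zig
const std = @import("std");

const State = enum { dead, active };

// Hypothetical completion; `invoked` counts callback runs for the model.
const Completion = struct {
    state: State = .dead,
    invoked: usize = 0,
};

const Cqe = struct { user_data: *Completion, res: i32 };

const ECANCELED: i32 = 125; // Linux errno value

const PollError = error{ Canceled, Unexpected };

// Map a poll CQE result to a value or an error (fix 3 in the model).
fn pollResult(res: i32) PollError!i32 {
    if (res >= 0) return res;
    if (res == -ECANCELED) return error.Canceled;
    return error.Unexpected;
}

// Process CQEs, skipping any whose completion is no longer active
// (fix 2 in the model). Returns the number of skipped CQEs.
fn processCqes(cqes: []const Cqe) usize {
    var skipped: usize = 0;
    for (cqes) |cqe| {
        const c = cqe.user_data;
        if (c.state != .active) {
            skipped += 1;
            continue;
        }
        c.state = .dead;
        c.invoked += 1;
    }
    return skipped;
}

test "stale CQE is skipped and ECANCELED maps to an error" {
    var live = Completion{ .state = .active };
    var reset = Completion{}; // storage that was reset mid-flight
    const cqes = [_]Cqe{
        .{ .user_data = &live, .res = 1 },
        .{ .user_data = &reset, .res = -ECANCELED },
    };
    try std.testing.expectEqual(@as(usize, 1), processCqes(&cqes));
    try std.testing.expectEqual(@as(usize, 1), live.invoked);
    try std.testing.expectEqual(@as(usize, 0), reset.invoked);
    try std.testing.expectError(error.Canceled, pollResult(-ECANCELED));
}
```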

https://claude.ai/code/session_01CSYWLMHrkzjcwCbEwHtCLT
@heartwilltell heartwilltell self-assigned this Apr 26, 2026
@heartwilltell heartwilltell merged commit 69f7f78 into codex/runtime-poller Apr 26, 2026
14 checks passed
@heartwilltell heartwilltell deleted the claude/fix-poller-segfault-wyDje branch April 26, 2026 10:36
2 participants