Fix fd close race condition in xev bridge and io_uring #445
Merged
heartwilltell merged 1 commit on Apr 26, 2026
The drain loop in run_xev_close_fd waited only for the original read/write
completions to retire, ignoring the cancel completions we submitted to
libxev. When the cancel CQE was still pending in the io_uring ring, the
function would still reset slot.read_cancel to .{}, setting its op to .noop. When
the kernel eventually delivered that CQE, libxev's Completion.invoke()
hit `.noop => unreachable`, crashing with SIGSEGV in
test_poller_close_while_waiting.
Three layered fixes:
1. run_xev_bridge.zig: Drain loop now also waits for read_cancel and
write_cancel completions to reach .dead before resetting their
storage. Reset is now conditional: any completion that is somehow
still .active (drain exhausted) is left alone so its eventual CQE
finds a valid op.
2. libxev io_uring.zig: Mirror the existing kqueue.zig defensive guard —
skip processing CQEs for completions whose flags.state is no longer
.active. This prevents the unreachable crash even if a future caller
resets a completion mid-flight.
3. libxev io_uring.zig: Handle .CANCELED in the .poll case so the
cancelled poll's CQE no longer prints "unexpected errno: 125" to
stderr.
https://claude.ai/code/session_01CSYWLMHrkzjcwCbEwHtCLT
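The first fix can be sketched as a small self-contained Zig model. This is not the code from run_xev_bridge.zig: `Completion`, `Slot`, `FakeLoop`, `drainAndReset`, and the `budget` parameter are hypothetical stand-ins that assume only what the description above states, namely a bounded drain loop that watches all four completions, followed by a reset that skips anything still `.active`.

```zig
const std = @import("std");

// Hypothetical, simplified stand-ins; libxev's real types are richer.
const State = enum { dead, active };
const Op = enum { noop, read, write, cancel };

const Completion = struct {
    op: Op = .noop,
    state: State = .dead,
};

const Slot = struct {
    read: Completion = .{},
    write: Completion = .{},
    read_cancel: Completion = .{},
    write_cancel: Completion = .{},

    // True while any of the four completions may still receive a CQE.
    fn anyActive(self: *const Slot) bool {
        return self.read.state == .active or
            self.write.state == .active or
            self.read_cancel.state == .active or
            self.write_cancel.state == .active;
    }
};

// Reset storage only once the completion has retired.
fn resetIfRetired(c: *Completion) void {
    if (c.state != .active) c.* = .{};
}

// `loop.tick(slot)` stands in for one non-blocking pass of the event loop.
fn drainAndReset(loop: anytype, slot: *Slot, budget: usize) void {
    var i: usize = 0;
    while (i < budget and slot.anyActive()) : (i += 1) {
        loop.tick(slot);
    }
    resetIfRetired(&slot.read);
    resetIfRetired(&slot.write);
    resetIfRetired(&slot.read_cancel);
    resetIfRetired(&slot.write_cancel);
}

// Test double: retires read_cancel after `remaining` ticks.
const FakeLoop = struct {
    remaining: usize,

    fn tick(self: *FakeLoop, slot: *Slot) void {
        if (self.remaining > 0) self.remaining -= 1;
        if (self.remaining == 0) slot.read_cancel.state = .dead;
    }
};

test "exhausted drain leaves an active cancel completion untouched" {
    var slot = Slot{ .read_cancel = .{ .op = .cancel, .state = .active } };
    var loop = FakeLoop{ .remaining = 100 };
    drainAndReset(&loop, &slot, 64);
    // The op is still valid, so a late CQE would not find .noop.
    try std.testing.expect(slot.read_cancel.op == .cancel);
    try std.testing.expect(slot.read_cancel.state == .active);
}
```

In this model, the trade-off is that a slot whose drain budget runs out keeps stale completion storage a little longer; that is harmless here, whereas an early reset is what exposed the `.noop` op to the late CQE.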
Summary
This PR fixes a critical race condition in the xev bridge's file descriptor close handling: completion storage could be reset while its CQEs were still pending in the io_uring ring, causing libxev to invoke completions with invalid `.noop` ops.

Key Changes
- Extended drain loop in `run_xev_close_fd`: Increased the drain count from 32 to 64 iterations to allow more time for cancel completions to retire before completion storage is reset.
- Added cancel completion state checks: The drain loop now also checks the `read_cancel` and `write_cancel` completion states, not just the main read/write completions, so that all pending operations are fully retired.
- Conditional completion reset: Instead of unconditionally resetting all completion storage, only completions that have actually retired (state != .active) are reset. This prevents overwriting storage while CQEs are still pending.
- Added io_uring CQE safety check: A guard in the io_uring backend's CQE processing loop skips completions that have been reset to the `.noop` state, preventing unreachable code paths in `Completion.invoke()`.
- Added CANCELED error handling: Poll operation results now handle the `CANCELED` errno explicitly, converting kernel cancellation signals to error states.

Implementation Details
The root cause was that cancel completions could still have pending CQEs in the io_uring ring after the drain loop exhausted its budget. When the slot was then reused for a new fd, the old completion storage was overwritten, so the eventual CQE referenced invalid operation data. The fix is defensive in four ways: drain more aggressively, check all completion types, reset only retired completions, and add a safety check in the backend to handle any remaining edge cases gracefully.
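The backend-side safety check and the cancellation handling can be illustrated with a minimal Zig sketch. It is a simplified model rather than libxev's io_uring.zig: `processCqe`, `pollOutcome`, `Outcome`, and the two-field `Completion` are hypothetical names, and `ECANCELED` is hard-coded to 125, the value quoted in the "unexpected errno: 125" message.

```zig
const std = @import("std");

// Hypothetical, simplified model of a backend completion.
const State = enum { dead, active };
const Op = enum { noop, poll };
const Outcome = enum { skipped, ready, canceled, unexpected };

const Completion = struct {
    op: Op = .noop,
    state: State = .dead,
};

// Assumed errno value, matching the "unexpected errno: 125" message.
const ECANCELED: i32 = 125;

// Map a CQE result for a poll op: negative values carry -errno.
fn pollOutcome(res: i32) Outcome {
    if (res >= 0) return .ready;
    return switch (-res) {
        ECANCELED => .canceled,
        else => .unexpected,
    };
}

// Handle one CQE for completion `c` with kernel result `res`.
fn processCqe(c: *Completion, res: i32) Outcome {
    // Guard: the completion was reset or already retired, so its op
    // must not be dispatched on.
    if (c.state != .active) return .skipped;
    c.state = .dead;
    return switch (c.op) {
        .noop => unreachable, // only reachable without the guard above
        .poll => pollOutcome(res),
    };
}

test "late CQE for a reset completion is skipped" {
    var c = Completion{}; // reset storage: op == .noop, state == .dead
    try std.testing.expect(processCqe(&c, -ECANCELED) == .skipped);
}

test "cancelled poll maps to a canceled outcome" {
    var c = Completion{ .op = .poll, .state = .active };
    try std.testing.expect(processCqe(&c, -ECANCELED) == .canceled);
    try std.testing.expect(c.state == .dead);
}
```

The guard is deliberately placed before the `switch` on `op`, so that in this model a completion reset mid-flight can never reach the `.noop` branch.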