Skip to content

reactor: retry io_uring socket send on EAGAIN#3435

Open
sunyuechi wants to merge 1 commit into
scylladb:masterfrom
sunyuechi:wip-io-uring-retry-send-eagain
Open

reactor: retry io_uring socket send on EAGAIN#3435
sunyuechi wants to merge 1 commit into
scylladb:masterfrom
sunyuechi:wip-io-uring-retry-send-eagain

Conversation

@sunyuechi

Copy link
Copy Markdown
Contributor

The io_uring backend submits socket writes as IORING_OP_SENDMSG/SEND. When the send buffer is full the kernel can complete the op with -EAGAIN, which io_completion::complete_with() turns into a thrown system_error handed to the caller. The aio/epoll backends instead go through do_sendmsg(), which waits for POLLOUT and resends, so they never leak EAGAIN.

On an EAGAIN completion, fall back to the poll-based do_sendmsg()/ do_send() like the other backends. recvmsg()/recv_some() share the same path but reads do not fill a buffer, so they are left unchanged.

Observed under load on riscv64, where io_uring is the default backend: unittest-seastar-socket's test_preemptive_down() aborts with EAGAIN.

@tchaikov tchaikov left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. but i'd suggest improve the commit message to note that this is caused by the kernel design gap in iouring's socket i/o error handling.

the normal inline path handles socket EAGAIN correctly. but under memory pressure, an allocation for arming a poll fails, the request is pushed to io-wq. io-wq then calls sendmsg in blocking mode, but SOCK_NONBLOCK sockets return EAGAIN immediately. io-wq doesn't retry for these sockets, and leak EAGAIN to userspace.

disk i/o retries on REQ_F_REISSUE, see torvalds/linux@d803d12. but socket i/o does not share this error handling with disk i/o.

in interesting thing is, liburing considers it a contract that io_uring should handle EAGAIN internally if MSG_DONTWAIT is not set in the SQE. see https://github.com/axboe/liburing/blob/a33a770d8379978660036251e14710c6ee8a3d34/test/socket-nb.c#L91-L93.

seastar's iouring backend does not set MSG_DONTWAIT in its non-speculate paths, so, per the kernel's contract, EAGAIN should not reach userspace.

probably we should file a bug against liburing / linux kernel on the gap between them.

@sunyuechi sunyuechi force-pushed the wip-io-uring-retry-send-eagain branch from 3b9bb49 to 5ec3652 Compare May 30, 2026 15:01
@sunyuechi

Copy link
Copy Markdown
Contributor Author

Thanks for the clear analysis. Added a note about the kernel gap to the commit msg. You wanna file the bug or should i? @tchaikov

@tchaikov

Copy link
Copy Markdown
Contributor

Thanks for the clear analysis. Added a note about the kernel gap to the commit msg. You wanna file the bug or should i? @tchaikov

i will try. still trying to get my head around the iouring kernel source.

BTW, my analysis was updated. as i realized that it's not about read/write versus send/recv, it's about disk i/o versus network i/o. as read/write ops on socket also suffer from this issue.

The io_uring backend submits socket writes as IORING_OP_SENDMSG/SEND.
When the send buffer is full the kernel can complete the op with -EAGAIN,
which io_completion::complete_with() turns into a thrown system_error
handed to the caller. The aio/epoll backends instead go through
do_sendmsg(), which waits for POLLOUT and resends, so they never leak
EAGAIN.

This is a gap in the kernel's io_uring error handling: disk I/O retries
-EAGAIN internally via REQ_F_REISSUE (kernel commit d803d123948f), but
socket I/O does not share that path, so -EAGAIN leaks to userspace.

On an EAGAIN completion, fall back to the poll-based do_sendmsg()/
do_send() like the other backends. recvmsg()/recv_some() share the same
path but reads do not fill a buffer, so they are left unchanged.

Observed under load on riscv64, where io_uring is the default backend:
unittest-seastar-socket's test_preemptive_down() aborts with EAGAIN.

Signed-off-by: Sun Yuechi <sunyuechi@iscas.ac.cn>
@sunyuechi sunyuechi force-pushed the wip-io-uring-retry-send-eagain branch from 5ec3652 to 1a4e991 Compare May 30, 2026 15:16
return _r.do_sendmsg(fd, iovs, len);
}
return make_exception_future<size_t>(e);
});

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like adding a continuation here. Although it should be rare that we mis-speculate.

We can inherit from kernel_io_completion instead of io_completion so we capture the -EAGAIN and re-issue the operation from there.

Please do file a bug and add a reference to it in the comment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants