reactor: retry io_uring socket send on EAGAIN#3435
Conversation
There was a problem hiding this comment.
lgtm. but i'd suggest improve the commit message to note that this is caused by the kernel design gap in iouring's socket i/o error handling.
the normal inline path handles socket EAGAIN correctly. but under memory pressure, an allocation for arming a poll fails, the request is pushed to io-wq. io-wq then calls sendmsg in blocking mode, but SOCK_NONBLOCK sockets return EAGAIN immediately. io-wq doesn't retry for these sockets, and leak EAGAIN to userspace.
disk i/o retries on REQ_F_REISSUE, see torvalds/linux@d803d12. but socket i/o does not share this error handling with disk i/o.
in interesting thing is, liburing considers it a contract that io_uring should handle EAGAIN internally if MSG_DONTWAIT is not set in the SQE. see https://github.com/axboe/liburing/blob/a33a770d8379978660036251e14710c6ee8a3d34/test/socket-nb.c#L91-L93.
seastar's iouring backend does not set MSG_DONTWAIT in its non-speculate paths, so, per the kernel's contract, EAGAIN should not reach userspace.
probably we should file a bug against liburing / linux kernel on the gap between them.
3b9bb49 to
5ec3652
Compare
|
Thanks for the clear analysis. Added a note about the kernel gap to the commit msg. You wanna file the bug or should i? @tchaikov |
i will try. still trying to get my head around the iouring kernel source. BTW, my analysis was updated. as i realized that it's not about read/write versus send/recv, it's about disk i/o versus network i/o. as read/write ops on socket also suffer from this issue. |
The io_uring backend submits socket writes as IORING_OP_SENDMSG/SEND. When the send buffer is full the kernel can complete the op with -EAGAIN, which io_completion::complete_with() turns into a thrown system_error handed to the caller. The aio/epoll backends instead go through do_sendmsg(), which waits for POLLOUT and resends, so they never leak EAGAIN. This is a gap in the kernel's io_uring error handling: disk I/O retries -EAGAIN internally via REQ_F_REISSUE (kernel commit d803d123948f), but socket I/O does not share that path, so -EAGAIN leaks to userspace. On an EAGAIN completion, fall back to the poll-based do_sendmsg()/ do_send() like the other backends. recvmsg()/recv_some() share the same path but reads do not fill a buffer, so they are left unchanged. Observed under load on riscv64, where io_uring is the default backend: unittest-seastar-socket's test_preemptive_down() aborts with EAGAIN. Signed-off-by: Sun Yuechi <sunyuechi@iscas.ac.cn>
5ec3652 to
1a4e991
Compare
| return _r.do_sendmsg(fd, iovs, len); | ||
| } | ||
| return make_exception_future<size_t>(e); | ||
| }); |
There was a problem hiding this comment.
I don't like adding a continuation here. Although it should be rare that we mis-speculate.
We can inherit from kernel_io_completion instead of io_completion so we capture the -EAGAIN and re-issue the operation from there.
Please do file a bug and add a reference to it in the comment.
The io_uring backend submits socket writes as IORING_OP_SENDMSG/SEND. When the send buffer is full the kernel can complete the op with -EAGAIN, which io_completion::complete_with() turns into a thrown system_error handed to the caller. The aio/epoll backends instead go through do_sendmsg(), which waits for POLLOUT and resends, so they never leak EAGAIN.
On an EAGAIN completion, fall back to the poll-based do_sendmsg()/ do_send() like the other backends. recvmsg()/recv_some() share the same path but reads do not fill a buffer, so they are left unchanged.
Observed under load on riscv64, where io_uring is the default backend: unittest-seastar-socket's test_preemptive_down() aborts with EAGAIN.