
Conversation

@avikivity (Member) commented Jan 19, 2026

This series adds a reactor backend that does not use linux-aio, even for
disk I/O. Instead, it uses the syscall thread for disk I/O.

The new backend is meant for deterministic replay environments such as rr[1]
that don't support aio APIs like linux-aio or io_uring. They emulate threading
by scheduling one thread at a time, so the limitations of the syscall thread aren't
material.

The series also improves tolerance to ENOSYS on such platforms.

[1] https://rr-project.org/

pollable_fd::write_some() actually translates to send(), which works
only on sockets, not all file descriptors. If we want to support
write() on non-sockets, we'll need methods that plumb down
to read/write. To avoid confusion, rename the current methods to recv/send
to make it clear what they are actually implemented with.
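For illustration, a rough sketch of the renamed surface (hypothetical
signatures, not the exact Seastar declarations):

    class pollable_fd {
    public:
        future<size_t> recv(void* buffer, size_t len);       // backed by recv(2); sockets only
        future<size_t> send(const void* buffer, size_t len); // backed by send(2); sockets only
        future<> send_all(const void* buffer, size_t len);   // loops over send() until done
        // write()-backed counterparts (e.g. write_all) would serve
        // non-socket fds such as pipes and eventfd.
    };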
Copilot AI left a comment

Pull request overview

This PR adds a pure-epoll reactor backend for Seastar that doesn't use linux-aio, enabling support for deterministic replay environments like rr that don't support AIO APIs. The implementation uses the syscall thread for disk I/O instead of linux-aio or io_uring, and improves tolerance to ENOSYS errors on platforms that don't support these syscalls.

Changes:

  • Introduced reactor_backend_epoll_base as a base class by refactoring common functionality from reactor_backend_epoll
  • Added reactor_backend_epoll_pure backend that uses syscall threads for disk I/O
  • Separated write_* methods (for regular file descriptors) from send_* methods (for network sockets) in the pollable_fd API for better semantic clarity

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Summary per file:

  • src/core/reactor_backend.hh: Introduces reactor_backend_epoll_base base class and reactor_backend_epoll_pure derived class; refactors class hierarchy
  • src/core/reactor_backend.cc: Implements pure-epoll backend with syscall thread I/O, refactors epoll backend to use base class, adds ENOSYS error tolerance
  • src/core/reactor.cc: Adds write_all/do_write/do_writev methods for regular file descriptors, renames socket methods to send/recv, improves error handling
  • include/seastar/core/reactor.hh: Adds friend declarations for new backend classes and method declarations for write operations
  • include/seastar/core/internal/pollable_fd.hh: Separates write_* API (file descriptors) from send_* API (network sockets), renames msghdr methods
  • src/util/process.cc: Updates pipe implementation to use pollable_fd and new write_all API instead of io_queue
  • src/net/posix-stack.cc: Updates network code to use renamed send_all/recv/send methods for network operations


auto r = _r._io_sink.drain([this] (const internal::io_request& req, io_completion* completion) {
    memory::scoped_critical_alloc_section _;
    using o = internal::io_request::operation;
    // The returned future will be used to satify the promise in io_completion,
Copilot AI commented Jan 19, 2026

Typo in comment: "satify" should be "satisfy".

Suggested change
// The returned future will be used to satify the promise in io_completion,
// The returned future will be used to satisfy the promise in io_completion,

@avikivity (Member Author):

v2: fixed typo in comment pointed out by @copilot

@nyh (Contributor) commented Jan 19, 2026

Ten years ago (!) I opened an issue asking for this - #66.
After six years, I gave up on the idea and closed the issue.
I guess as they say, better late than never!

@nyh (Contributor) left a comment

Looks good. I've waited for this for ten years (#66) :-)

How did you test this patch, and how do we test all the different reactor backends going forward? Can we run some set of tests for all the existing reactor backends, for example? Do we do anything of this sort?

reactor::do_write(pollable_fd_state& fd, const void* buffer, size_t len) {
    return writeable(fd).then([this, &fd, buffer, len] () mutable {
        auto r = fd.fd.write(buffer, len);
        if (!r) {
Contributor:

When do you expect write() to return 0, and not an error? Why is calling the function again immediately the right thing to do?

@avikivity (Member Author):

When do you expect write() to return 0, and not an error?

The kernel buffer is full, and writeable() returned a ready future due to bad speculation.

Why is calling the function again immediately the right thing to do?

writeable() will return a future that is ready when the buffer is not full.
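Putting the two answers together, the retry shape under discussion looks
roughly like this (a sketch, not the exact patch):

    future<size_t>
    reactor::do_write(pollable_fd_state& fd, const void* buffer, size_t len) {
        return writeable(fd).then([this, &fd, buffer, len] () mutable {
            auto r = fd.fd.write(buffer, len);
            if (!r) {
                // Bad speculation: writeable() was ready, but the kernel
                // buffer was still full. Recurse; this time writeable()
                // yields a future that resolves only when the fd is
                // really writable.
                return do_write(fd, buffer, len);
            }
            return make_ready_future<size_t>(*r);
        });
    }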

Contributor:

Maybe I'm misunderstanding what I'm seeing here. This is a blocking write to the filesystem, not a socket write, right? So why would the kernel return 0 bytes instead of waiting until it can write at least some bytes? Did you see this behavior in practice, or documented somewhere, or is it just defensive programming in case it ever happens?

@avikivity (Member Author):

It is a non-blocking write to a non-disk, non-socket file descriptor (pollable_fd -> non-disk).

The non-socket pollable fds we support are eventfd (for inter-thread communication) and pipes (for communicating with subprocesses).


future<>
reactor::write_all(pollable_fd_state& fd, const void* buffer, size_t len) {
    SEASTAR_ASSERT(len);
Contributor:

Why is this assert necessary? Why is writing 0 bytes not a perfectly valid thing to do (it will write nothing)?
I see below that for len==0, you won't even call this function. So this assert will never trigger. But still, why even bother with it? It's still a useless if(). It seems to me that write_all_part will work fine even if given len=0. Or am I missing something?

@avikivity (Member Author):

It's not necessary.

}
return make_ready_future();
});
return _fd.write_all(buf.get(), buf_size).then([buf = std::move(buf)] {});
Contributor:

I'm not experienced enough with how this works, so I want to ask:
When aio is used, does the new code continue to do the same thing it did in the past (using aio)?

@avikivity (Member Author):

No, it will use write(2). Similar to how socket writes use send*(2).

Contributor:

I don't understand. If we use the old "epoll" backend which uses (right?) classic AIO, did something change? Before it didn't use write()? What did it use?

@avikivity (Member Author):

This patch doesn't change anything. It adds some methods.

The following patch changes pipe writes from using our disk write mechanism (which happened to work with linux-aio but won't with pwrite()) to these new methods.

The trigger is that epoll_pure will issue disk writes with pwrite(). These happen not to work with pipes.

 // (such as timers, signals, inter-thread notifications) into file descriptors
 // using mechanisms like timerfd, signalfd and eventfd respectively.
-class reactor_backend_epoll : public reactor_backend {
+class reactor_backend_epoll_base : public reactor_backend {
Contributor:

I think it will be useful to explain in the comment why this is the epoll "base", what it doesn't do, and what the one without "base" in its name adds to it.

@avikivity (Member Author):

Sure

 friend struct hrtimer_aio_completion;
+friend class reactor_backend_epoll_base;
 friend class reactor_backend_epoll;
+friend class reactor_backend_epoll_pure;
Contributor:

I think the word "pure" doesn't quite explain what this backend does. It isn't just a more "pure" use of epoll, it's actually entirely dependent on worker threads (#66). I wonder if it should be called something like epoll_thread or something. Or maybe just a comment is needed to explain what it is - whatever name we use for it.

@nyh mentioned this pull request Jan 19, 2026

@nyh (Contributor) commented Jan 19, 2026

Oh, and I think the name "pure-epoll" is a bit misleading, since a big part of how it works is those worker threads. Also, "pure epoll" sounds somehow better than impure epoll, while the new reactor backend will probably be the least efficient we have and shouldn't be used unless you must, right?
As I noted in #66, qemu for example calls this technique aio=threads, focusing on the thread aspect. I wonder if we shouldn't use the word "threads" in the name of our reactor as well. Maybe epoll-threads?
But it's your call. I don't feel I understand this well enough to definitely recommend any name.

@mykaul (Contributor) commented Jan 19, 2026

The new backend is meant for deterministic replay environments such as rr[1]

[1] ?

Add a way to call the write() syscall asynchronously, for file
descriptors that aren't sockets, like pipes. We already had read().
Pipes aren't filesystem files, and therefore don't support pwritev(),
which is what we use for filesystem files. It happened to work, likely
because linux-aio smoothed things over for us, but won't work with the
real pwrite(), which we'll use when linux-aio is not available.
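The pipe limitation is easy to demonstrate outside Seastar: plain POSIX
pwrite(2) fails with ESPIPE on unseekable descriptors.

    #include <unistd.h>
    #include <cassert>
    #include <cerrno>

    int main() {
        int fds[2];
        assert(::pipe(fds) == 0);
        char c = 'x';
        // pwrite() needs a seekable fd; on a pipe it fails with ESPIPE...
        assert(::pwrite(fds[1], &c, 1, 0) == -1 && errno == ESPIPE);
        // ...while plain write() works fine.
        assert(::write(fds[1], &c, 1) == 1);
        return 0;
    }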
@avikivity (Member Author):

The new backend is meant for deterministic replay environments such as rr[1]

[1] ?

Fixed.

@avikivity (Member Author):

v3:

  • removed bad assert in write_all()
  • added comments to the three epoll-related reactor backend classes

@avikivity (Member Author):

Oh, and I think the name "pure-epoll" is a bit misleading, since a big part of how it works is those worker threads. Also "pure epoll" sounds somehow better than unpure epoll, while the new reactor backend will probably be the least efficient we have and shouldn't be used unless you must, right? As I noted in #66, qemu for example calls this technique aio=threads, focusing on the thread aspect. I wonder if we shouldn't use the word "threads" in the name of our reactor as well. Maybe epoll-threads? But it's your call. I don't feel I understand this well enough to definitely recommend any name.

epoll isn't about how to issue I/O (we don't write to sockets with epoll, we use sendmsg). It's about the readiness/completion mechanism.

  • reactor_backend_epoll_pure uses only epoll for readiness/completion
  • reactor_backend_epoll uses epoll for non-disk readiness and linux-aio for disk issue and completions
  • reactor_backend_aio uses linux-aio for readiness/completion and disk issue, while issuing non-disk I/O via syscalls
  • reactor_backend_io_uring uses io_uring for readiness/completion and disk issue, while issuing non-disk I/O via syscalls
  • reactor_backend_tbd will use io_uring for everything, bypassing syscalls for almost everything

@nyh (Contributor) commented Jan 19, 2026

epoll isn't about how to issue I/O (we don't write to sockets with epoll, we use sendmsg). It's about the readiness/completion mechanism.

reactor_backend_epoll_pure uses only epoll for readiness/completion reactor_backend_epoll uses epoll for non-disk readiness and linux-aio for disk issue and completions reactor_backend_aio uses linux-aio for readiness/completion and disk issue, while issuing non-disk I/O via syscalls reactor_backend_io_uring uses io_uring for readiness/completion and disk issue, while issuing non-disk I/O via syscalls reactor_backend_tbd will use io_uring for everything, bypassing syscalls for almost everything

That's true. I guess I find it confusing that the backend is named after only the readiness part (and as you saw in the old "epoll" one, sometimes just half of the readiness part), while both readiness and issue are important to understand how a backend works. As you noted, in the future you'll want another one, "tbd", which will use io_uring for readiness but also for issue - what will you call this new backend without mentioning the issue?

I thought it makes sense to include both the names of the readiness mechanism and the issue mechanism in the backend's name, especially in this new backend's case where the issue mechanism is such a critical part of the implementation (and what makes it so inefficient and unrecommended except in specific scenarios where you need it).

But you're right, none of the existing backends mention their issue mechanism in their names, so we don't have to start now. But I do think we need clear documentation and/or comments on what these different backends are. I don't think that a year from now, it will be easy to guess what "epoll" is vs "epoll_pure", and which one to use when.

@avikivity (Member Author):

epoll isn't about how to issue I/O (we don't write to sockets with epoll, we use sendmsg). It's about the readiness/completion mechanism.
reactor_backend_epoll_pure uses only epoll for readiness/completion reactor_backend_epoll uses epoll for non-disk readiness and linux-aio for disk issue and completions reactor_backend_aio uses linux-aio for readiness/completion and disk issue, while issuing non-disk I/O via syscalls reactor_backend_io_uring uses io_uring for readiness/completion and disk issue, while issuing non-disk I/O via syscalls reactor_backend_tbd will use io_uring for everything, bypassing syscalls for almost everything

That's true. I guess I find it confusing that the backend is named after only the readiness part (and as you saw in the old "epoll" one, sometimes just half of the readiness part), while both readiness and issue are important to understand how a backend works. As you noted, in the future you'll want another one, "tbd", which will use io_uring for readiness but also for issue - what will you call this new backend without mentioning the issue?

It's really about what's special about the backend. pure epoll is special in that it doesn't use linux-aio. epoll (which is really mixed epoll/aio) is where things started, so it was named so to distinguish it from aio. aio introduced pure aio for socket readiness and disk issue. io_uring doesn't use io_uring for non-disk issue, but since there's only one io_uring backend so far, it wasn't necessary to distinguish it.

I thought it makes sense to include both the names of the readiness mechanism and the issue mechanism in the backend's name, especially in this new backend's case where the issue mechanism is such a critical part of the implementation (and what makes it so inefficient and unrecommended except in specific scenarios where you need it).

The names are all historical and made sense at a point in time when they were introduced. They are not systematic.

But you're right, none of the existing backends mention their issue mechanism in their names, so we don't have to start now. But I do think we need clear documentation and/or comments on what these different backends are. I don't think that a year from now, it will be easy to guess what "epoll" is vs "epoll_pure", and which one to use when.

The default should be fine.

@avikivity (Member Author):

Looks good. I've waited for this for ten years (#66) :-)

How did you test this patch, and how do we test all the different reactor backends going forward?

I set the backend in ~/.config/seastar and ran the test suite.

Can we run some set of tests for all the existing reactor backends, for example?

Seems a waste to do it for every run, most don't touch this area.

Do we do anything of this sort?

No.

    auto& req_writev = req.as<o::writev>();
    return ::pwritev(req_writev.fd, req_writev.iovec, req_writev.iov_len, req_writev.pos);
}
default:
Contributor:

This should step on fsync_io_desc's request submitted from reactor::fdatasync()

@avikivity (Member Author):

Ah. I guess I didn't see this because I have unsafe-bypass-fsync set in my config. I'll reproduce and fix.

@xemul (Contributor) commented Jan 20, 2026

reactor_backend_io_uring uses io_uring for readiness/completion and disk issue, while issuing non-disk I/O via syscalls

But that's not quite so. Both sendmsg/recvmsg in the uring backend only do syscalls for a socket if they see the speculation marks on the pollable fd; otherwise they issue uring CBs.

@avikivity (Member Author):

reactor_backend_io_uring uses io_uring for readiness/completion and disk issue, while issuing non-disk I/O via syscalls

But that's not quite so. Both sendmsg/recvmsg in the uring backend only do syscalls for a socket if they see the speculation marks on the pollable fd; otherwise they issue uring CBs.

Ah, forgot. So it's more nuanced.

The --aio-fsync switch tells the reactor whether to use linux-aio for fdatasync
operations. Its main use is to test support for kernels older than 4.18,
in which this feature was introduced.

However, it also takes effect for the io_uring backend, since it is handled
in common code rather than backend-specific code.

Fix that by transporting the flag all the way to aio_storage_context,
via a new reactor_backend_config struct, and always issuing fdatasync
via the backend. aio_storage_context is then responsible for using
the syscall thread if the switch is on (or if an older kernel is detected).

We lose the ability to see what's actually selected via --help-seastar,
but that's not very important.

Tested by running file_io_test with different reactor backend and
--aio-fsync combinations and looking at strace output.
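A minimal sketch of the plumbing described above (field and parameter names
are illustrative, not necessarily the patch's exact ones):

    // Carries backend-relevant switches from the command line down to the
    // chosen backend and, from there, to aio_storage_context.
    struct reactor_backend_config {
        bool aio_fsync = false;  // prefer linux-aio for fdatasync when supported
    };

The selector then passes it through on creation, as in the diff further
below: _backend = rbs.create(*this, backend_config);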
…mplementation

Make new class reactor_backend_epoll_base a pure epoll implementation
without any linux-aio support (or any storage support at all). This
helps deterministic replay debuggers such as `rr` work, since they don't
support aio.
Implement read/readv/write/writev by bouncing to the syscall thread.
This is slow, especially as there is only one syscall thread, but
the intent here is functionality testing in environments that don't
support aio, not performance.

Since the syscall thread submit function allocates, run it in a
critical section so that fstream_test succeeds.
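Illustrative shape of the bounce (run_on_syscall_thread is a hypothetical
stand-in for Seastar's actual syscall-thread submit path):

    future<ssize_t> bounce_write(int fd, const void* buf, size_t len, uint64_t pos) {
        // The submit path allocates, so guard it against allocation-failure
        // injection -- this is what makes fstream_test pass.
        memory::scoped_critical_alloc_section _;
        return run_on_syscall_thread([=] {
            return ::pwrite(fd, buf, len, pos);
        });
    }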
When we detect whether we have enough aio slots, we can also detect
that aio does not work at all and returns ENOSYS. This happens
in deterministic replay environments that don't support io_setup.
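A sketch of that probe using a libaio-style io_setup() (illustrative;
Seastar uses its own syscall wrappers):

    #include <libaio.h>
    #include <cerrno>

    bool aio_supported() {
        io_context_t ctx = nullptr;
        int r = ::io_setup(1, &ctx);  // libaio returns -errno on failure
        if (r == -ENOSYS) {
            return false;             // e.g. under rr, which doesn't emulate linux-aio
        }
        if (r == 0) {
            ::io_destroy(ctx);
        }
        return true;
    }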
In deterministic replay environments perf_event_open() works, but mmapping
the file descriptor fails. Don't abort; throw, and the system will
fall back to another method of backing the CPU stall detector.
@avikivity (Member Author):

v4:

  • new patch: reactor: rationalize --aio-fsync switch
  • implement fdatasync in pure-epoll backend

@travisdowns (Contributor) commented Jan 20, 2026

Is there work being done to have a deterministic mode for seastar? Does it already work in rr with this patch? We have done a bit of brainstorming on this on our end; perhaps we can swap notes.

@avikivity (Member Author) commented Jan 20, 2026

Is there work being done to have a deterministic mode for seastar?

No, but this gets us closer.

Does it already work in rr with this patch?

I managed to crash rr and immediately stopped looking in order not to get discouraged.

We have done a bit of brainstorming on this our end, perhaps we can swap notes.

If we swap notes, you will not have any notes.

But yes, there is potential here, if we manage to tame smp::submit_to().

 * needs the _signals._signal_handlers map to be initialized.
 */
-_backend = rbs.create(*this);
+_backend = rbs.create(*this, backend_config);
Contributor:

Backends (including aio_storage_context) already have access to reactor and its config, no need to add more configs

aio_storage_context::fdatasync_via_syscall_thread(int fd, kernel_completion* desc) {
    // complete_with below satisfies the original promise, so it is safe to
    // ignore the returned future.
    (void)std::invoke([] (reactor& r, int fd, kernel_completion* desc) -> future<> {
Contributor:

This ties aio_storage_context back to reactor even more tightly.

Can this instead be resolved at the reactor->backend level with the help of a virtual future<> reactor_backend::fdatasync(int fd) = 0 overridden by backends? The aio/epoll backends would call today's aio/thread-pool fdatasync, uring would just submit the CB, and epoll-pure would unconditionally call the thread-pool version.
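A sketch of the suggested shape (hypothetical; paraphrasing the comment, not
code from the patch):

    class reactor_backend {
    public:
        // ...
        virtual future<> fdatasync(int fd) = 0;
    };

    // aio/epoll backends: today's linux-aio or thread-pool fdatasync path
    // io_uring backend:   submit an IORING_OP_FSYNC CB (IORING_FSYNC_DATASYNC)
    // epoll-pure backend: unconditionally the thread-pool version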
