
Conversation

@avikivity (Member) commented Jan 19, 2026

This series adds a reactor backend that does not use linux-aio, even for
disk I/O. Instead, it uses the syscall thread for disk I/O.

The new backend is meant for deterministic replay environments such as rr[1]
that don't support aio APIs like linux-aio or io_uring. They emulate threading
by scheduling one thread at a time, so the limitations of the syscall thread aren't
material.

The series also improves tolerance to ENOSYS on such platforms.

[1] https://rr-project.org/

pollable_fd::write_some() actually translates to send(), which works
only on sockets, not all file descriptors. If we want to support
write() on non-sockets, we'll need methods that plumb down
to read/write. To avoid confusion, rename the current methods to recv/send
to make it clear what they are actually implemented with.
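For illustration, a rough sketch of the renamed surface (hypothetical
signatures, not the exact Seastar declarations):

    class pollable_fd {
    public:
        future<size_t> recv(void* buffer, size_t len);       // backed by recv(2); sockets only
        future<size_t> send(const void* buffer, size_t len); // backed by send(2); sockets only
        future<> send_all(const void* buffer, size_t len);   // loops over send() until done
        // write()-backed counterparts (e.g. write_all) would serve
        // non-socket fds such as pipes and eventfd.
    };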
Copilot AI left a comment

Pull request overview

This PR adds a pure-epoll reactor backend for Seastar that doesn't use linux-aio, enabling support for deterministic replay environments like rr that don't support AIO APIs. The implementation uses the syscall thread for disk I/O instead of linux-aio or io_uring, and improves tolerance to ENOSYS errors on platforms that don't support these syscalls.

Changes:

  • Introduced reactor_backend_epoll_base as a base class by refactoring common functionality from reactor_backend_epoll
  • Added reactor_backend_epoll_pure backend that uses syscall threads for disk I/O
  • Separated write_* methods (for regular file descriptors) from send_* methods (for network sockets) in the pollable_fd API for better semantic clarity

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Summary per file:

  • src/core/reactor_backend.hh: Introduces reactor_backend_epoll_base base class and reactor_backend_epoll_pure derived class; refactors class hierarchy
  • src/core/reactor_backend.cc: Implements pure-epoll backend with syscall thread I/O, refactors epoll backend to use base class, adds ENOSYS error tolerance
  • src/core/reactor.cc: Adds write_all/do_write/do_writev methods for regular file descriptors, renames socket methods to send/recv, improves error handling
  • include/seastar/core/reactor.hh: Adds friend declarations for new backend classes and method declarations for write operations
  • include/seastar/core/internal/pollable_fd.hh: Separates write_* API (file descriptors) from send_* API (network sockets), renames msghdr methods
  • src/util/process.cc: Updates pipe implementation to use pollable_fd and new write_all API instead of io_queue
  • src/net/posix-stack.cc: Updates network code to use renamed send_all/recv/send methods for network operations


auto r = _r._io_sink.drain([this] (const internal::io_request& req, io_completion* completion) {
    memory::scoped_critical_alloc_section _;
    using o = internal::io_request::operation;
    // The returned future will be used to satify the promise in io_completion,
Copilot AI commented Jan 19, 2026

Typo in comment: "satify" should be "satisfy".

Suggested change
// The returned future will be used to satify the promise in io_completion,
// The returned future will be used to satisfy the promise in io_completion,

@avikivity (Member Author):

v2: fixed typo in comment pointed out by @copilot

@nyh (Contributor) commented Jan 19, 2026

Ten years ago (!) I opened an issue asking for this - #66.
After six years, I gave up on the idea and closed the issue.
I guess as they say, better late than never!

@nyh (Contributor) left a comment

Looks good. I've waited for this for ten years (#66) :-)

How did you test this patch, and how do we test all the different reactor backends going forward? Can we run some set of tests for all the existing reactor backends, for example? Do we do anything of this sort?

reactor::do_write(pollable_fd_state& fd, const void* buffer, size_t len) {
    return writeable(fd).then([this, &fd, buffer, len] () mutable {
        auto r = fd.fd.write(buffer, len);
        if (!r) {
Contributor:

When do you expect write() to return 0, and not an error? Why is calling the function again immediately the right thing to do?

@avikivity (Member Author):

When do you expect write() to return 0, and not an error?

The kernel buffer is full, and writeable() returned a ready future due to bad speculation.

Why is calling the function again immediately the right thing to do?

writeable() will return a future that is ready when the buffer is not full.
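Putting the two answers together, the retry shape under discussion looks
roughly like this (a sketch, not the exact patch):

    future<size_t>
    reactor::do_write(pollable_fd_state& fd, const void* buffer, size_t len) {
        return writeable(fd).then([this, &fd, buffer, len] () mutable {
            auto r = fd.fd.write(buffer, len);
            if (!r) {
                // Bad speculation: writeable() was ready, but the kernel
                // buffer was still full. Recurse; this time writeable()
                // yields a future that resolves only when the fd is
                // really writable.
                return do_write(fd, buffer, len);
            }
            return make_ready_future<size_t>(*r);
        });
    }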

Contributor:

Maybe I'm misunderstanding what I'm seeing here. This is a blocking write to the filesystem, not a socket write, right? So why would the kernel return 0 bytes instead of waiting until it can write at least some bytes? Did you see this behavior in practice, or documented somewhere, or is it just defensive programming in case it ever happens?

@avikivity (Member Author):

It is a non-blocking write to a non-disk, non-socket file descriptor (pollable_fd -> non-disk).

The non-socket pollable fds we support are eventfd (for inter-thread communication) and pipes (for communicating with subprocesses).


future<>
reactor::write_all(pollable_fd_state& fd, const void* buffer, size_t len) {
    SEASTAR_ASSERT(len);
Contributor:

Why is this assert necessary? Why is writing 0 bytes not a perfectly valid thing to do (it will write nothing)?
I see below that for len==0, you won't even call this function. So this assert will never trigger. But still, why even bother with it? It's still a useless if(). It seems to me that write_all_part will work fine even if given len=0. Or am I missing something?

@avikivity (Member Author):

It's not necessary.

}
return make_ready_future();
});
return _fd.write_all(buf.get(), buf_size).then([buf = std::move(buf)] {});
Contributor:

I'm not experienced enough with how this works, so I want to ask:
When aio is used, does the new code continue to do the same thing it did in the past (using aio)?

@avikivity (Member Author):

No, it will use write(2). Similar to how socket writes use send*(2).

Contributor:

I don't understand. If we use the old "epoll" backend which uses (right?) classic AIO, did something change? Before it didn't use write()? What did it use?

@avikivity (Member Author):

This patch doesn't change anything. It adds some methods.

The following patch changes pipe writes from using our disk write mechanism (which happened to work with linux-aio but won't with pwrite()) to these new methods.

The trigger is that epoll_pure will issue disk writes with pwrite(). These happen not to work with pipes.

 // (such as timers, signals, inter-thread notifications) into file descriptors
 // using mechanisms like timerfd, signalfd and eventfd respectively.
-class reactor_backend_epoll : public reactor_backend {
+class reactor_backend_epoll_base : public reactor_backend {
Contributor:

I think it will be useful to explain in the comment why this is the epoll "base", what it doesn't do, and what the one without "base" in its name adds to it.

@avikivity (Member Author):

Sure

 friend struct hrtimer_aio_completion;
+friend class reactor_backend_epoll_base;
 friend class reactor_backend_epoll;
+friend class reactor_backend_epoll_pure;
Contributor:

I think the word "pure" doesn't quite explain what this backend does. It isn't just a more "pure" use of epoll, it's actually entirely dependent on worker threads (#66). I wonder if it should be called something like epoll_thread or something. Or maybe just a comment is needed to explain what it is - whatever name we use for it.

@nyh mentioned this pull request Jan 19, 2026

@nyh (Contributor) commented Jan 19, 2026

Oh, and I think the name "pure-epoll" is a bit misleading, since a big part of how it works is those worker threads. Also, "pure epoll" sounds somehow better than impure epoll, while the new reactor backend will probably be the least efficient we have and shouldn't be used unless you must, right?
As I noted in #66, qemu for example calls this technique aio=threads, focusing on the thread aspect. I wonder if we shouldn't use the word "threads" in the name of our reactor as well. Maybe epoll-threads?
But it's your call. I don't feel I understand this well enough to definitely recommend any name.

@mykaul (Contributor) commented Jan 19, 2026

The new backend is meant for deterministic replay environments such as rr[1]

[1] ?

Add a way to call the write() syscall asynchronously, for file
descriptors that aren't sockets, like pipes. We already had read().
Pipes aren't filesystem files, and therefore don't support pwritev(),
which is what we use for filesystem files. It happened to work, likely
because linux-aio smoothed things over for us, but won't work with the
real pwrite(), which we'll use when linux-aio is not available.
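The pipe limitation is easy to demonstrate outside Seastar: plain POSIX
pwrite(2) fails with ESPIPE on unseekable descriptors.

    #include <unistd.h>
    #include <cassert>
    #include <cerrno>

    int main() {
        int fds[2];
        assert(::pipe(fds) == 0);
        char c = 'x';
        // pwrite() needs a seekable fd; on a pipe it fails with ESPIPE...
        assert(::pwrite(fds[1], &c, 1, 0) == -1 && errno == ESPIPE);
        // ...while plain write() works fine.
        assert(::write(fds[1], &c, 1) == 1);
        return 0;
    }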
@avikivity (Member Author):

The new backend is meant for deterministic replay environments such as rr[1]

[1] ?

Fixed.

@avikivity (Member Author):

v3:

  • removed bad assert in write_all()
  • added comments to the three epoll-related reactor backend classes

@avikivity (Member Author):

Oh, and I think the name "pure-epoll" is a bit misleading, since a big part of how it works is those worker threads. Also "pure epoll" sounds somehow better than unpure epoll, while the new reactor backend will probably be the least efficient we have and shouldn't be used unless you must, right? As I noted in #66, qemu for example calls this technique aio=threads, focusing on the thread aspect. I wonder if we shouldn't use the word "threads" in the name of our reactor as well. Maybe epoll-threads? But it's your call. I don't feel I understand this well enough to definitely recommend any name.

epoll isn't about how to issue I/O (we don't write to sockets with epoll, we use sendmsg). It's about the readiness/completion mechanism.

  • reactor_backend_epoll_pure uses only epoll for readiness/completion
  • reactor_backend_epoll uses epoll for non-disk readiness and linux-aio for disk issue and completions
  • reactor_backend_aio uses linux-aio for readiness/completion and disk issue, while issuing non-disk I/O via syscalls
  • reactor_backend_io_uring uses io_uring for readiness/completion and disk issue, while issuing non-disk I/O via syscalls
  • reactor_backend_tbd will use io_uring for everything, bypassing syscalls for almost everything

@nyh (Contributor) commented Jan 19, 2026

epoll isn't about how to issue I/O (we don't write to sockets with epoll, we use sendmsg). It's about the readiness/completion mechanism.

reactor_backend_epoll_pure uses only epoll for readiness/completion reactor_backend_epoll uses epoll for non-disk readiness and linux-aio for disk issue and completions reactor_backend_aio uses linux-aio for readiness/completion and disk issue, while issuing non-disk I/O via syscalls reactor_backend_io_uring uses io_uring for readiness/completion and disk issue, while issuing non-disk I/O via syscalls reactor_backend_tbd will use io_uring for everything, bypassing syscalls for almost everything

That's true. I guess I find it confusing that the backend is named after only the readiness part (and as you saw in the old "epoll" one, sometimes just half of the readiness part), while both readiness and issue are important to understand how a backend works. As you noted, in the future you'll want another one, "tbd", which will use io_uring for readiness but also for issue - what will you call this new backend without mentioning the issue?

I thought it makes sense to include both the names of the readiness mechanism and the issue mechanism in the backend's name, especially in this new backend's case where the issue mechanism is such a critical part of the implementation (and what makes it so inefficient and unrecommended except in specific scenarios where you need it).

But you're right, none of the existing backends mention their issue mechanism in their names, so we don't have to start now. But I do think we need clear documentation and/or comments on what these different backends are. I don't think that a year from now, it will be easy to guess what "epoll" is vs "epoll_pure", and which one to use when.

@avikivity (Member Author):

epoll isn't about how to issue I/O (we don't write to sockets with epoll, we use sendmsg). It's about the readiness/completion mechanism.
reactor_backend_epoll_pure uses only epoll for readiness/completion reactor_backend_epoll uses epoll for non-disk readiness and linux-aio for disk issue and completions reactor_backend_aio uses linux-aio for readiness/completion and disk issue, while issuing non-disk I/O via syscalls reactor_backend_io_uring uses io_uring for readiness/completion and disk issue, while issuing non-disk I/O via syscalls reactor_backend_tbd will use io_uring for everything, bypassing syscalls for almost everything

That's true. I guess I find it confusing that the backend is named after only the readiness part (and as you saw in the old "epoll" one, sometimes just half of the readiness part), while both readiness and issue are important to understand how a backend works. As you noted, in the future you'll want another one, "tbd", which will use io_uring for readiness but also for issue - what will you call this new backend without mentioning the issue?

It's really about what's special about the backend. pure epoll is special in that it doesn't use linux-aio. epoll (which is really mixed epoll/aio) is where things started, so it was named so to distinguish it from aio. aio introduced pure aio for socket readiness and disk issue. io_uring doesn't use io_uring for non-disk issue, but since there's only one io_uring backend so far, it wasn't necessary to distinguish it.

I thought it makes sense to include both the names of the readiness mechanism and the issue mechanism in the backend's name, especially in this new backend's case where the issue mechanism is such a critical part of the implementation (and what makes it so inefficient and unrecommended except in specific scenarios where you need it).

The names are all historical and made sense at a point in time when they were introduced. They are not systematic.

But you're right, none of the existing backends mention their issue mechanism in their names, so we don't have to start now. But I do think we need clear documentation and/or comments on what these different backends are. I don't think that a year from now, it will be easy to guess what "epoll" is vs "epoll_pure", and which one to use when.

The default should be fine.

@avikivity (Member Author):

Looks good. I've waited for this for ten years (#66) :-)

How did you test this patch, and how do we test all the different reactor backends going forward?

I set the backend in ~/.config/seastar and ran the test suite.

Can we run some set of tests for all the existing reactor backends, for example?

Seems a waste to do it for every run, most don't touch this area.

Do we do anything of this sort?

No.

    auto& req_writev = req.as<o::writev>();
    return ::pwritev(req_writev.fd, req_writev.iovec, req_writev.iov_len, req_writev.pos);
}
default:
Contributor:

This should step on fsync_io_desc's request submitted from reactor::fdatasync()

@avikivity (Member Author):

Ah. I guess I didn't see this because I have unsafe-bypass-fsync set in my config. I'll reproduce and fix.

@xemul (Contributor) commented Jan 20, 2026

reactor_backend_io_uring uses io_uring for readiness/completion and disk issue, while issuing non-disk I/O via syscalls

But that's not quite so. Both sendmsg/recvmsg in the uring backend only do syscalls for a socket if they see the speculation marks on the pollable fd; otherwise they issue uring CBs.

@avikivity (Member Author):

reactor_backend_io_uring uses io_uring for readiness/completion and disk issue, while issuing non-disk I/O via syscalls

But that's not quite so. Both sendmsg/recvmsg in the uring backend only do syscalls for a socket if they see the speculation marks on the pollable fd; otherwise they issue uring CBs.

Ah, forgot. So it's more nuanced.

The --aio-fsync switch tells the reactor whether to use linux-aio for fdatasync
operations. Its main use is to test support for kernels older than 4.18,
in which this feature was introduced.

However, it also takes effect for the io_uring backend, since it is handled
in common code rather than backend-specific code.

Fix that by transporting the flag all the way to aio_storage_context,
via a new reactor_backend_config struct, and always issuing fdatasync
via the backend. aio_storage_context is then responsible for using
the syscall thread if the switch is on (or if an older kernel is detected).

We lose the ability to see what's actually selected via --help-seastar,
but that's not very important.

Tested by running file_io_test with different reactor backend and
--aio-fsync combinations and looking at strace output.
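A minimal sketch of the plumbing described above (field and parameter names
are illustrative, not necessarily the patch's exact ones):

    // Carries backend-relevant switches from the command line down to the
    // chosen backend and, from there, to aio_storage_context.
    struct reactor_backend_config {
        bool aio_fsync = false;  // prefer linux-aio for fdatasync when supported
    };

The selector then passes it through on creation, as in the diff further
below: _backend = rbs.create(*this, backend_config);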
…mplementation

Make new class reactor_backend_epoll_base a pure epoll implementation
without any linux-aio support (or any storage support at all). This
helps deterministic replay debuggers such as `rr` work, since they don't
support aio.
Implement read/readv/write/writev by bouncing to the syscall thread.
This is slow, especially as there is only one syscall thread, but
the intent here is functionality testing in environments that don't
support aio, not performance.

Since the syscall thread submit function allocates, run it in a
critical section so that fstream_test succeeds.
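Illustrative shape of the bounce (run_on_syscall_thread is a hypothetical
stand-in for Seastar's actual syscall-thread submit path):

    future<ssize_t> bounce_write(int fd, const void* buf, size_t len, uint64_t pos) {
        // The submit path allocates, so guard it against allocation-failure
        // injection -- this is what makes fstream_test pass.
        memory::scoped_critical_alloc_section _;
        return run_on_syscall_thread([=] {
            return ::pwrite(fd, buf, len, pos);
        });
    }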
When we detect whether we have enough aio slots, we can also detect
that aio does not work at all and returns ENOSYS. This happens
in deterministic replay environments that don't support io_setup.
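A sketch of that probe using a libaio-style io_setup() (illustrative;
Seastar uses its own syscall wrappers):

    #include <libaio.h>
    #include <cerrno>

    bool aio_supported() {
        io_context_t ctx = nullptr;
        int r = ::io_setup(1, &ctx);  // libaio returns -errno on failure
        if (r == -ENOSYS) {
            return false;             // e.g. under rr, which doesn't emulate linux-aio
        }
        if (r == 0) {
            ::io_destroy(ctx);
        }
        return true;
    }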
In deterministic replay environments perf_event_open() works, but mmapping
the file descriptor fails. Don't abort; throw, and the system will
fall back to another method of backing the CPU stall detector.
@avikivity (Member Author):

v4:

  • new patch: reactor: rationalize --aio-fsync switch
  • implement fdatasync in pure-epoll backend

@travisdowns (Contributor) commented Jan 20, 2026

Is there work being done to have a deterministic mode for seastar? Does it already work in rr with this patch? We have done a bit of brainstorming on this on our end; perhaps we can swap notes.

@avikivity (Member Author) commented Jan 20, 2026

Is there work being done to have a deterministic mode for seastar?

No, but this gets us closer.

Does it already work in rr with this patch?

I managed to crash rr and immediately stopped looking in order not to get discouraged.

We have done a bit of brainstorming on this our end, perhaps we can swap notes.

If we swap notes, you will not have any notes.

But yes, there is potential here, if we manage to tame smp::submit_to().

 * needs the _signals._signal_handlers map to be initialized.
 */
-_backend = rbs.create(*this);
+_backend = rbs.create(*this, backend_config);
Contributor:

Backends (including aio_storage_context) already have access to reactor and its config, no need to add more configs

aio_storage_context::fdatasync_via_syscall_thread(int fd, kernel_completion* desc) {
    // complete_with below satisfies the original promise, so it is safe to
    // ignore the returned future.
    (void)std::invoke([] (reactor& r, int fd, kernel_completion* desc) -> future<> {
Contributor:

This ties aio_storage_context back to reactor even more tightly.

Can this instead be resolved at the reactor->backend level with the help of a virtual future<> reactor_backend::fdatasync(int fd) = 0 overridden by backends? The aio/epoll backends would call today's aio/thread-pool fdatasync, uring would just submit the CB, and epoll-pure would unconditionally call the thread-pool version.
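A sketch of the suggested shape (hypothetical; paraphrasing the comment, not
code from the patch):

    class reactor_backend {
    public:
        // ...
        virtual future<> fdatasync(int fd) = 0;
    };

    // aio/epoll backends: today's linux-aio or thread-pool fdatasync path
    // io_uring backend:   submit an IORING_OP_FSYNC CB (IORING_FSYNC_DATASYNC)
    // epoll-pure backend: unconditionally the thread-pool version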
