reactor: add pure-epoll backend #3208
base: master
Conversation
pollable_fd::write_some() actually translates to send(), which works only on sockets, not all file descriptors. If we want to support write() on non-sockets, we'll need methods that plumb down to read/write. To avoid confusion, rename the current methods to recv/send to make it clear what they are actually implemented with.
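For orientation, here is a rough sketch of the API split this describes. The declarations below are illustrative only; the exact names and signatures in include/seastar/core/internal/pollable_fd.hh may differ.

```cpp
// Illustrative sketch, not the actual Seastar declarations.
class pollable_fd {
public:
    // Socket-only operations, implemented with recv(2)/send(2)/sendmsg(2):
    future<size_t> recv_some(void* buffer, size_t size);
    future<size_t> send_some(const void* buffer, size_t size);
    future<> send_all(const void* buffer, size_t size);

    // Generic file-descriptor operations, implemented with read(2)/write(2),
    // usable on pipes, eventfds and other non-socket pollable fds:
    future<size_t> read_some(char* buffer, size_t size);
    future<> write_all(const void* buffer, size_t size);
};
```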
Pull request overview
This PR adds a pure-epoll reactor backend for Seastar that doesn't use linux-aio, enabling support for deterministic replay environments like rr that don't support AIO APIs. The implementation uses the syscall thread for disk I/O instead of linux-aio or io_uring, and improves tolerance to ENOSYS errors on platforms that don't support these syscalls.
Changes:
- Introduced reactor_backend_epoll_base as a base class by refactoring common functionality from reactor_backend_epoll (see the hierarchy sketch below)
- Added reactor_backend_epoll_pure backend that uses syscall threads for disk I/O
- Separated write_* methods (for regular file descriptors) from send_* methods (for network sockets) in the pollable_fd API for better semantic clarity
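A rough sketch of the resulting class hierarchy, as described above and in the file table below; the member comments are placeholders, not the actual declarations in src/core/reactor_backend.hh.

```cpp
// Illustrative hierarchy only; actual members and virtuals differ.
class reactor_backend_epoll_base : public reactor_backend {
    // Shared epoll machinery: the epoll fd, readiness waiting, fd registration.
};

// Existing backend: epoll for readiness plus linux-aio for disk I/O.
class reactor_backend_epoll : public reactor_backend_epoll_base { /* ... */ };

// New backend: epoll only; disk I/O is bounced to the syscall thread, so it
// works where linux-aio and io_uring are unavailable (e.g. under rr).
class reactor_backend_epoll_pure : public reactor_backend_epoll_base { /* ... */ };
```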
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/core/reactor_backend.hh | Introduces reactor_backend_epoll_base base class and reactor_backend_epoll_pure derived class; refactors class hierarchy |
| src/core/reactor_backend.cc | Implements pure-epoll backend with syscall thread I/O, refactors epoll backend to use base class, adds ENOSYS error tolerance |
| src/core/reactor.cc | Adds write_all/do_write/do_writev methods for regular file descriptors, renames socket methods to send/recv, improves error handling |
| include/seastar/core/reactor.hh | Adds friend declarations for new backend classes and method declarations for write operations |
| include/seastar/core/internal/pollable_fd.hh | Separates write_* API (file descriptors) from send_* API (network sockets), renames msghdr methods |
| src/util/process.cc | Updates pipe implementation to use pollable_fd and new write_all API instead of io_queue |
| src/net/posix-stack.cc | Updates network code to use renamed send_all/recv/send methods for network operations |
src/core/reactor_backend.cc
Outdated
auto r = _r._io_sink.drain([this] (const internal::io_request& req, io_completion* completion) {
    memory::scoped_critical_alloc_section _;
    using o = internal::io_request::operation;
    // The returned future will be used to satify the promise in io_completion,
Copilot AI (Jan 19, 2026)
Typo in comment: "satify" should be "satisfy".
-// The returned future will be used to satify the promise in io_completion,
+// The returned future will be used to satisfy the promise in io_completion,
Force-pushed e2c69d3 to 8a0a5f4
v2: fixed typo in comment pointed out by @copilot
Ten years ago (!) I opened an issue asking for this - #66.
nyh left a comment:
Looks good. I've waited for this for ten years (#66) :-)
How did you test this patch, and how do we test all the different reactor backends going forward? Can we run some set of tests for all the existing reactor backends, for example? Do we do anything of this sort?
reactor::do_write(pollable_fd_state& fd, const void* buffer, size_t len) {
    return writeable(fd).then([this, &fd, buffer, len] () mutable {
        auto r = fd.fd.write(buffer, len);
        if (!r) {
When do you expect write() to return 0, and not an error? Why is calling the function again immediately the right thing to do?
When do you expect write() to return 0, and not an error?
The kernel buffer is full. writeable() returned a ready future due to bad speculation.
Why is calling the function again immediately the right thing to do?
writeable() will return a future that is ready when the buffer is not full.
Maybe I'm misunderstanding what I'm seeing here. This is a blocking write to the filesystem, not a socket write, right? So why would the kernel return 0 bytes instead of waiting until it can write at least some bytes? Did you see this behavior in practice, or is it documented somewhere, or is it just defensive programming in case it ever happens?
It is a non-blocking write to a non-disk, non-socket file descriptor (pollable_fd -> non-disk).
The non-socket pollable fds we support are eventfd (for inter-thread communication) and pipes (for communicating with subprocesses).
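To make the retry pattern under discussion concrete, here is a minimal sketch. It assumes, as the reply above says, that fd.fd.write() performs a non-blocking write(2) and returns an empty result when the kernel buffer is still full; it is not necessarily the exact code in reactor.cc.

```cpp
// Sketch only: assumes write() returns an empty optional when the write would block.
future<size_t> reactor::do_write(pollable_fd_state& fd, const void* buffer, size_t len) {
    return writeable(fd).then([this, &fd, buffer, len] {
        auto r = fd.fd.write(buffer, len);      // non-blocking write(2)
        if (!r) {
            // The writeability speculation was stale (buffer still full):
            // wait for the fd to become writeable again, then retry.
            return do_write(fd, buffer, len);
        }
        return make_ready_future<size_t>(*r);
    });
}
```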
src/core/reactor.cc
Outdated
future<>
reactor::write_all(pollable_fd_state& fd, const void* buffer, size_t len) {
    SEASTAR_ASSERT(len);
Why is this assert necessary? Why is writing 0 bytes not a perfectly valid thing to do (it will write nothing)?
I see below that for len==0, you won't even call this function. So this assert will never trigger. But, still, why even bother with it - it's still a useless if(). It seems to me that write_all_part will work fine even if given len=0. Or am I missing something?
It's not necessary.
    }
    return make_ready_future();
});
return _fd.write_all(buf.get(), buf_size).then([buf = std::move(buf)] {});
I'm not experienced enough with how this works, so I want to ask:
When aio is used, does the new code continue to do the same thing it did in the past (using aio)?
No, it will use write(2). Similar to how socket writes use send*(2).
I don't understand. If we use the old "epoll" backend which uses (right?) classic AIO, did something change? Before it didn't use write()? What did it use?
This patch doesn't change anything. It adds some methods.
The following patch changes pipe writes from using our disk write mechanism (which happened to work with linux-aio but won't with pwrite()) to these new methods.
The trigger is that epoll_pure will issue disk writes with pwrite(). These happen not to work with pipes.
 // (such as timers, signals, inter-thread notifications) into file descriptors
 // using mechanisms like timerfd, signalfd and eventfd respectively.
-class reactor_backend_epoll : public reactor_backend {
+class reactor_backend_epoll_base : public reactor_backend {
I think it will be useful to explain in the comment why this is the epoll "base", what it doesn't do, and what the one without "base" in its name adds to it.
Sure
friend struct hrtimer_aio_completion;
friend class reactor_backend_epoll_base;
friend class reactor_backend_epoll;
friend class reactor_backend_epoll_pure;
I think the word "pure" doesn't quite explain what this backend does. It isn't just a more "pure" use of epoll, it's actually entirely dependent on worker threads (#66). I wonder if it should be called something like epoll_thread or something. Or maybe just a comment is needed to explain what it is - whatever name we use for it.
Oh, and I think the name "pure-epoll" is a bit misleading, since a big part of how it works is those worker threads. Also, "pure epoll" sounds somehow better than impure epoll, while the new reactor backend will probably be the least efficient we have and shouldn't be used unless you must, right?
[1] ?
Add a way to call the write() syscall asynchronously, for file descriptors that aren't sockets, like pipes. We already had read().
Pipes aren't filesystem files, and therefore don't support pwritev(), which is what we use for filesystem files. It happened to work, likely because linux-aio smoothed things over for us, but won't work with the real pwrite(), which we'll use when linux-aio is not available.
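A small standalone illustration of why the disk path can't be reused for pipes (not Seastar code): positional I/O on a pipe fails with ESPIPE, while a plain write(2) works.

```cpp
#include <sys/uio.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>

int main() {
    int fds[2];
    assert(::pipe(fds) == 0);
    char c = 'x';
    iovec iov{&c, 1};
    // pwritev() needs a seekable fd; on a pipe it fails with ESPIPE.
    assert(::pwritev(fds[1], &iov, 1, 0) == -1 && errno == ESPIPE);
    // A plain write(2) on the same pipe succeeds.
    assert(::write(fds[1], &c, 1) == 1);
    return 0;
}
```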
Force-pushed 8a0a5f4 to 50e475d
Fixed.
v3:
epoll isn't about how to issue I/O (we don't write to sockets with epoll, we use sendmsg). It's about the readiness/completion mechanism. reactor_backend_epoll_pure uses only epoll for readiness/completion.
That's true. I guess I find it confusing that the backend is named after only the readiness part (and, as you saw in the old "epoll" one, sometimes just half of the readiness part), while both readiness and issue are important to understand how a backend works. As you noted, in the future you'll want another one "tbd", which will use io_uring for readiness but also for issue - what will you call this new backend without mentioning the issue mechanism? I thought it makes sense to include both the readiness mechanism and the issue mechanism in the backend's name, especially in this new backend's case where the issue mechanism is such a critical part of the implementation (and what makes it so inefficient and unrecommended except in specific scenarios where you need it). But you're right, none of the existing backends mention their issue mechanism in their names, so we don't have to start now. But I do think we need clear documentation and/or comments on what these different backends are. I don't think that a year from now, it will be easy to guess what "epoll" is vs "epoll_pure", and which one to use when.
It's really about what's special about the backend. pure epoll is special in that it doesn't use linux-aio. epoll (which is really mixed epoll/aio) is where things started, so it was named so to distinguish it from aio. aio introduced pure aio for socket readiness and disk issue. io_uring doesn't use io_uring for non-disk issue, but since there's only one io_uring backend so far, it wasn't necessary to distinguish it.
The names are all historical and made sense at a point in time when they were introduced. They are not systematic.
The default should be fine.
I set the backend in ~/.config/seastar and ran the test suite.
It seems a waste to do it for every run; most tests don't touch this area.
No.
    auto& req_writev = req.as<o::writev>();
    return ::pwritev(req_writev.fd, req_writev.iovec, req_writev.iov_len, req_writev.pos);
}
default:
This should step on fsync_io_desc's request submitted from reactor::fdatasync()
Ah. I guess I didn't see this because I have unsafe-bypass-fsync set in my config. I'll reproduce and fix.
But that's not quite so. Both sendmsg/recvmsg of the uring backend only do syscalls for sockets if they see the speculation marks on the pollable fd; otherwise they issue uring CBs.
Ah, forgot. So it's more nuanced.
The --aio-fsync flag tells the reactor whether to use linux-aio for fdatasync operations. Its main use is to test support for kernels older than 4.18, when this feature was introduced. However, it also takes effect for the io_uring backend, since it is handled in common code rather than backend-specific code. Fix that by transporting the flag all the way to aio_storage_context, via a new reactor_backend_config struct, and always issuing fdatasync via the backend. aio_storage_context is then responsible for using the syscall thread if the switch is on (or if an older kernel is detected). We lose the ability to see what's actually selected via --help-seastar, but that's not very important. Tested by running file_io_test with different reactor backend and --aio-fsync combinations and looking at strace output.
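An illustrative sketch of the plumbing this commit message describes; the struct name comes from the message above, but the field name and default are assumptions.

```cpp
// Assumed shape; the real struct may carry other settings too.
struct reactor_backend_config {
    // Whether fdatasync may be issued through linux-aio. When false (or on
    // kernels older than 4.18), aio_storage_context uses the syscall thread.
    bool aio_fsync = false;
};
```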
…mplementation Make the new class reactor_backend_epoll_base a pure epoll implementation without any linux-aio support (or any storage support at all). This helps deterministic replay debuggers such as `rr` work, since they don't support aio.
Implement read/readv/write/writev by bouncing to the syscall thread. This is slow, especially as there is only one syscall thread, but the intent here is functionality testing in environments that don't support aio, not performance. Since the syscall thread submit function allocates, run it in a critical section so that fstream_test succeeds.
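A minimal sketch of what "bouncing to the syscall thread" means here. In Seastar this would return a seastar::future and go through the reactor's syscall thread; std::async below is only a stand-in for illustration, and the function name is hypothetical.

```cpp
#include <sys/types.h>
#include <unistd.h>
#include <future>   // std::async stands in for the single syscall thread here

// Hedged sketch, not Seastar's actual thread-pool API.
std::future<ssize_t> pure_epoll_pwrite(int fd, const void* buf, size_t len, off_t pos) {
    return std::async(std::launch::async, [=] () -> ssize_t {
        // The blocking pwrite(2) runs off the reactor thread; slow (one request
        // at a time), but functional where linux-aio and io_uring are unavailable.
        return ::pwrite(fd, buf, len, pos);
    });
}
```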
While detecting whether we have enough aio slots, we can also detect that aio does not work at all and returns ENOSYS. This happens in deterministic replay environments that don't support io_setup.
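Roughly what that detection looks like at the syscall level; this is a standalone sketch, not the code in this series.

```cpp
#include <linux/aio_abi.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cerrno>

enum class aio_probe { ok, too_few_slots, unsupported };

// Probe whether linux-aio works at all, distinguishing "not enough slots"
// (EAGAIN, /proc/sys/fs/aio-max-nr exhausted) from "not implemented" (ENOSYS,
// e.g. under deterministic replay tools that don't support io_setup).
aio_probe probe_linux_aio(unsigned nr_events) {
    aio_context_t ctx = 0;
    if (::syscall(__NR_io_setup, nr_events, &ctx) == 0) {
        ::syscall(__NR_io_destroy, ctx);
        return aio_probe::ok;
    }
    if (errno == ENOSYS) {
        return aio_probe::unsupported;
    }
    return aio_probe::too_few_slots;
}
```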
In deterministic environments perf_event_open works, but mmapping the file descriptor fails. Don't abort; throw, and the system will fall back to another method of backing the CPU stall detector.
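A hedged sketch of the "throw instead of abort" idea; standalone, not the actual CPU stall detector code, and the function name is hypothetical.

```cpp
#include <sys/mman.h>
#include <cerrno>
#include <system_error>

// Map the ring buffer of an already-opened perf_event fd. If mmap() fails
// (as it can under deterministic replay environments even though
// perf_event_open() succeeded), throw instead of aborting so the caller can
// fall back to another way of backing the CPU stall detector.
void* map_perf_event_ring(int perf_fd, size_t len) {
    void* p = ::mmap(nullptr, len, PROT_READ | PROT_WRITE, MAP_SHARED, perf_fd, 0);
    if (p == MAP_FAILED) {
        throw std::system_error(errno, std::system_category(),
                                "failed to mmap perf_event file descriptor");
    }
    return p;
}
```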
Force-pushed 50e475d to d461019
v4:
Is there work being done to have a deterministic mode for seastar? Does it already work in
No, but this gets us closer.
I managed to crash
If we swap notes, you will not have any notes. But yes, there is potential here, if we manage to tame smp::submit_to().
 * needs the _signals._signal_handlers map to be initialized.
 */
-_backend = rbs.create(*this);
+_backend = rbs.create(*this, backend_config);
Backends (including aio_storage_context) already have access to reactor and its config, no need to add more configs
aio_storage_context::fdatasync_via_syscall_thread(int fd, kernel_completion* desc) {
    // complete_with below satisfies the original promise, so it is safe to
    // ignore the returned future.
    (void)std::invoke([] (reactor& r, int fd, kernel_completion* desc) -> future<> {
This ties aio_storage_context back to reactor even tighter
Could this instead be resolved at the reactor->backend level, with the help of a virtual future<> reactor_backend::fdatasync(int fd) = 0 overridden by backends? AIO/epoll backends would call today's aio/thread-pool fdatasync, uring would just submit the CB, and epoll-pure would unconditionally call the thread-pool version.
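Hypothetical declarations for that suggestion; this is not code from the PR, only a sketch of the proposed interface.

```cpp
// Sketch of the proposed interface; names beyond reactor_backend::fdatasync
// are illustrative.
class reactor_backend {
public:
    virtual ~reactor_backend() = default;
    // Each backend chooses its own fdatasync mechanism:
    //  - aio/epoll backends: today's linux-aio or thread-pool path
    //  - io_uring backend: submit an fsync CB with the datasync flag set
    //  - epoll-pure: unconditionally bounce fdatasync(2) to the syscall thread
    virtual future<> fdatasync(int fd) = 0;
};
```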
This series adds a reactor backend that does not use linux-aio, even for
disk I/O. Instead, it uses the syscall thread for disk I/O.
The new backend is meant for deterministic replay environments such as
rr [1] that don't support aio APIs like linux-aio or io_uring. They emulate threading
by scheduling one thread at a time, so the limitations of the syscall thread aren't
material.
The series also improves tolerance to ENOSYS on such platforms.
[1] https://rr-project.org/