Introduce NUMA and SMT awareness to the asymmetric io_uring reactor backend#3405
Introduce NUMA and SMT awareness to the asymmetric io_uring reactor backend#3405witek-formanski wants to merge 15 commits into
Conversation
Members are private already by default. Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com> Co-authored-by: mikolajbilski <mjbilski@gmail.com> Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com> Co-authored-by: witek-formanski <witekformanski@gmail.com>
This value is related to io_uring (specifically it's submission queue length), not to specific backend implementation. In the future this value will be shared among more than one backend using io_uring. Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com> Co-authored-by: mikolajbilski <mjbilski@gmail.com> Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com> Co-authored-by: witek-formanski <witekformanski@gmail.com>
This class will be a base for all reactor backends using io_uring. Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com> Co-authored-by: mikolajbilski <mjbilski@gmail.com> Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com> Co-authored-by: witek-formanski <witekformanski@gmail.com>
Only methods that are not moved are I/O methods and `get_backend_name`. We're also deriving reactor_backend_uring from reactor_backend_uring_base. Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com> Co-authored-by: mikolajbilski <mjbilski@gmail.com> Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com> Co-authored-by: witek-formanski <witekformanski@gmail.com>
Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com> Co-authored-by: mikolajbilski <mjbilski@gmail.com> Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com> Co-authored-by: witek-formanski <witekformanski@gmail.com>
…_base The implementations of base completion classes were moved from methods to the base class, as they would be the same in both asymmetric and symmetric implementations. Base completion classes which would end as identical classes, such as the ones for recv_some and read_some were merged into a single class. The specific implementations differ only by the inclusion/exclusion of fast track for symmetric/asymmetric backends, respectively. Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com> Co-authored-by: mikolajbilski <mjbilski@gmail.com> Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com> Co-authored-by: witek-formanski <witekformanski@gmail.com>
This method is overriden in the asymmetric class in a later commit that introduces buffer ring support. Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com> Co-authored-by: mikolajbilski <mjbilski@gmail.com> Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com> Co-authored-by: witek-formanski <witekformanski@gmail.com>
Introduce a new class - reactor_backend_asymmetric_uring. This class represents an io_uring-based backend, with networking I/O being handled by io_uring threads living on separate vCPUs, as implemented in a later patch. The "fast track" - speculation on FDs has been removed when compared to the implementation of reactor_backend_uring. The fast track bypassed the backend mechanism completely, executing given operation, e.g. recvmsg() immediately, if the speculation returned true. This contradicts the idea of moving the I/O to different vCPUs, which is the core idea of this backend. Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com> Co-authored-by: mikolajbilski <mjbilski@gmail.com> Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com> Co-authored-by: witek-formanski <witekformanski@gmail.com>
This commit also introduces detection of the availability of asymmetric io_uring reactor backend. Here the conditions for its availability are the same as for symmetric io_uring backend. It is changed in a later commit. Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com> Co-authored-by: mikolajbilski <mjbilski@gmail.com> Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com> Co-authored-by: witek-formanski <witekformanski@gmail.com>
A given asymmetric io_uring backend will either be provided with a ready io_uring, or a ring_fd, so it can create a new instance and use IORING_SETUP_ATTACH_WQ to attach it to an existing one. The variants include std::monostate, to avoid having to pass a value to other backends. To avoid changing struct size depending on conditional compilation, the io_uring is stored as std::any Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com> Co-authored-by: mikolajbilski <mjbilski@gmail.com> Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com> Co-authored-by: witek-formanski <witekformanski@gmail.com>
There are two functions - `try_create_base_asymmetric_uring` which creates a uring with workers on a specific CPU and `try_create_attached_asymmetric_uring` which will have its workers attached to an existing instance. Both rely on `try_create_asymmetric_uring_impl` which is largely copied from `try_create_uring`. To allow the backend to create an attached uring instance or use a ready one, `try_create_asymmetric_uring` takes a variant of these two possibilities. Those methods are created in order to be used in the future during backend creation. Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com> Co-authored-by: mikolajbilski <mjbilski@gmail.com> Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com> Co-authored-by: witek-formanski <witekformanski@gmail.com>
Introduce CLI option to specify async workers cpuset The new option (async-workers-cpuset) is used to cpecify the CPUs on which workers will be spawned. The recommended way to use this option is to specify it alongside --cpuset. When specified with neither --cpuset nor --smp, it will remove the specified cpus from shard allocation. When --smp or --cpuset are specified, it is possible that worker and app cpusets overlap. Add asymmetric uring backend creation to smp::configure Determine the async worker and app cores cpumasks. Implement uring allocation in smp::configure For shards which represent a group of shards using a singular networking core, allocate the uring in smp::configure. This allows uring_fd passing to other shards. Add a barrier to ensure all uring_fds are existant. Use the new uring creation method in reactor_backend_asymmetric_uring constructor. Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com> Co-authored-by: mikolajbilski <mjbilski@gmail.com> Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com> Co-authored-by: witek-formanski <witekformanski@gmail.com>
…licitly If user specifies which cpus to use for async workers, but does not specify cpuset for smp, then we should prevent these 2 from overlap, as this might resolve in performance degradation. Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com> Co-authored-by: mikolajbilski <mjbilski@gmail.com> Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com> Co-authored-by: witek-formanski <witekformanski@gmail.com>
Add a detect_asymmetric_io_uring function, which checks whether an asymmetric uring can be created. Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com> Co-authored-by: mikolajbilski <mjbilski@gmail.com> Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com> Co-authored-by: witek-formanski <witekformanski@gmail.com>
6ee26ae to
2d80442
Compare
|
|
There was a problem hiding this comment.
Pull request overview
This PR is a follow-up to #3297 (asymmetric_io_uring backend) that introduces NUMA and SMT awareness when mapping app shards to networking (worker) cores. Instead of the simple shard % n round-robin assignment, the new uring::compute_assignments performs three passes: SMT-sibling preference, same-NUMA-node preference, then any remaining networking core, each pass picking the least-loaded candidate. Topology data is gathered from hwloc when available, with a degenerate single-NUMA fallback otherwise.
Changes:
- Add
compute_assignments(NUMA/SMT-aware shard→networking-core mapping) and plumb itsnuma_assignmentthroughsmp::configureto drive master/group-id selection per shard. - Introduce
async_workers_cpusetoption,async_worker_allocationhelper, overlap/overprovisioned validation, and a sharedmaster_uring_fdsvector synchronized via astd::barrier. - Refactor
reactor_backend_uringinto areactor_backend_uring_base+ per-backend final classes, share completion base classes, and addreactor_backend_asymmetric_uringusing attached SQPOLL rings.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 14 comments.
Show a summary per file
| File | Description |
|---|---|
| src/core/reactor.cc | Adds async-worker cpuset option/allocation, master uring orchestration with barrier, and per-shard reactor_config plumbing. |
| src/core/reactor_backend.hh | Declares uring:: helpers (try_create_*_asymmetric_uring, numa_assignment, compute_assignments) and shared constants. |
| src/core/reactor_backend.cc | Extracts reactor_backend_uring_base, adds asymmetric backend, hwloc-based NUMA/SMT-aware assignment algorithm. |
| include/seastar/core/reactor.hh | Adds async_worker_allocation struct and friend declarations for new backend classes. |
| include/seastar/core/reactor_config.hh | Adds compile_safe_io_uring alias and asymmetric_uring variant on reactor_config; declares async_workers_cpuset option. |
| include/seastar/core/internal/pollable_fd.hh | Adds friend declarations for the new backend classes. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| "CPUs to use (in cpuset(7) format) for backend's async workers." | ||
| " Only applicable to, and required by, the asymmetric_io_uring reactor backend (see --reactor-backend)." | ||
| " Note that if the --cpuset is not set, using --async-workers-cpuset will restrict the CPUs for the SMP.") |
There was a problem hiding this comment.
I'm not sure what exactly Copilot means here
There was a problem hiding this comment.
But still, this is relevant to #3297 I believe
There was a problem hiding this comment.
The original PR should not describe the NUMA/SMT-aware algorithm, as it is implemented in this PR as a followup.
There was a problem hiding this comment.
Please check if this PR needs any documentation/comments update about NUMA.
| return best; | ||
| }; | ||
|
|
||
| auto assign_pass = [&](auto&& candidate_selector) { |
There was a problem hiding this comment.
changed it to assignment_pass, but it doesn't seem to be much better, do you have any suggestions for a name?
There was a problem hiding this comment.
How about assign_least_loaded or maybe_assign_least_loaded?
^ This can stay here, but it should be removed when #3297 gets merged.
^ This should be last.
^ This should be the second.
^ This should be the first. |
|
Well, right, Copilot reviewed everything - please take a look and apply the fixes to #3297 if relevant. |
This commit changes the algorithm used to asign networking cores to app cores to take NUMA and SMT layout into account. The new algorithm will first assign networking cores to the app cores which are their SMT-siblings, then to app cores that are on the same NUMA node, and finally to the app cores on different NUMA nodes, trying to keep the assignment balanced at each step. Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com> Co-authored-by: mikolajbilski <mjbilski@gmail.com> Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com> Co-authored-by: witek-formanski <witekformanski@gmail.com>
2d80442 to
1175c41
Compare
|
This force-push addresses the previous feedback by changing the following:
@witek-formanski can you please mark relevant comments as resolved? And update the cover letter to: This is a follow-up for the #3297 pull request, adding a new feature to the Changes affect the algorithm used to asign networking cores to app cores, so that it takes NUMA and SMT layout into account. Changes were tested manually by mocking Closes #3315. |
|
|
||
| uring_assignments = compute_assignments(allocations, async_worker_cpus); | ||
|
|
||
| const bool is_master = (*uring_assignments)[0].is_master; |
There was a problem hiding this comment.
Assert that uring_assignments != nullptr && !(*uring_assignment).empty()
| using namespace uring; | ||
| const bool is_master = is_master_shard(i, async_worker_cpus); | ||
| const unsigned uring_group_id = get_uring_group_id(i, async_worker_cpus); | ||
| SEASTAR_ASSERT(uring_assignments); |
There was a problem hiding this comment.
We should assert that i-th element is there as well
There was a problem hiding this comment.
Or before the loop that the size of uring_assignments is as expected
| hwloc_successful_load = true; | ||
| } | ||
|
|
||
| hwloc_topology_destroy(topology); |
There was a problem hiding this comment.
We should destroy topology on failure
This is a follow-up for the #3297 pull request, adding a new feature to the
reactor_backend_asymmetric_uringin the last commit -Introduce NUMA and SMT awareness. To be merged only after the main PR is merged.Changes affect the algorithm used to asign networking cores to app cores, so that it takes NUMA and SMT layout into account.
The new algorithm will first assign networking cores to the app cores which are their SMT-siblings, then to app cores that are on the same NUMA node, and finally to the app cores on different NUMA nodes, trying to keep the assignment balanced at each step.
Changes were tested manually by mocking
hwloctopology in code. The tests produced expected assignment results.Closes #3315 .