Introduce NUMA and SMT awareness to the asymmetric io_uring reactor backend by witek-formanski · Pull Request #3405 · scylladb/seastar

witek-formanski · 2026-05-16T15:00:10Z

This is a follow-up for the #3297 pull request, adding a new feature to the reactor_backend_asymmetric_uring in the last commit - Introduce NUMA and SMT awareness. To be merged only after the main PR is merged.

Changes affect the algorithm used to assign networking cores to app cores, so that it takes NUMA and SMT layout into account.
The new algorithm will first assign networking cores to the app cores which are their SMT-siblings, then to app cores that are on the same NUMA node, and finally to the app cores on different NUMA nodes, trying to keep the assignment balanced at each step.
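
The three passes described above can be sketched roughly as follows. This is a minimal illustration, not the PR's `compute_assignments`: the `cpu_info` struct, the `assign_workers` function, and the assumption that `core_id` is unique machine-wide (so that equal `core_id` means SMT siblings) are all hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical description of one CPU; not the PR's actual types.
struct cpu_info {
    unsigned core_id;   // physical core, assumed unique machine-wide; SMT siblings share it
    unsigned numa_node; // NUMA node the CPU belongs to
};

// For each app CPU, return the index (into `workers`) of the networking CPU it uses.
std::vector<std::size_t> assign_workers(const std::vector<cpu_info>& apps,
                                        const std::vector<cpu_info>& workers) {
    assert(!workers.empty());
    std::vector<std::size_t> load(workers.size(), 0);
    std::vector<std::optional<std::size_t>> chosen(apps.size());

    // One pass: each still-unassigned app CPU takes the least-loaded
    // worker accepted by `eligible`, if there is any.
    auto pass = [&](auto&& eligible) {
        for (std::size_t a = 0; a < apps.size(); ++a) {
            if (chosen[a]) {
                continue;
            }
            std::vector<std::size_t> candidates;
            for (std::size_t w = 0; w < workers.size(); ++w) {
                if (eligible(apps[a], workers[w])) {
                    candidates.push_back(w);
                }
            }
            if (candidates.empty()) {
                continue;
            }
            const std::size_t best = *std::min_element(
                candidates.begin(), candidates.end(),
                [&](std::size_t x, std::size_t y) { return load[x] < load[y]; });
            chosen[a] = best;
            ++load[best];
        }
    };

    // Pass 1: SMT siblings. Pass 2: same NUMA node. Pass 3: any worker.
    pass([](const cpu_info& a, const cpu_info& w) { return a.core_id == w.core_id; });
    pass([](const cpu_info& a, const cpu_info& w) { return a.numa_node == w.numa_node; });
    pass([](const cpu_info&, const cpu_info&) { return true; });

    std::vector<std::size_t> result;
    for (const auto& c : chosen) {
        result.push_back(*c); // every app CPU is assigned after the last pass
    }
    return result;
}
```

In this sketch, ties between equally loaded workers go to the one with the lower index, which is a choice made here for determinism and not something the PR states.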

Changes were tested manually by mocking hwloc topology in code. The tests produced expected assignment results.

Closes #3315.

Members are private already by default. Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com> Co-authored-by: mikolajbilski <mjbilski@gmail.com> Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com> Co-authored-by: witek-formanski <witekformanski@gmail.com>

This value is related to io_uring (specifically its submission queue length), not to a specific backend implementation. In the future this value will be shared among more than one backend using io_uring. Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com> Co-authored-by: mikolajbilski <mjbilski@gmail.com> Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com> Co-authored-by: witek-formanski <witekformanski@gmail.com>

This class will be a base for all reactor backends using io_uring. Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com> Co-authored-by: mikolajbilski <mjbilski@gmail.com> Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com> Co-authored-by: witek-formanski <witekformanski@gmail.com>

The only methods that are not moved are the I/O methods and `get_backend_name`. We're also deriving reactor_backend_uring from reactor_backend_uring_base. Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com> Co-authored-by: mikolajbilski <mjbilski@gmail.com> Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com> Co-authored-by: witek-formanski <witekformanski@gmail.com>

Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com> Co-authored-by: mikolajbilski <mjbilski@gmail.com> Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com> Co-authored-by: witek-formanski <witekformanski@gmail.com>

…_base The implementations of base completion classes were moved from methods to the base class, as they would be the same in both asymmetric and symmetric implementations. Base completion classes which would end as identical classes, such as the ones for recv_some and read_some were merged into a single class. The specific implementations differ only by the inclusion/exclusion of fast track for symmetric/asymmetric backends, respectively. Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com> Co-authored-by: mikolajbilski <mjbilski@gmail.com> Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com> Co-authored-by: witek-formanski <witekformanski@gmail.com>

This method is overridden in the asymmetric class in a later commit that introduces buffer ring support. Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com> Co-authored-by: mikolajbilski <mjbilski@gmail.com> Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com> Co-authored-by: witek-formanski <witekformanski@gmail.com>

Introduce a new class - reactor_backend_asymmetric_uring. This class represents an io_uring-based backend, with networking I/O being handled by io_uring threads living on separate vCPUs, as implemented in a later patch. The "fast track" - speculation on FDs has been removed when compared to the implementation of reactor_backend_uring. The fast track bypassed the backend mechanism completely, executing given operation, e.g. recvmsg() immediately, if the speculation returned true. This contradicts the idea of moving the I/O to different vCPUs, which is the core idea of this backend. Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com> Co-authored-by: mikolajbilski <mjbilski@gmail.com> Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com> Co-authored-by: witek-formanski <witekformanski@gmail.com>

This commit also introduces detection of the availability of asymmetric io_uring reactor backend. Here the conditions for its availability are the same as for symmetric io_uring backend. It is changed in a later commit. Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com> Co-authored-by: mikolajbilski <mjbilski@gmail.com> Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com> Co-authored-by: witek-formanski <witekformanski@gmail.com>

A given asymmetric io_uring backend will be provided with either a ready io_uring or a ring_fd, so that it can create a new instance and use IORING_SETUP_ATTACH_WQ to attach it to an existing one. The variant also includes std::monostate, to avoid having to pass a value to other backends. To avoid changing the struct size depending on conditional compilation, the io_uring is stored as std::any. Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com> Co-authored-by: mikolajbilski <mjbilski@gmail.com> Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com> Co-authored-by: witek-formanski <witekformanski@gmail.com>
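
The variant described in this commit message might look roughly like the sketch below. All names here (`fake_io_uring`, `ready_ring`, `attach_to`, `asymmetric_uring_source`, `describe`) are hypothetical stand-ins, and `fake_io_uring` replaces liburing's `struct io_uring` so that the example builds without liburing.

```cpp
#include <any>
#include <cassert>
#include <string>
#include <variant>

// Stand-in for liburing's `struct io_uring` (assumption, so no liburing is needed).
struct fake_io_uring {
    int ring_fd;
};

// Hypothetical names; the PR's actual aliases and members may differ.
struct ready_ring {
    std::any ring; // type-erased, so the layout does not depend on build flags
};
struct attach_to {
    int ring_fd;   // create a new ring attached to this existing one
};
// std::monostate lets backends that need no ring leave the field defaulted.
using asymmetric_uring_source = std::variant<std::monostate, ready_ring, attach_to>;

std::string describe(const asymmetric_uring_source& src) {
    if (std::holds_alternative<std::monostate>(src)) {
        return "none";
    }
    if (const auto* ready = std::get_if<ready_ring>(&src)) {
        // any_cast on a pointer returns nullptr on a type mismatch instead of throwing.
        const auto* ring = std::any_cast<fake_io_uring>(&ready->ring);
        assert(ring != nullptr);
        return "ready:" + std::to_string(ring->ring_fd);
    }
    return "attach:" + std::to_string(std::get<attach_to>(src).ring_fd);
}
```

Storing the ring behind `std::any` costs a runtime type check at the point of use, which is the trade-off for keeping the public struct independent of whether io_uring support is compiled in.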

There are two functions - `try_create_base_asymmetric_uring` which creates a uring with workers on a specific CPU and `try_create_attached_asymmetric_uring` which will have its workers attached to an existing instance. Both rely on `try_create_asymmetric_uring_impl` which is largely copied from `try_create_uring`. To allow the backend to create an attached uring instance or use a ready one, `try_create_asymmetric_uring` takes a variant of these two possibilities. Those methods are created in order to be used in the future during backend creation. Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com> Co-authored-by: mikolajbilski <mjbilski@gmail.com> Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com> Co-authored-by: witek-formanski <witekformanski@gmail.com>

Introduce CLI option to specify async workers cpuset

The new option (async-workers-cpuset) is used to specify the CPUs on which workers will be spawned. The recommended way to use this option is to specify it alongside --cpuset. When specified with neither --cpuset nor --smp, it will remove the specified CPUs from shard allocation. When --smp or --cpuset is specified, it is possible that the worker and app cpusets overlap.

Add asymmetric uring backend creation to smp::configure

Determine the async worker and app cores cpumasks.

Implement uring allocation in smp::configure

For shards which represent a group of shards using a single networking core, allocate the uring in smp::configure. This allows uring_fd passing to other shards. Add a barrier to ensure all uring_fds exist. Use the new uring creation method in the reactor_backend_asymmetric_uring constructor. Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com> Co-authored-by: mikolajbilski <mjbilski@gmail.com> Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com> Co-authored-by: witek-formanski <witekformanski@gmail.com>

…licitly If the user specifies which CPUs to use for async workers but does not specify a cpuset for smp, then we should prevent the two from overlapping, as an overlap might result in performance degradation. Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com> Co-authored-by: mikolajbilski <mjbilski@gmail.com> Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com> Co-authored-by: witek-formanski <witekformanski@gmail.com>
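
The cpuset rule from these two commit messages can be illustrated with plain sets. This is only a sketch under assumptions: `cpu_set`, `app_cpus`, and `overlaps` are hypothetical helpers, CPUs are modelled as a `std::set<unsigned>` rather than Seastar's cpuset type, and the `--smp` case is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <optional>
#include <set>

using cpu_set = std::set<unsigned>;

// Hypothetical helper: picks the CPUs left for app shards.
// `available` models every CPU the process may use,
// `explicit_cpuset` models --cpuset (empty optional when not given),
// `async_workers` models --async-workers-cpuset.
cpu_set app_cpus(const cpu_set& available,
                 const std::optional<cpu_set>& explicit_cpuset,
                 const cpu_set& async_workers) {
    assert(!async_workers.empty()); // the sketch assumes the option was given
    if (explicit_cpuset) {
        // The user chose the app CPUs explicitly; any overlap is left as-is.
        return *explicit_cpuset;
    }
    // No explicit cpuset: keep worker CPUs out of the shard allocation.
    cpu_set result;
    std::set_difference(available.begin(), available.end(),
                        async_workers.begin(), async_workers.end(),
                        std::inserter(result, result.end()));
    return result;
}

// True when the two sets share at least one CPU.
bool overlaps(const cpu_set& a, const cpu_set& b) {
    return std::any_of(a.begin(), a.end(),
                       [&](unsigned cpu) { return b.count(cpu) != 0; });
}
```

With CPUs 0-3 available and workers on 2-3, the sketch leaves 0-1 for app shards when no cpuset is given, while an explicit cpuset of 1-2 is returned unchanged and therefore overlaps the workers on CPU 2.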

Add a detect_asymmetric_io_uring function, which checks whether an asymmetric uring can be created. Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com> Co-authored-by: mikolajbilski <mjbilski@gmail.com> Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com> Co-authored-by: witek-formanski <witekformanski@gmail.com>

witek-formanski · 2026-05-16T17:52:37Z

force-push to remove trailing whitespace

Copilot

Pull request overview

This PR is a follow-up to #3297 (asymmetric_io_uring backend) that introduces NUMA and SMT awareness when mapping app shards to networking (worker) cores. Instead of the simple shard % n round-robin assignment, the new uring::compute_assignments performs three passes: SMT-sibling preference, same-NUMA-node preference, then any remaining networking core, each pass picking the least-loaded candidate. Topology data is gathered from hwloc when available, with a degenerate single-NUMA fallback otherwise.

Changes:

Add compute_assignments (NUMA/SMT-aware shard→networking-core mapping) and plumb its numa_assignment through smp::configure to drive master/group-id selection per shard.
Introduce async_workers_cpuset option, async_worker_allocation helper, overlap/overprovisioned validation, and a shared master_uring_fds vector synchronized via a std::barrier.
Refactor reactor_backend_uring into a reactor_backend_uring_base + per-backend final classes, share completion base classes, and add reactor_backend_asymmetric_uring using attached SQPOLL rings.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 14 comments.

Show a summary per file

File	Description
src/core/reactor.cc	Adds async-worker cpuset option/allocation, master uring orchestration with barrier, and per-shard reactor_config plumbing.
src/core/reactor_backend.hh	Declares `uring::` helpers (`try_create_*_asymmetric_uring`, `numa_assignment`, `compute_assignments`) and shared constants.
src/core/reactor_backend.cc	Extracts `reactor_backend_uring_base`, adds asymmetric backend, hwloc-based NUMA/SMT-aware assignment algorithm.
include/seastar/core/reactor.hh	Adds `async_worker_allocation` struct and friend declarations for new backend classes.
include/seastar/core/reactor_config.hh	Adds `compile_safe_io_uring` alias and `asymmetric_uring` variant on reactor_config; declares `async_workers_cpuset` option.
include/seastar/core/internal/pollable_fd.hh	Adds friend declarations for the new backend classes.


mikolajbilski · 2026-05-22T20:05:16Z

+                "CPUs to use (in cpuset(7) format) for backend's async workers."
+                " Only applicable to, and required by, the asymmetric_io_uring reactor backend (see --reactor-backend)."
+                " Note that if the --cpuset is not set, using --async-workers-cpuset will restrict the CPUs for the SMP.")


I'm not sure what exactly Copilot means here

But still, this is relevant to #3297 I believe

The original PR should not describe the NUMA/SMT-aware algorithm, as it is implemented in this PR as a followup.

Please check if this PR needs any documentation/comments update about NUMA.

Deexie · 2026-05-18T08:37:27Z

+        return best;
+    };
+
+    auto assign_pass = [&](auto&& candidate_selector) {


The name is weird

changed it to assignment_pass, but it doesn't seem to be much better, do you have any suggestions for a name?

How about assign_least_loaded or maybe_assign_least_loaded?

Deexie · 2026-05-18T08:43:33Z

This is a follow-up for the #3297 pull request, adding a new feature to the reactor_backend_asymmetric_uring in the last commit - Introduce NUMA and SMT awareness. To be merged only after the main PR is merged.

^ This can stay here, but it should be removed when #3297 gets merged.

Closes #3315 .

^ This should be last.

Changes were tested manually by mocking hwloc topology in code. The tests produced expected assignment results.

^ This should be the second.

Changes affect the algorithm used to assign networking cores to app cores, so that it takes NUMA and SMT layout into account. The new algorithm will first assign networking cores to the app cores which are their SMT-siblings, then to app cores that are on the same NUMA node, and finally to the app cores on different NUMA nodes, trying to keep the assignment balanced at each step.

^ This should be the first.

Deexie · 2026-05-18T08:45:47Z

Well, right, Copilot reviewed everything - please take a look and apply the fixes to #3297 if relevant.

This commit changes the algorithm used to assign networking cores to app cores to take NUMA and SMT layout into account. The new algorithm will first assign networking cores to the app cores which are their SMT-siblings, then to app cores that are on the same NUMA node, and finally to the app cores on different NUMA nodes, trying to keep the assignment balanced at each step. Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com> Co-authored-by: mikolajbilski <mjbilski@gmail.com> Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com> Co-authored-by: witek-formanski <witekformanski@gmail.com>

mikolajbilski · 2026-05-22T20:18:10Z

This force-push addresses the previous feedback by changing the following:

fix incorrect master_uring_fds size bug
add error handling to hwloc calls
change unnecessary find calls to operator[]
add logging when there is no hwloc or hwloc fails and single-NUMA fallback is triggered
forward-declare numa_assignment
changed compute_assignments to return a shared_ptr to vector of numa_assignment
add missing reference in arguments of create_thread
use std::min_element in choose_least_loaded
a few stylistic polishes

@witek-formanski can you please mark relevant comments as resolved? And update the cover letter to:

This is a follow-up for the #3297 pull request, adding a new feature to the reactor_backend_asymmetric_uring in the last commit - Introduce NUMA and SMT awareness. To be merged only after the main PR is merged.

Changes affect the algorithm used to assign networking cores to app cores, so that it takes NUMA and SMT layout into account.
The new algorithm will first assign networking cores to the app cores which are their SMT-siblings, then to app cores that are on the same NUMA node, and finally to the app cores on different NUMA nodes, trying to keep the assignment balanced at each step.

Changes were tested manually by mocking hwloc topology in code. The tests produced expected assignment results.

Closes #3315.

Deexie · 2026-05-25T08:48:22Z

+
+        uring_assignments = compute_assignments(allocations, async_worker_cpus);
+
+        const bool is_master = (*uring_assignments)[0].is_master;


Assert that uring_assignments != nullptr && !(*uring_assignments).empty()

Deexie · 2026-05-25T08:49:46Z

                using namespace uring;
-                const bool is_master = is_master_shard(i, async_worker_cpus);
-                const unsigned uring_group_id = get_uring_group_id(i, async_worker_cpus);
+                SEASTAR_ASSERT(uring_assignments);


We should assert that i-th element is there as well

Or before the loop that the size of uring_assignments is as expected
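
A single check before the loop, as suggested here, could look like the sketch below. It is an assumption-laden illustration: the `numa_assignment` fields other than `is_master`, the helpers `assignments_valid` and `count_masters`, and the use of plain `assert` in place of `SEASTAR_ASSERT` are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical shape; the PR's numa_assignment may carry other fields.
struct numa_assignment {
    bool is_master;
    unsigned group_id;
};

using assignments_ptr = std::shared_ptr<std::vector<numa_assignment>>;

// True when it is safe to index (*assignments)[0 .. shard_count).
bool assignments_valid(const assignments_ptr& assignments, std::size_t shard_count) {
    return assignments != nullptr
        && shard_count > 0
        && assignments->size() == shard_count;
}

unsigned count_masters(const assignments_ptr& assignments, std::size_t shard_count) {
    // Checked once, before the loop, instead of asserting on every index.
    assert(assignments_valid(assignments, shard_count));
    unsigned masters = 0;
    for (std::size_t i = 0; i < shard_count; ++i) {
        masters += (*assignments)[i].is_master ? 1u : 0u;
    }
    return masters;
}
```

Checking the size up front also covers the earlier request about element 0, since a vector whose size equals a positive shard count cannot be empty.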

Deexie · 2026-05-25T08:52:46Z

+            hwloc_successful_load = true;
+        }
+
+        hwloc_topology_destroy(topology);


We should destroy topology on failure
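
One way to guarantee the destroy call on every path is a `std::unique_ptr` with a custom deleter. The sketch below uses stand-in functions (`fake_init`, `fake_load`, `fake_destroy`) that are not hwloc functions; they only mimic the init/load/destroy shape of `hwloc_topology_init`, `hwloc_topology_load`, and `hwloc_topology_destroy` so the example builds without hwloc.

```cpp
#include <cassert>
#include <memory>

// Stand-ins for a C-style init/load/destroy API; these are NOT hwloc functions.
struct fake_topology {
    bool loaded = false;
};
int destroyed_count = 0; // lets the example observe that destroy really ran

fake_topology* fake_init() { return new fake_topology{}; }
int fake_load(fake_topology* t, bool succeed) {
    t->loaded = succeed;
    return succeed ? 0 : -1;
}
void fake_destroy(fake_topology* t) {
    ++destroyed_count;
    delete t;
}

struct topology_deleter {
    void operator()(fake_topology* t) const { fake_destroy(t); }
};
using topology_ptr = std::unique_ptr<fake_topology, topology_deleter>;

// Returns true when the topology loaded; the handle is destroyed on every path.
bool try_load_topology(bool load_succeeds) {
    topology_ptr topology(fake_init());
    assert(topology != nullptr);
    if (fake_load(topology.get(), load_succeeds) != 0) {
        return false; // early return: the deleter still runs
    }
    // ... the NUMA/SMT layout would be read from the topology here ...
    return true;
}
```

With this shape, adding further early returns for other failing calls does not require remembering a matching destroy call at each of them.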

witek-formanski and others added 14 commits May 13, 2026 17:03

Extract prepare_sqe

0f5c46c

Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com> Co-authored-by: mikolajbilski <mjbilski@gmail.com> Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com> Co-authored-by: witek-formanski <witekformanski@gmail.com>

witek-formanski force-pushed the feat/numa-awareness branch from 6ee26ae to 2d80442 Compare May 16, 2026 17:52

Deexie requested a review from Copilot May 18, 2026 07:51

Copilot started reviewing on behalf of Deexie May 18, 2026 07:52

Copilot AI reviewed May 18, 2026

Deexie reviewed May 18, 2026

mikolajbilski force-pushed the feat/numa-awareness branch from 2d80442 to 1175c41 Compare May 22, 2026 19:53

Deexie reviewed May 25, 2026

Deexie requested review from avikivity and xemul May 25, 2026 08:59

witek-formanski mentioned this pull request Jun 8, 2026

doc: Add documentation for reactor_backend_asymmetric_uring zpp-2025-io-uring/seastar#14

Draft




6 participants
