Skip to content

Introduce NUMA and SMT awareness to the asymmetric io_uring reactor backend#3405

Draft
witek-formanski wants to merge 15 commits into
scylladb:masterfrom
pixelkubek:feat/numa-awareness
Draft

Introduce NUMA and SMT awareness to the asymmetric io_uring reactor backend#3405
witek-formanski wants to merge 15 commits into
scylladb:masterfrom
pixelkubek:feat/numa-awareness

Conversation

@witek-formanski

@witek-formanski witek-formanski commented May 16, 2026

Copy link
Copy Markdown
Contributor

This is a follow-up for the #3297 pull request, adding a new feature to the reactor_backend_asymmetric_uring in the last commit - Introduce NUMA and SMT awareness. To be merged only after the main PR is merged.

Changes affect the algorithm used to asign networking cores to app cores, so that it takes NUMA and SMT layout into account.
The new algorithm will first assign networking cores to the app cores which are their SMT-siblings, then to app cores that are on the same NUMA node, and finally to the app cores on different NUMA nodes, trying to keep the assignment balanced at each step.

Changes were tested manually by mocking hwloc topology in code. The tests produced expected assignment results.

Closes #3315 .

witek-formanski and others added 14 commits May 13, 2026 17:03
Members are private already by default.

Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com>
Co-authored-by: mikolajbilski <mjbilski@gmail.com>
Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com>
Co-authored-by: witek-formanski <witekformanski@gmail.com>
This value is related to io_uring (specifically it's submission queue length), not to specific backend implementation. In the future this value will be shared among more than one backend using io_uring.

Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com>
Co-authored-by: mikolajbilski <mjbilski@gmail.com>
Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com>
Co-authored-by: witek-formanski <witekformanski@gmail.com>
This class will be a base for all reactor backends using io_uring.

Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com>
Co-authored-by: mikolajbilski <mjbilski@gmail.com>
Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com>
Co-authored-by: witek-formanski <witekformanski@gmail.com>
Only methods that are not moved are I/O methods and `get_backend_name`.
We're also deriving reactor_backend_uring from
reactor_backend_uring_base.

Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com>
Co-authored-by: mikolajbilski <mjbilski@gmail.com>
Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com>
Co-authored-by: witek-formanski <witekformanski@gmail.com>
Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com>
Co-authored-by: mikolajbilski <mjbilski@gmail.com>
Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com>
Co-authored-by: witek-formanski <witekformanski@gmail.com>
…_base

The implementations of base completion classes were moved from methods to the base class, as they would be the same in both asymmetric and symmetric implementations. Base completion classes which would end as identical classes, such as the ones for recv_some and read_some were merged into a single class. The specific implementations differ only by the inclusion/exclusion of fast track for symmetric/asymmetric backends, respectively.

Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com>
Co-authored-by: mikolajbilski <mjbilski@gmail.com>
Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com>
Co-authored-by: witek-formanski <witekformanski@gmail.com>
This method is overriden in the asymmetric class in a later commit that introduces buffer ring support.

Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com>
Co-authored-by: mikolajbilski <mjbilski@gmail.com>
Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com>
Co-authored-by: witek-formanski <witekformanski@gmail.com>
Introduce a new class - reactor_backend_asymmetric_uring. This class represents an io_uring-based backend, with networking I/O being handled by io_uring threads living on separate vCPUs, as implemented in a later patch.

The "fast track" - speculation on FDs has been removed when compared to the implementation of reactor_backend_uring. The fast track bypassed the backend mechanism completely, executing given operation, e.g. recvmsg() immediately, if the speculation returned true. This contradicts the idea of moving the I/O to different vCPUs, which is the core idea of this backend.

Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com>
Co-authored-by: mikolajbilski <mjbilski@gmail.com>
Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com>
Co-authored-by: witek-formanski <witekformanski@gmail.com>
This commit also introduces detection of the availability of asymmetric io_uring reactor backend. Here the conditions for its availability are the same as for symmetric io_uring backend. It is changed in a later commit.

Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com>
Co-authored-by: mikolajbilski <mjbilski@gmail.com>
Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com>
Co-authored-by: witek-formanski <witekformanski@gmail.com>
A given asymmetric io_uring backend will either be provided with a ready io_uring, or a ring_fd, so it can create a new instance and use IORING_SETUP_ATTACH_WQ to attach it to an existing one.
The variants include std::monostate, to avoid having to pass a value to other backends.
To avoid changing struct size depending on conditional compilation, the io_uring is stored as std::any

Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com>
Co-authored-by: mikolajbilski <mjbilski@gmail.com>
Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com>
Co-authored-by: witek-formanski <witekformanski@gmail.com>
There are two functions - `try_create_base_asymmetric_uring` which creates a uring with workers on a specific CPU and `try_create_attached_asymmetric_uring` which will have its workers attached to an existing instance.
Both rely on `try_create_asymmetric_uring_impl` which is largely copied from `try_create_uring`.
To allow the backend to create an attached uring instance or use a ready one, `try_create_asymmetric_uring` takes a variant of these two possibilities.
Those methods are created in order to be used in the future during backend creation.

Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com>
Co-authored-by: mikolajbilski <mjbilski@gmail.com>
Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com>
Co-authored-by: witek-formanski <witekformanski@gmail.com>
Introduce CLI option to specify async workers cpuset

The new option (async-workers-cpuset) is used to cpecify the CPUs on which workers will be spawned. The recommended way to use this option is to specify it alongside --cpuset. When specified with neither --cpuset nor --smp, it will remove the specified cpus from shard allocation. When --smp or --cpuset are specified, it is possible that worker and app cpusets overlap.

Add asymmetric uring backend creation to smp::configure

Determine the async worker and app cores cpumasks.

Implement uring allocation in smp::configure

For shards which represent a group of shards using a singular networking core, allocate the uring in smp::configure.
This allows uring_fd passing to other shards.
Add a barrier to ensure all uring_fds are existant.

Use the new uring creation method in reactor_backend_asymmetric_uring constructor.

Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com>
Co-authored-by: mikolajbilski <mjbilski@gmail.com>
Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com>
Co-authored-by: witek-formanski <witekformanski@gmail.com>
…licitly

If user specifies which cpus to use for async workers, but does not
specify cpuset for smp, then we should prevent these 2 from overlap, as
this might resolve in performance degradation.

Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com>
Co-authored-by: mikolajbilski <mjbilski@gmail.com>
Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com>
Co-authored-by: witek-formanski <witekformanski@gmail.com>
Add a detect_asymmetric_io_uring function, which checks whether an asymmetric uring can be created.

Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com>
Co-authored-by: mikolajbilski <mjbilski@gmail.com>
Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com>
Co-authored-by: witek-formanski <witekformanski@gmail.com>
@witek-formanski

Copy link
Copy Markdown
Contributor Author

force-push to remove trailing whitespace

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR is a follow-up to #3297 (asymmetric_io_uring backend) that introduces NUMA and SMT awareness when mapping app shards to networking (worker) cores. Instead of the simple shard % n round-robin assignment, the new uring::compute_assignments performs three passes: SMT-sibling preference, same-NUMA-node preference, then any remaining networking core, each pass picking the least-loaded candidate. Topology data is gathered from hwloc when available, with a degenerate single-NUMA fallback otherwise.

Changes:

  • Add compute_assignments (NUMA/SMT-aware shard→networking-core mapping) and plumb its numa_assignment through smp::configure to drive master/group-id selection per shard.
  • Introduce async_workers_cpuset option, async_worker_allocation helper, overlap/overprovisioned validation, and a shared master_uring_fds vector synchronized via a std::barrier.
  • Refactor reactor_backend_uring into a reactor_backend_uring_base + per-backend final classes, share completion base classes, and add reactor_backend_asymmetric_uring using attached SQPOLL rings.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 14 comments.

Show a summary per file
File Description
src/core/reactor.cc Adds async-worker cpuset option/allocation, master uring orchestration with barrier, and per-shard reactor_config plumbing.
src/core/reactor_backend.hh Declares uring:: helpers (try_create_*_asymmetric_uring, numa_assignment, compute_assignments) and shared constants.
src/core/reactor_backend.cc Extracts reactor_backend_uring_base, adds asymmetric backend, hwloc-based NUMA/SMT-aware assignment algorithm.
include/seastar/core/reactor.hh Adds async_worker_allocation struct and friend declarations for new backend classes.
include/seastar/core/reactor_config.hh Adds compile_safe_io_uring alias and asymmetric_uring variant on reactor_config; declares async_workers_cpuset option.
include/seastar/core/internal/pollable_fd.hh Adds friend declarations for the new backend classes.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/core/reactor.cc Outdated
Comment thread src/core/reactor_backend.cc
Comment thread src/core/reactor_backend.cc Outdated
Comment thread src/core/reactor_backend.cc Outdated
Comment thread src/core/reactor_backend.cc
Comment thread src/core/reactor_backend.cc
Comment thread src/core/reactor.cc
Comment thread src/core/reactor.cc
Comment on lines +4033 to +4035
"CPUs to use (in cpuset(7) format) for backend's async workers."
" Only applicable to, and required by, the asymmetric_io_uring reactor backend (see --reactor-backend)."
" Note that if the --cpuset is not set, using --async-workers-cpuset will restrict the CPUs for the SMP.")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what exactly Copilot means here

@mikolajbilski mikolajbilski May 22, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But still, this is relevant to #3297 I believe

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original PR should not describe the NUMA/SMT-aware algorithm, as it is implemented in this PR as a followup.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check if this PR needs any documentation/comments update about NUMA.

Comment thread src/core/reactor_backend.cc Outdated
Comment thread src/core/reactor_backend.cc
Comment thread src/core/reactor.cc Outdated
Comment thread src/core/reactor_backend.hh Outdated
Comment thread src/core/reactor.cc Outdated
Comment thread src/core/reactor.cc Outdated
Comment thread src/core/reactor_backend.cc
Comment thread src/core/reactor_backend.cc Outdated
Comment thread src/core/reactor_backend.cc Outdated
Comment thread src/core/reactor_backend.cc Outdated
return best;
};

auto assign_pass = [&](auto&& candidate_selector) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name is weird

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed it to assignment_pass, but it doesn't seem to be much better, do you have any suggestions for a name?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about assign_least_loaded or maybe_assign_least_loaded?

Comment thread src/core/reactor_backend.cc Outdated
Comment thread src/core/reactor_backend.cc Outdated
@Deexie

Deexie commented May 18, 2026

Copy link
Copy Markdown
Contributor

This is a follow-up for the #3297 pull request, adding a new feature to the reactor_backend_asymmetric_uring in the last commit - Introduce NUMA and SMT awareness. To be merged only after the main PR is merged.

^ This can stay here, but it should be removed when #3297 gets merged.

Closes #3315 .

^ This should be last.

Changes were tested manually by mocking hwloc topology in code. The tests produced expected assignment results.

^ This should be the second.

Changes affect the algorithm used to asign networking cores to app cores, so that it takes NUMA and SMT layout into account. The new algorithm will first assign networking cores to the app cores which are their SMT-siblings, then to app cores that are on the same NUMA node, and finally to the app cores on different NUMA nodes, trying to keep the assignment balanced at each step.

^ This should be the first.

@Deexie

Deexie commented May 18, 2026

Copy link
Copy Markdown
Contributor

Well, right, Copilot reviewed everything - please take a look and apply the fixes to #3297 if relevant.

This commit changes the algorithm used to asign networking cores to app cores to take NUMA and SMT layout into account.
The new algorithm will first assign networking cores to the app cores which are their SMT-siblings, then to app cores that are on the same NUMA node,
and finally to the app cores on different NUMA nodes, trying to keep the assignment balanced at each step.

Co-authored-by: MrD4rkne <marcinsz.pv@gmail.com>
Co-authored-by: mikolajbilski <mjbilski@gmail.com>
Co-authored-by: pixelkubek <czyszczon.kuba@gmail.com>
Co-authored-by: witek-formanski <witekformanski@gmail.com>
@mikolajbilski mikolajbilski force-pushed the feat/numa-awareness branch from 2d80442 to 1175c41 Compare May 22, 2026 19:53
@mikolajbilski

Copy link
Copy Markdown

This force-push addresses the previous feedback by changing the following:

  • fix incorrect master_uring_fds size bug
  • add error handling to hwloc calls
  • change unnecessary find calls to operator[]
  • add logging when there is no hwloc or hwloc fails and single-NUMA fallback is triggered
  • forward-declare numa_assignment
  • changed compute_assignments to return a shared_ptr to vector of numa_assignment
  • add missing reference in arguments of create_thread
  • use std::min_element in choose_least_loaded
  • a few stylistic polishes

@witek-formanski can you please mark relevant comments as resolved? And update the cover letter to:

This is a follow-up for the #3297 pull request, adding a new feature to the reactor_backend_asymmetric_uring in the last commit - Introduce NUMA and SMT awareness. To be merged only after the main PR is merged.

Changes affect the algorithm used to asign networking cores to app cores, so that it takes NUMA and SMT layout into account.
The new algorithm will first assign networking cores to the app cores which are their SMT-siblings, then to app cores that are on the same NUMA node, and finally to the app cores on different NUMA nodes, trying to keep the assignment balanced at each step.

Changes were tested manually by mocking hwloc topology in code. The tests produced expected assignment results.

Closes #3315.

Comment thread src/core/reactor.cc

uring_assignments = compute_assignments(allocations, async_worker_cpus);

const bool is_master = (*uring_assignments)[0].is_master;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assert that uring_assignments != nullptr && !(*uring_assignment).empty()

Comment thread src/core/reactor.cc
using namespace uring;
const bool is_master = is_master_shard(i, async_worker_cpus);
const unsigned uring_group_id = get_uring_group_id(i, async_worker_cpus);
SEASTAR_ASSERT(uring_assignments);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should assert that i-th element is there as well

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or before the loop that the size of uring_assignments is as expected

hwloc_successful_load = true;
}

hwloc_topology_destroy(topology);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should destroy topology on failure

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Asymmetric io_uring backend] - Take in consideration NUMA when distributing the workers

6 participants