Skip to content

refactor(BA-5857): Pin container memory to NUMA-local node via CpusetMems#11222

Open
rapsealk wants to merge 10 commits into
mainfrom
perf/11217-numa-cpuset-mems
Open

refactor(BA-5857): Pin container memory to NUMA-local node via CpusetMems#11222
rapsealk wants to merge 10 commits into
mainfrom
perf/11217-numa-cpuset-mems

Conversation

@rapsealk
Copy link
Copy Markdown
Member

@rapsealk rapsealk commented Apr 22, 2026

Closes #11217 (BA-5857)
Refs #11216

Summary

  • Set HostConfig.CpusetMems to the NUMA node ID when all allocated CPU cores are on the same node.
  • Omit CpusetMems for multi-node allocations so the kernel's default NUMA memory placement policy applies.
  • Short-circuit on libnuma.num_nodes() <= 1 so non-NUMA Linux hosts and macOS / WSL / Docker Desktop don't get CpusetMems="0" pinned on every container.
  • Use per-allocated-core libnuma.node_of_cpu() instead of a host-wide list_devices() scan.

Why

Pre-PR the NUMA memory pinning code was commented out, so CPU-only workloads on NUMA hosts could pay 2-3× remote-memory access latency. Part of the agent-runtime performance epic #11216.

Test plan

  • pants check src/ai/backend/agent/docker:: and pants test tests/unit/agent:: pass
  • On a NUMA host, verify container HostConfig.CpusetMems reflects the allocation node
  • On a single-socket / non-NUMA host, verify CpusetMems is absent

@rapsealk rapsealk added this to the 26.5 milestone Apr 22, 2026
Copilot AI review requested due to automatic review settings April 22, 2026 04:25
@github-actions github-actions Bot added the size:S 10~30 LoC label Apr 22, 2026
@github-actions github-actions Bot added the comp:agent Related to Agent component label Apr 22, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds NUMA-aware memory pinning for CPU-only Docker containers by setting HostConfig.CpusetMems when the allocated CPU cores are all on the same NUMA node, and omitting it otherwise.

Changes:

  • Build HostConfig in CPUPlugin.generate_docker_args() and conditionally set CpusetMems when allocation is NUMA-local.
  • Add a changelog entry for the enhancement.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
src/ai/backend/agent/docker/intrinsic.py Detects allocated cores’ NUMA node(s) and sets HostConfig.CpusetMems only for single-node allocations.
changes/11217.enhance.md Documents the NUMA-local CpusetMems behavior change.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +440 to +444
devices = await self.list_devices()
core_to_node = {int(dev.device_id): dev.numa_node for dev in devices}
allocated_nodes: set[int] = set()
for core in cores:
node = core_to_node.get(core)
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generate_docker_args() now calls await self.list_devices() just to map allocated core IDs to NUMA nodes. list_devices() iterates all available cores and calls libnuma.node_of_cpu() for each, so this adds per-container startup overhead that scales with host CPU count. Consider computing the NUMA node only for the allocated cores (e.g., call libnuma.node_of_cpu(core) inside the loop), or cache a core_id -> numa_node mapping on the plugin instance and reuse it.

Suggested change
devices = await self.list_devices()
core_to_node = {int(dev.device_id): dev.numa_node for dev in devices}
allocated_nodes: set[int] = set()
for core in cores:
node = core_to_node.get(core)
allocated_nodes: set[int] = set()
for core in cores:
node = libnuma.node_of_cpu(core)

Copilot uses AI. Check for mistakes.
Comment on lines +449 to +450
# Pin memory to the NUMA node only when the allocation is fully node-local;
# Docker does not expose a multi-node cpuset.mems equivalent via HostConfig.
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says Docker does not expose a multi-node cpuset.mems equivalent via HostConfig, but HostConfig.CpusetMems accepts the same cpuset list/range syntax as cpuset.mems (e.g., "0,1"). If the intent is to only pin when node-local, reword this to reflect that it’s a deliberate policy choice (and why), not a Docker API limitation.

Suggested change
# Pin memory to the NUMA node only when the allocation is fully node-local;
# Docker does not expose a multi-node cpuset.mems equivalent via HostConfig.
# Pin memory only when the CPU allocation is fully node-local.
# For multi-node CPU allocations, intentionally leave CpusetMems unset
# so Docker/kernel default NUMA memory placement policy can apply.

Copilot uses AI. Check for mistakes.
Comment thread src/ai/backend/agent/docker/intrinsic.py
@rapsealk rapsealk changed the title enhance(agent): pin container memory to NUMA-local node via CpusetMems refactor(agent): Pin container memory to NUMA-local node via CpusetMems Apr 22, 2026
rapsealk added a commit that referenced this pull request Apr 22, 2026
Adds unit coverage for generate_docker_args() setting HostConfig.CpusetMems
only when the CPU allocation is fully within a single NUMA node, and
omitting the key for multi-node or unknown/negative NUMA mappings.

Refs #11217
Refs #11222

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added size:L 100~500 LoC and removed size:S 10~30 LoC labels Apr 22, 2026
rapsealk added a commit that referenced this pull request Apr 22, 2026
- Query libnuma.node_of_cpu() only for allocated cores instead of calling
  self.list_devices(), which iterates every core on the host on each
  container start. Drops per-start overhead from O(host_cores) to
  O(allocated_cores).
- Reword the CpusetMems comment to reflect that omission for multi-node
  allocations is a deliberate policy choice (defer to kernel default NUMA
  placement) rather than a Docker HostConfig API limitation.
- Update the NUMA-locality unit tests to patch libnuma.node_of_cpu
  directly, matching the new implementation seam.

Refs #11217
Refs #11222

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rapsealk added a commit that referenced this pull request Apr 22, 2026
…mbstone

Addresses review feedback on PR #11222:
- Skip CpusetMems entirely when libnuma reports NUMA unsupported or
  the host has a single node, so non-NUMA Linux hosts and macOS/WSL
  dev environments do not get CpusetMems="0" pinned on every container.
- Add a unit test covering the non-NUMA short-circuit branch.
- Remove the twin commented-out CpusetMems tombstone from
  src/ai/backend/agent/dummy/intrinsic.py.

Refs #11217
Refs #11222

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rapsealk
Copy link
Copy Markdown
Member Author

Review summary from an independent pass + resolutions pushed to this branch.

Verdict: ship with changes — addressed.

Finding Resolution
libnuma.node_of_cpu returns 0 on non-NUMA/non-Linux hosts → every container would get CpusetMems="0" pinned Short-circuit on libnuma.num_nodes() > 1 before the per-core loop (commit 0eda461ec)
Perf: list_devices() iterates every host core just to build a lookup Switched to per-core libnuma.node_of_cpu(core) (commit 64529b24a)
Comment claimed Docker HostConfig can't express multi-node CpusetMems — factually wrong Reworded as deliberate policy: defer to kernel default NUMA placement for multi-node allocations (commit 64529b24a)
Twin tombstone in dummy/intrinsic.py:148 Removed (commit 0eda461ec)
No unit tests for the new branching 4 tests added: single-node, multi-node, unknown/negative-node, non-NUMA short-circuit

Scope stayed tight: only docker/intrinsic.py, dummy/intrinsic.py, and tests/unit/agent/test_docker_intrinsic.py touched. libnuma.node_of_cpu is a ctypes call into an in-memory lookup, so calling it synchronously inside generate_docker_args is fine per the agent's async rules. Scheduler default is AffinityPolicy.PREFER_SINGLE_NODE, so the common case hits the pin-to-node branch; multi-node correctly falls through to kernel default rather than forcing "0,1" across all nodes.

@rapsealk rapsealk requested review from a team and achimnol April 22, 2026 06:17
rapsealk added a commit that referenced this pull request Apr 22, 2026
- Remove unused local_config + _docker mock from the NUMA-locality
  cpu_plugin fixture: CPUPlugin.generate_docker_args doesn't read them.
- Replace the case_id parametrize column with pytest's `ids=` kwarg;
  pytest already labels failures via the parametrize id.

Refs #11217
Refs #11222

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment on lines +440 to +456
# Skip CpusetMems entirely when NUMA is unsupported (non-Linux hosts,
# Linux without libnuma.so, Docker Desktop, WSL, etc.) or when the host
# exposes a single node; libnuma.node_of_cpu would otherwise fall back
# to 0 and cause every container to be pinned to "CpusetMems": "0".
if libnuma.num_nodes() > 1:
allocated_nodes: set[int] = set()
for core in cores:
node = libnuma.node_of_cpu(core)
if node < 0:
allocated_nodes.clear()
break
allocated_nodes.add(node)
# Pin memory only when the CPU allocation is fully node-local.
# For multi-node CPU allocations, intentionally leave CpusetMems unset
# so Docker/kernel default NUMA memory placement policy can apply.
if len(allocated_nodes) == 1:
host_config["CpusetMems"] = str(next(iter(allocated_nodes)))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about extracting this logic into a separate private method?


@staticmethod
@contextmanager
def _patch_node_of_cpu(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this method is intended to reproduce specific test scenarios, it would be more readable to inject fixtures for each scenario and apply parametrization.

rapsealk added a commit that referenced this pull request Apr 24, 2026
…ests

Address review comments from @jopemachine on PR #11222:
- Factor the NUMA-locality branch out of `generate_docker_args` into
  `CPUPlugin._resolve_node_local_mem`, so the main method stays flat and
  the helper can be unit-tested without plumbing a full `device_alloc`.
- Collapse the four scenario-specific tests into a single
  `_NumaScenario`-keyed fixture + parametrized tests — one directly against
  `_resolve_node_local_mem`, one end-to-end through `generate_docker_args`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rapsealk and others added 8 commits April 24, 2026 22:25
Set `CpusetMems` in Docker HostConfig when all allocated CPU cores reside
on a single NUMA node, so container memory is pinned to the same node as
its CPUs. The key is omitted when the allocation spans multiple nodes or
when NUMA info is unavailable on the host.

Closes #11217

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds unit coverage for generate_docker_args() setting HostConfig.CpusetMems
only when the CPU allocation is fully within a single NUMA node, and
omitting the key for multi-node or unknown/negative NUMA mappings.

Refs #11217
Refs #11222

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Query libnuma.node_of_cpu() only for allocated cores instead of calling
  self.list_devices(), which iterates every core on the host on each
  container start. Drops per-start overhead from O(host_cores) to
  O(allocated_cores).
- Reword the CpusetMems comment to reflect that omission for multi-node
  allocations is a deliberate policy choice (defer to kernel default NUMA
  placement) rather than a Docker HostConfig API limitation.
- Update the NUMA-locality unit tests to patch libnuma.node_of_cpu
  directly, matching the new implementation seam.

Refs #11217
Refs #11222

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mbstone

Addresses review feedback on PR #11222:
- Skip CpusetMems entirely when libnuma reports NUMA unsupported or
  the host has a single node, so non-NUMA Linux hosts and macOS/WSL
  dev environments do not get CpusetMems="0" pinned on every container.
- Add a unit test covering the non-NUMA short-circuit branch.
- Remove the twin commented-out CpusetMems tombstone from
  src/ai/backend/agent/dummy/intrinsic.py.

Refs #11217
Refs #11222

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Remove unused local_config + _docker mock from the NUMA-locality
  cpu_plugin fixture: CPUPlugin.generate_docker_args doesn't read them.
- Replace the case_id parametrize column with pytest's `ids=` kwarg;
  pytest already labels failures via the parametrize id.

Refs #11217
Refs #11222

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ests

Address review comments from @jopemachine on PR #11222:
- Factor the NUMA-locality branch out of `generate_docker_args` into
  `CPUPlugin._resolve_node_local_mem`, so the main method stays flat and
  the helper can be unit-tested without plumbing a full `device_alloc`.
- Collapse the four scenario-specific tests into a single
  `_NumaScenario`-keyed fixture + parametrized tests — one directly against
  `_resolve_node_local_mem`, one end-to-end through `generate_docker_args`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Post-review follow-ups on PR #11222:
- Add an `empty_allocation` scenario to `_NUMA_SCENARIOS` so the `cores=[]`
  path through `_resolve_node_local_mem` / `generate_docker_args` is locked
  in (must not emit `CpusetMems`).
- Document in `kubernetes/intrinsic.py::CPUPlugin.generate_docker_args` why
  the NUMA pinning from the Docker backend is not mirrored (k8s drives
  node-local memory via the kubelet Topology Manager).
- Document in `dummy/intrinsic.py::CPUPlugin.generate_docker_args` why
  dummy intentionally skips the NUMA pinning (never runs real containers).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rapsealk rapsealk force-pushed the perf/11217-numa-cpuset-mems branch from c5c3971 to 5732246 Compare April 24, 2026 13:26
…quests

The previous comment implied Kubernetes was benefiting from the kubelet
Topology Manager for NUMA memory placement, but this backend does not
emit per-container CPU/memory requests/limits today — so Topology Manager
has no hook to act on (Guaranteed-QoS pods are required). Rewrite the
comment to state that accurately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rapsealk rapsealk requested a review from jopemachine April 24, 2026 13:43
The optimization only fires when the CPU allocation fits on a single NUMA
node. Under the default `INTERLEAVED` affinity policy, multi-core
allocations spread across nodes and the change is a no-op. Make this
explicit so operators know to pair the upgrade with `PREFER_SINGLE_NODE`
if they want the latency win for multi-core workloads.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rapsealk rapsealk changed the title refactor(agent): Pin container memory to NUMA-local node via CpusetMems refactor(BA-5857): Pin container memory to NUMA-local node via CpusetMems Apr 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp:agent Related to Agent component size:L 100~500 LoC

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Enable CpusetMems when CPU allocation is NUMA-local

3 participants