refactor(BA-5857): Pin container memory to NUMA-local node via CpusetMems by rapsealk · Pull Request #11222 · lablup/backend.ai

rapsealk · 2026-04-22T04:25:25Z

Summary

Set HostConfig.CpusetMems to the NUMA node ID when all allocated CPU cores are on the same node.
Omit CpusetMems for multi-node allocations so the kernel's default NUMA memory placement policy applies.
Short-circuit on libnuma.num_nodes() <= 1 so non-NUMA Linux hosts and macOS / WSL / Docker Desktop don't get CpusetMems="0" pinned on every container.
Use per-allocated-core libnuma.node_of_cpu() instead of a host-wide list_devices() scan.

Why

Pre-PR the NUMA memory pinning code was commented out, so CPU-only workloads on NUMA hosts could pay 2-3× remote-memory access latency. Part of the agent-runtime performance epic #11216.

Test plan

pants check src/ai/backend/agent/docker:: and pants test tests/unit/agent:: pass
On a NUMA host, verify container HostConfig.CpusetMems reflects the allocation node
On a single-socket / non-NUMA host, verify CpusetMems is absent

Copilot

Pull request overview

Adds NUMA-aware memory pinning for CPU-only Docker containers by setting HostConfig.CpusetMems when the allocated CPU cores are all on the same NUMA node, and omitting it otherwise.

Changes:

Build HostConfig in CPUPlugin.generate_docker_args() and conditionally set CpusetMems when allocation is NUMA-local.
Add a changelog entry for the enhancement.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
`src/ai/backend/agent/docker/intrinsic.py`	Detects allocated cores’ NUMA node(s) and sets `HostConfig.CpusetMems` only for single-node allocations.
`changes/11217.enhance.md`	Documents the NUMA-local `CpusetMems` behavior change.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-22T04:29:44Z

+        devices = await self.list_devices()
+        core_to_node = {int(dev.device_id): dev.numa_node for dev in devices}
+        allocated_nodes: set[int] = set()
+        for core in cores:
+            node = core_to_node.get(core)


generate_docker_args() now calls await self.list_devices() just to map allocated core IDs to NUMA nodes. list_devices() iterates all available cores and calls libnuma.node_of_cpu() for each, so this adds per-container startup overhead that scales with host CPU count. Consider computing the NUMA node only for the allocated cores (e.g., call libnuma.node_of_cpu(core) inside the loop), or cache a core_id -> numa_node mapping on the plugin instance and reuse it.

Suggested change

devices = await self.list_devices()

core_to_node = {int(dev.device_id): dev.numa_node for dev in devices}

allocated_nodes: set[int] = set()

for core in cores:

node = core_to_node.get(core)

allocated_nodes: set[int] = set()

for core in cores:

node = libnuma.node_of_cpu(core)

Copilot · 2026-04-22T04:29:44Z

+        # Pin memory to the NUMA node only when the allocation is fully node-local;
+        # Docker does not expose a multi-node cpuset.mems equivalent via HostConfig.


The comment says Docker does not expose a multi-node cpuset.mems equivalent via HostConfig, but HostConfig.CpusetMems accepts the same cpuset list/range syntax as cpuset.mems (e.g., "0,1"). If the intent is to only pin when node-local, reword this to reflect that it’s a deliberate policy choice (and why), not a Docker API limitation.

Suggested change

# Pin memory to the NUMA node only when the allocation is fully node-local;

# Docker does not expose a multi-node cpuset.mems equivalent via HostConfig.

# Pin memory only when the CPU allocation is fully node-local.

# For multi-node CPU allocations, intentionally leave CpusetMems unset

# so Docker/kernel default NUMA memory placement policy can apply.

Adds unit coverage for generate_docker_args() setting HostConfig.CpusetMems only when the CPU allocation is fully within a single NUMA node, and omitting the key for multi-node or unknown/negative NUMA mappings. Refs #11217 Refs #11222 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Query libnuma.node_of_cpu() only for allocated cores instead of calling self.list_devices(), which iterates every core on the host on each container start. Drops per-start overhead from O(host_cores) to O(allocated_cores). - Reword the CpusetMems comment to reflect that omission for multi-node allocations is a deliberate policy choice (defer to kernel default NUMA placement) rather than a Docker HostConfig API limitation. - Update the NUMA-locality unit tests to patch libnuma.node_of_cpu directly, matching the new implementation seam. Refs #11217 Refs #11222 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…mbstone Addresses review feedback on PR #11222: - Skip CpusetMems entirely when libnuma reports NUMA unsupported or the host has a single node, so non-NUMA Linux hosts and macOS/WSL dev environments do not get CpusetMems="0" pinned on every container. - Add a unit test covering the non-NUMA short-circuit branch. - Remove the twin commented-out CpusetMems tombstone from src/ai/backend/agent/dummy/intrinsic.py. Refs #11217 Refs #11222 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

rapsealk · 2026-04-22T06:04:49Z

Review summary from an independent pass + resolutions pushed to this branch.

Verdict: ship with changes — addressed.

Finding	Resolution
`libnuma.node_of_cpu` returns `0` on non-NUMA/non-Linux hosts → every container would get `CpusetMems="0"` pinned	Short-circuit on `libnuma.num_nodes() > 1` before the per-core loop (commit `0eda461ec`)
Perf: `list_devices()` iterates every host core just to build a lookup	Switched to per-core `libnuma.node_of_cpu(core)` (commit `64529b24a`)
Comment claimed Docker `HostConfig` can't express multi-node `CpusetMems` — factually wrong	Reworded as deliberate policy: defer to kernel default NUMA placement for multi-node allocations (commit `64529b24a`)
Twin tombstone in `dummy/intrinsic.py:148`	Removed (commit `0eda461ec`)
No unit tests for the new branching	4 tests added: single-node, multi-node, unknown/negative-node, non-NUMA short-circuit

Scope stayed tight: only docker/intrinsic.py, dummy/intrinsic.py, and tests/unit/agent/test_docker_intrinsic.py touched. libnuma.node_of_cpu is a ctypes call into an in-memory lookup, so calling it synchronously inside generate_docker_args is fine per the agent's async rules. Scheduler default is AffinityPolicy.PREFER_SINGLE_NODE, so the common case hits the pin-to-node branch; multi-node correctly falls through to kernel default rather than forcing "0,1" across all nodes.

- Remove unused local_config + _docker mock from the NUMA-locality cpu_plugin fixture: CPUPlugin.generate_docker_args doesn't read them. - Replace the case_id parametrize column with pytest's `ids=` kwarg; pytest already labels failures via the parametrize id. Refs #11217 Refs #11222 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jopemachine · 2026-04-22T09:41:09Z

+        # Skip CpusetMems entirely when NUMA is unsupported (non-Linux hosts,
+        # Linux without libnuma.so, Docker Desktop, WSL, etc.) or when the host
+        # exposes a single node; libnuma.node_of_cpu would otherwise fall back
+        # to 0 and cause every container to be pinned to "CpusetMems": "0".
+        if libnuma.num_nodes() > 1:
+            allocated_nodes: set[int] = set()
+            for core in cores:
+                node = libnuma.node_of_cpu(core)
+                if node < 0:
+                    allocated_nodes.clear()
+                    break
+                allocated_nodes.add(node)
+            # Pin memory only when the CPU allocation is fully node-local.
+            # For multi-node CPU allocations, intentionally leave CpusetMems unset
+            # so Docker/kernel default NUMA memory placement policy can apply.
+            if len(allocated_nodes) == 1:
+                host_config["CpusetMems"] = str(next(iter(allocated_nodes)))


How about extracting this logic into a separate private method?

jopemachine · 2026-04-22T09:44:14Z

+
+    @staticmethod
+    @contextmanager
+    def _patch_node_of_cpu(


Since this method is intended to reproduce specific test scenarios, it would be more readable to inject fixtures for each scenario and apply parametrization.

@jopemachine

…ests Address review comments from @jopemachine on PR #11222: - Factor the NUMA-locality branch out of `generate_docker_args` into `CPUPlugin._resolve_node_local_mem`, so the main method stays flat and the helper can be unit-tested without plumbing a full `device_alloc`. - Collapse the four scenario-specific tests into a single `_NumaScenario`-keyed fixture + parametrized tests — one directly against `_resolve_node_local_mem`, one end-to-end through `generate_docker_args`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Set `CpusetMems` in Docker HostConfig when all allocated CPU cores reside on a single NUMA node, so container memory is pinned to the same node as its CPUs. The key is omitted when the allocation spans multiple nodes or when NUMA info is unavailable on the host. Closes #11217 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds unit coverage for generate_docker_args() setting HostConfig.CpusetMems only when the CPU allocation is fully within a single NUMA node, and omitting the key for multi-node or unknown/negative NUMA mappings. Refs #11217 Refs #11222 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Query libnuma.node_of_cpu() only for allocated cores instead of calling self.list_devices(), which iterates every core on the host on each container start. Drops per-start overhead from O(host_cores) to O(allocated_cores). - Reword the CpusetMems comment to reflect that omission for multi-node allocations is a deliberate policy choice (defer to kernel default NUMA placement) rather than a Docker HostConfig API limitation. - Update the NUMA-locality unit tests to patch libnuma.node_of_cpu directly, matching the new implementation seam. Refs #11217 Refs #11222 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…mbstone Addresses review feedback on PR #11222: - Skip CpusetMems entirely when libnuma reports NUMA unsupported or the host has a single node, so non-NUMA Linux hosts and macOS/WSL dev environments do not get CpusetMems="0" pinned on every container. - Add a unit test covering the non-NUMA short-circuit branch. - Remove the twin commented-out CpusetMems tombstone from src/ai/backend/agent/dummy/intrinsic.py. Refs #11217 Refs #11222 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Remove unused local_config + _docker mock from the NUMA-locality cpu_plugin fixture: CPUPlugin.generate_docker_args doesn't read them. - Replace the case_id parametrize column with pytest's `ids=` kwarg; pytest already labels failures via the parametrize id. Refs #11217 Refs #11222 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@jopemachine

…ests Address review comments from @jopemachine on PR #11222: - Factor the NUMA-locality branch out of `generate_docker_args` into `CPUPlugin._resolve_node_local_mem`, so the main method stays flat and the helper can be unit-tested without plumbing a full `device_alloc`. - Collapse the four scenario-specific tests into a single `_NumaScenario`-keyed fixture + parametrized tests — one directly against `_resolve_node_local_mem`, one end-to-end through `generate_docker_args`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Post-review follow-ups on PR #11222: - Add an `empty_allocation` scenario to `_NUMA_SCENARIOS` so the `cores=[]` path through `_resolve_node_local_mem` / `generate_docker_args` is locked in (must not emit `CpusetMems`). - Document in `kubernetes/intrinsic.py::CPUPlugin.generate_docker_args` why the NUMA pinning from the Docker backend is not mirrored (k8s drives node-local memory via the kubelet Topology Manager). - Document in `dummy/intrinsic.py::CPUPlugin.generate_docker_args` why dummy intentionally skips the NUMA pinning (never runs real containers). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…quests The previous comment implied Kubernetes was benefiting from the kubelet Topology Manager for NUMA memory placement, but this backend does not emit per-container CPU/memory requests/limits today — so Topology Manager has no hook to act on (Guaranteed-QoS pods are required). Rewrite the comment to state that accurately. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The optimization only fires when the CPU allocation fits on a single NUMA node. Under the default `INTERLEAVED` affinity policy, multi-core allocations spread across nodes and the change is a no-op. Make this explicit so operators know to pair the upgrade with `PREFER_SINGLE_NODE` if they want the latency win for multi-core workloads. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

rapsealk added this to the 26.5 milestone Apr 22, 2026

Copilot AI review requested due to automatic review settings April 22, 2026 04:25

github-actions Bot added the size:S 10~30 LoC label Apr 22, 2026

github-actions Bot assigned rapsealk Apr 22, 2026

github-actions Bot added the comp:agent Related to Agent component label Apr 22, 2026

Copilot started reviewing on behalf of rapsealk April 22, 2026 04:26 View session

Copilot AI reviewed Apr 22, 2026

View reviewed changes

rapsealk changed the title ~~enhance(agent): pin container memory to NUMA-local node via CpusetMems~~ refactor(agent): Pin container memory to NUMA-local node via CpusetMems Apr 22, 2026

github-actions Bot added size:L 100~500 LoC and removed size:S 10~30 LoC labels Apr 22, 2026

rapsealk requested review from a team and achimnol April 22, 2026 06:17

jopemachine reviewed Apr 22, 2026

View reviewed changes

rapsealk and others added 8 commits April 24, 2026 22:25

chore(changes): Rename news fragment to PR number

f7a22a1

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

rapsealk force-pushed the perf/11217-numa-cpuset-mems branch from c5c3971 to 5732246 Compare April 24, 2026 13:26

rapsealk requested a review from jopemachine April 24, 2026 13:43

rapsealk changed the title ~~refactor(agent): Pin container memory to NUMA-local node via CpusetMems~~ refactor(BA-5857): Pin container memory to NUMA-local node via CpusetMems Apr 27, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(BA-5857): Pin container memory to NUMA-local node via CpusetMems#11222

refactor(BA-5857): Pin container memory to NUMA-local node via CpusetMems#11222
rapsealk wants to merge 10 commits into
mainfrom
perf/11217-numa-cpuset-mems

rapsealk commented Apr 22, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 22, 2026

Uh oh!

Copilot AI Apr 22, 2026

Uh oh!

Uh oh!

rapsealk commented Apr 22, 2026

Uh oh!

jopemachine Apr 22, 2026

Uh oh!

jopemachine Apr 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		# Pin memory to the NUMA node only when the allocation is fully node-local;
		# Docker does not expose a multi-node cpuset.mems equivalent via HostConfig.

Conversation

rapsealk commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

rapsealk commented Apr 22, 2026

Uh oh!

jopemachine Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

jopemachine Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

rapsealk commented Apr 22, 2026 •

edited

Loading