
[CPU] Bind threads and numa node for each TP rank #6549


Draft · wants to merge 14 commits into base: main
Conversation

chunyuan-w
Contributor

@chunyuan-w chunyuan-w commented May 23, 2025

Motivation

This PR binds threads and the NUMA node for each TP rank when the device is CPU, which improves performance on CPU.
For example, when running --tp 6 on a machine with 6 sub-NUMA nodes, with this PR rank 0 is bound to cores 0-39 on node 0, rank 1 to cores 40-79 on node 1, and so on.

python3 -m sglang.bench_one_batch --batch-size 1 --input 1024 --output 8 --model  deepseek-ai/DeepSeek-R1  --trust-remote-code --device cpu --tp 6
NUMA:                    
  NUMA node(s):          6
  NUMA node0 CPU(s):     0-39,240-279
  NUMA node1 CPU(s):     40-79,280-319
  NUMA node2 CPU(s):     80-119,320-359
  NUMA node3 CPU(s):     120-159,360-399
  NUMA node4 CPU(s):     160-199,400-439
  NUMA node5 CPU(s):     200-239,440-479
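
For illustration, the mapping applied on this machine can be sketched in a few lines of Python. This is a minimal sketch, not the code added by this PR; the helper names physical_cores_per_numa_node and core_range_for_rank are made up here, and it assumes the first range on each "NUMA nodeN CPU(s)" line of lscpu lists the physical cores and the second the logical (SMT) siblings, as in the output above.

import re
import subprocess

def physical_cores_per_numa_node():
    # Parse `lscpu` and return {node_id: "start-end"} of physical cores only.
    out = subprocess.check_output(["lscpu"], text=True)
    nodes = {}
    for line in out.splitlines():
        m = re.match(r"\s*NUMA node(\d+) CPU\(s\):\s*(\S+)", line)
        if m:
            # Keep only the first range (physical cores), drop SMT siblings.
            nodes[int(m.group(1))] = m.group(2).split(",")[0]
    return nodes

def core_range_for_rank(tp_rank, tp_size):
    # Rank i is bound to the physical cores of sub-NUMA node i.
    nodes = physical_cores_per_numa_node()
    if tp_size > len(nodes):
        raise RuntimeError(
            f"tp_size={tp_size} exceeds the {len(nodes)} sub-NUMA nodes on this "
            "machine; set SGLANG_CPU_OMP_THREADS_BIND to bind threads manually."
        )
    return nodes[tp_rank]

With the lscpu output above, core_range_for_rank(0, 6) returns "0-39", core_range_for_rank(1, 6) returns "40-79", and so on, matching the binding described in the Motivation.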

Modifications

  • A new OP init_cpu_threads_env is added in sgl-kernel/csrc/cpu/numa_utils.cpp to bind threads and the NUMA node. A unit test for this OP is added in test/srt/cpu/test_binding.py.
  • sgl-kernel/csrc/cpu/CMakeLists.txt is updated to link the numa library. conda install -y libnuma numactl is needed as a prerequisite and will be added to the Dockerfile for Xeon ([CPU] enable CI for PRs, add Dockerfile and auto build task #6458).
  • Users no longer need to build the vLLM CPU wheel; the message in python/pyproject.toml is updated accordingly.
  • In python/sglang/srt/model_executor/model_runner.py, threads and NUMA node binding are set up according to tp_size. When SGLANG_CPU_OMP_THREADS_BIND is not set, tp_size must not exceed the number of sub-NUMA nodes on the current machine, and each TP rank is bound to one sub-NUMA node; only physical cores are used and logical cores are excluded. If this condition is not met, an error is raised (a rough sketch of this decision flow follows below).
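
The decision flow described in the last bullet can be sketched as follows, reusing core_range_for_rank from the earlier sketch. resolve_omp_cpuid is a hypothetical name chosen here; the actual logic lives in python/sglang/srt/model_executor/model_runner.py, and the exact value format of SGLANG_CPU_OMP_THREADS_BIND is defined by the PR and not assumed here.

import os

def resolve_omp_cpuid(tp_rank, tp_size):
    # Hypothetical helper; mirrors the behavior described in the bullet above.
    manual = os.environ.get("SGLANG_CPU_OMP_THREADS_BIND")
    if manual is not None:
        # A user-provided binding takes precedence over the automatic one;
        # its format is defined by the PR and not reproduced in this sketch.
        return manual
    # Automatic mode: one sub-NUMA node per TP rank, physical cores only,
    # with an error if tp_size exceeds the number of sub-NUMA nodes.
    return core_range_for_rank(tp_rank, tp_size)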

@chunyuan-w chunyuan-w force-pushed the chunyuan/pr_core_bind branch from 105b412 to 7ea2474 on May 23, 2025 07:34
chunyuan-w added 14 commits May 23, 2025 15:34
…numa-dev` (sgl-project#74)

* port utils.cpp for numa binding from vllm into sglang

* fix build

* add pybind for init_cpu_threads_env

* use conda prefix to find libnuma # how to ensure libnuma is installed via conda

* replace init_cpu_threads_env in vllm with that in sgl-kernel

* add vllm into srt_cpu

* fix format
sgl-project#77)

* set local_omp_cpuid automatically

* set self.local_omp_cpuid in __init__

* refine warning message

* add more comments for example output of util functions

* add try except for lscpu

* refine warning message
@mingfeima mingfeima added intel cpu cpu backend performance optimization labels May 28, 2025