
[CPU] Bind threads and numa node for each TP rank #6549


Draft · wants to merge 14 commits into base: main
Conversation

chunyuan-w
Contributor

@chunyuan-w chunyuan-w commented May 23, 2025

Motivation

This PR binds threads and the NUMA node for each TP rank when the device is CPU, which improves performance on CPU.
For example, when running --tp 6 on a machine with 6 sub-NUMA nodes, with this PR rank 0 is bound to cores 0-39 on node 0, rank 1 to cores 40-79 on node 1, and so on.

python3 -m sglang.bench_one_batch --batch-size 1 --input 1024 --output 8 --model  deepseek-ai/DeepSeek-R1  --trust-remote-code --device cpu --tp 6
NUMA:                    
  NUMA node(s):          6
  NUMA node0 CPU(s):     0-39,240-279
  NUMA node1 CPU(s):     40-79,280-319
  NUMA node2 CPU(s):     80-119,320-359
  NUMA node3 CPU(s):     120-159,360-399
  NUMA node4 CPU(s):     160-199,400-439
  NUMA node5 CPU(s):     200-239,440-479
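
For illustration, the mapping applied on this machine can be sketched in a few lines of Python. This is a minimal sketch, not the code added by this PR; the helper names physical_cores_per_numa_node and core_range_for_rank are made up here, and it assumes the first range on each "NUMA nodeN CPU(s)" line of lscpu lists the physical cores and the second the logical (SMT) siblings, as in the output above.

import re
import subprocess

def physical_cores_per_numa_node():
    # Parse `lscpu` and return {node_id: "start-end"} of physical cores only.
    out = subprocess.check_output(["lscpu"], text=True)
    nodes = {}
    for line in out.splitlines():
        m = re.match(r"\s*NUMA node(\d+) CPU\(s\):\s*(\S+)", line)
        if m:
            # Keep only the first range (physical cores), drop SMT siblings.
            nodes[int(m.group(1))] = m.group(2).split(",")[0]
    return nodes

def core_range_for_rank(tp_rank, tp_size):
    # Rank i is bound to the physical cores of sub-NUMA node i.
    nodes = physical_cores_per_numa_node()
    if tp_size > len(nodes):
        raise RuntimeError(
            f"tp_size={tp_size} exceeds the {len(nodes)} sub-NUMA nodes on this "
            "machine; set SGLANG_CPU_OMP_THREADS_BIND to bind threads manually."
        )
    return nodes[tp_rank]

With the lscpu output above, core_range_for_rank(0, 6) returns "0-39", core_range_for_rank(1, 6) returns "40-79", and so on, matching the binding described in the Motivation.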

Modifications

  • A new OP init_cpu_threads_env is added in sgl-kernel/csrc/cpu/numa_utils.cpp to bind threads and the NUMA node. A unit test for this OP is added in test/srt/cpu/test_binding.py.
  • sgl-kernel/csrc/cpu/CMakeLists.txt is updated to link the numa library. conda install -y libnuma numactl is needed as a prerequisite and will be added to the Dockerfile for Xeon ([CPU] enable CI for PRs, add Dockerfile and auto build task #6458).
  • Users no longer need to build the vLLM CPU wheel; the message in python/pyproject.toml is updated accordingly.
  • In python/sglang/srt/model_executor/model_runner.py, threads and NUMA node binding are set up according to tp_size. When SGLANG_CPU_OMP_THREADS_BIND is not set, tp_size must not exceed the number of sub-NUMA nodes on the current machine, and each TP rank is bound to one sub-NUMA node; only physical cores are used and logical cores are excluded. If this condition is not met, an error is raised (a rough sketch of this decision flow follows below).
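
The decision flow described in the last bullet can be sketched as follows, reusing core_range_for_rank from the earlier sketch. resolve_omp_cpuid is a hypothetical name chosen here; the actual logic lives in python/sglang/srt/model_executor/model_runner.py, and the exact value format of SGLANG_CPU_OMP_THREADS_BIND is defined by the PR and not assumed here.

import os

def resolve_omp_cpuid(tp_rank, tp_size):
    # Hypothetical helper; mirrors the behavior described in the bullet above.
    manual = os.environ.get("SGLANG_CPU_OMP_THREADS_BIND")
    if manual is not None:
        # A user-provided binding takes precedence over the automatic one;
        # its format is defined by the PR and not reproduced in this sketch.
        return manual
    # Automatic mode: one sub-NUMA node per TP rank, physical cores only,
    # with an error if tp_size exceeds the number of sub-NUMA nodes.
    return core_range_for_rank(tp_rank, tp_size)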

@chunyuan-w chunyuan-w force-pushed the chunyuan/pr_core_bind branch from 105b412 to 7ea2474 on May 23, 2025 07:34
chunyuan-w added 14 commits May 23, 2025 15:34
…numa-dev` (sgl-project#74)

* port utils.cpp for numa binding from vllm into sglang

* fix build

* add pybind for init_cpu_threads_env

* use conda prefix to find libnuma # how to ensure libnuma is installed via conda

* replace init_cpu_threads_env in vllm with that in sgl-kernel

* add vllm into srt_cpu

* fix format
sgl-project#77)

* set local_omp_cpuid automatically

* set self.local_omp_cpuid in __init__

* refine warning message

* add more comments for example output of util functions

* add try except for lscpu

* refine warning message
@mingfeima mingfeima added intel cpu cpu backend performance optimization labels May 28, 2025