Issue Description
Summary
I am running scx_layered on a large server (512 logical cores). I have configured two Confined layers for two VMs, both using the StickyDynamic growth algorithm. Crucially, I have NOT set any llcs or nodes constraints.
Both VMs initially start executing on the same NUMA node (likely Node 0) due to standard kernel placement. The result: the first layer is allocated its full 20 CPUs (saturating the local LLC), while the second layer receives 0 CPUs and falls back entirely to Open mode, even though the machine has hundreds of idle cores on other NUMA nodes/LLCs.
It appears StickyDynamic is too conservative: when the local domain is full, it neither "spills over" nor migrates the workload to other empty nodes to satisfy the cpus_range minimum.
System Environment:
Hardware: Dual AMD EPYC 9745 (Zen5c)
Topology: 512 logical cores total (SMT enabled).
LLC Layout: 32 logical cores per LLC.
Kernel: 6.18.0-rc4
scx_layered version: 1.0.22-g460ed4d5
Configuration (config.json)
No specific topology constraints are defined, only cpus_range.
[
  {
    "name": "test0",
    "matches": [
      [
        {
          "CgroupRegex": ".*test0.*vcpu.*"
        }
      ],
      [
        {
          "CgroupRegex": ".*test0.*emulator.*"
        }
      ]
    ],
    "kind": {
      "Confined": {
        "cpus_range": [20, 32],
        "util_range": [0.1, 1.0],
        "growth_algo": "StickyDynamic",
        "common": { "preempt": true, "slice_us": 2000 }
      }
    }
  },
  {
    "name": "test1",
    "matches": [
      [
        {
          "CgroupRegex": ".*test1.*vcpu.*"
        }
      ],
      [
        {
          "CgroupRegex": ".*test1.*emulator.*"
        }
      ]
    ],
    "kind": {
      "Confined": {
        "cpus_range": [20, 32],
        "util_range": [0.1, 1.0],
        "growth_algo": "StickyDynamic",
        "common": { "preempt": true, "slice_us": 2000 }
      }
    }
  },
  { "name": "default", "matches": [ [] ], "kind": { "Open": {} } }
]
Steps to Reproduce:
- Start two VMs (test0 and test1), each with 20 vCPUs.
- Apply a heavy load (e.g., 20 threads) inside both VMs.
- Observe that initially, both VMs are likely scheduled on the first few cores (Node 0) by the default kernel scheduler.
- Start scx_layered with the configuration above.
Observed Behavior:
Layer test0: Allocates 20 CPUs successfully (likely taking up most of LLC 0).
Layer test1: Fails to allocate any CPUs. Monitor shows cpus=0.
###### Thu, 15 Jan 2026 06:19:58 -0500 ######
tot= 349732 local_sel/enq=22.11/ 1.61 open_idle= 0.00 affn_viol= 0.00 hi/lo= 0.00/40.85
busy= 4.9 util/hi/lo= 2352.0/ 0.00/988.4 fallback_cpu/util= 10/ 0.0 proc=15ms sys_util_ewma= 2.7
excl_coll=0.00 excl_preempt=0.00 excl_idle=0.00 excl_wakeup=0.00
skip_preempt=0 antistall=0 fixup_vtime=0 preempting_mismatch=0
gpu_tasks_affinitized=0 gpu_task_affinitization_time=0
test0: util/open/frac=1360.4/ 0.00/ 57.8 prot/prot_preempt= 0.01/ 0.00 tasks= 40
tot= 206443 local_sel/enq=37.28/ 2.72 enq_dsq=57.28 wake/exp/reenq=60.00/ 0.00/ 0.00 dsq_ewma=33.62
keep/max/busy= 0.00/ 0.00/ 0.00 yield/ign= 0.00/ 0
open_idle= 0.00 mig=76.59 xnuma_mig= 0.00 xllc_mig/skip= 0.00/ 0.00 affn_viol= 0.00
preempt/first/xllc/xnuma/idle/fail= 0.00/ 0.00/ 0.00/ 0.00/ 0.00/ 0.00
xlayer_wake/re= 0.69/ 0.10 llc_drain/try= 0.00/ 0.00 skip_rnode= 0.00
slice=20ms min_exec= 0.00/ 0.00ms
cpus= 20 [ 20, 20] 00000000,00000000,00000000,00000000,00000000,00000000,00000000,000003ff,00000000,00000000,00000000,00000000,00000000,00000000,00000000,000003ff
[LLC] nr_cpus: sched% lat_ms
[000] 20:100.0% 0.18 | 0: 0.00% 0.00 | 0: 0.00% 0.00 | 0: 0.00% 0.00
[004] 0: 0.00% 0.00 | 0: 0.00% 0.00 | 0: 0.00% 0.00 | 0: 0.00% 0.00
[008] 0: 0.00% 0.31 | 0: 0.00% 0.00 | 0: 0.00% 0.00 | 0: 0.00% 0.00
[012] 0: 0.00% 0.00 | 0: 0.00% 0.00 | 0: 0.00% 0.00 | 0: 0.00% 0.00
test1: util/open/frac= 988.5/100.0/ 42.0 prot/prot_preempt= 0.00/ 0.00 tasks= 45
tot= 142861 local_sel/enq= 0.00/ 0.00 enq_dsq= 0.00 wake/exp/reenq=99.99/ 0.01/ 0.00 dsq_ewma= 0.00
keep/max/busy= 0.00/ 0.00/ 0.00 yield/ign= 0.00/ 0
open_idle= 0.00 mig= 9.95 xnuma_mig= 0.00 xllc_mig/skip= 0.00/ 0.00 affn_viol= 0.00
preempt/first/xllc/xnuma/idle/fail= 0.00/ 0.00/ 0.00/ 0.00/ 0.00/ 0.00
xlayer_wake/re= 8.12/ 0.35 llc_drain/try= 0.00/ 0.00 skip_rnode= 0.00
slice=20ms min_exec= 0.00/ 0.00ms
cpus= 0 [ 0, 0] 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
[LLC] nr_cpus: sched% lat_ms
[000] 0: 0.00% 0.00 | 0: 0.00% 0.00 | 0: 0.00% 0.00 | 0: 0.00% 0.00
[004] 0: 0.00% 0.00 | 0: 0.00% 0.00 | 0: 0.00% 0.00 | 0: 0.00% 0.00
[008] 0: 0.00% 0.00 | 0: 0.00% 0.00 | 0: 0.00% 0.00 | 0: 0.00% 0.00
[012] 0: 0.00% 0.00 | 0: 0.00% 0.00 | 0: 0.00% 0.00 | 0: 0.00% 0.00
default : util/open/frac= 3.1/ 1.07/ 0.1 prot/prot_preempt= 0.01/58.08 tasks= 289
tot= 428 local_sel/enq=88.55/ 2.80 enq_dsq= 0.00 wake/exp/reenq= 8.64/ 0.00/ 0.00 dsq_ewma= 0.08
keep/max/busy= 0.00/ 0.00/ 0.00 yield/ign= 0.23/ 0
open_idle= 0.00 mig= 6.31 xnuma_mig= 0.00 xllc_mig/skip= 0.00/ 0.00 affn_viol= 0.00
preempt/first/xllc/xnuma/idle/fail= 0.00/ 0.00/ 0.00/ 0.00/ 0.00/ 0.00
xlayer_wake/re= 4.67/ 0.47 llc_drain/try= 0.00/ 0.00 skip_rnode= 0.00
slice=20ms min_exec= 0.00/ 0.00ms
cpus=492 [492,492] ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,fffffc00,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,fffffc00
[LLC] nr_cpus: sched% lat_ms
[000] 0: 0.00% 0.73 | 0: 0.00% 0.45 | 0: 0.00% 0.00 | 0: 0.00% 0.31
[004] 0: 0.00% 0.00 | 0: 0.00% 0.16 | 0: 0.00% 0.45 | 0: 0.00% 0.00
[008] 0: 0.00% 0.87 | 0: 0.00% 0.87 | 0: 0.00% 0.31 | 0: 0.00% 0.60
[012] 0: 0.00% 0.31 | 0: 0.00% 0.00 | 0: 0.00% 0.31 | 0: 0.00% 0.31
Expected Behavior:
Since cpus_range has a hard minimum of 20 and the machine has ~400 idle cores on other NUMA nodes/LLCs, StickyDynamic should detect that the current location is saturated and grow the second layer onto an empty LLC/node. Instead, it appears to adhere strictly to the "sticky" principle: it sees 0 available capacity locally and gives up on allocation entirely.
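For reference, explicitly spreading the two layers across different LLCs should sidestep the problem, at the cost of manual placement. The sketch below is an untested variant of the config above using the llcs constraint mentioned in the summary; the exact key placement (directly inside the Confined block) and the LLC indices 0/1 are assumptions that would need to be checked against the current layer-spec schema.
[
  {
    "name": "test0",
    "matches": [
      [ { "CgroupRegex": ".*test0.*vcpu.*" } ],
      [ { "CgroupRegex": ".*test0.*emulator.*" } ]
    ],
    "kind": {
      "Confined": {
        "cpus_range": [20, 32],
        "util_range": [0.1, 1.0],
        "growth_algo": "StickyDynamic",
        "llcs": [0],
        "common": { "preempt": true, "slice_us": 2000 }
      }
    }
  },
  {
    "name": "test1",
    "matches": [
      [ { "CgroupRegex": ".*test1.*vcpu.*" } ],
      [ { "CgroupRegex": ".*test1.*emulator.*" } ]
    ],
    "kind": {
      "Confined": {
        "cpus_range": [20, 32],
        "util_range": [0.1, 1.0],
        "growth_algo": "StickyDynamic",
        "llcs": [1],
        "common": { "preempt": true, "slice_us": 2000 }
      }
    }
  },
  { "name": "default", "matches": [ [] ], "kind": { "Open": {} } }
]
Having to hard-code topology like this defeats the purpose of StickyDynamic on a machine of this size, hence the report.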