
scx_layered: StickyDynamic fails to allocate CPUs when initial LLC is saturated, despite hundreds of idle cores available on other LLCs #3233

@wongkq

Issue Description
Summary: I am running scx_layered on a large server (512 logical cores). I have configured two Confined layers for two VMs, both using the StickyDynamic growth algorithm. Crucially, I have NOT set any llcs or nodes constraints.

Both VMs start executing on the same NUMA node (likely node 0) due to standard kernel placement. The result: the first layer is successfully allocated 20 CPUs (saturating the local LLC), while the second layer receives 0 CPUs and falls back entirely to Open mode, even though the machine has vast amounts of idle capacity on other NUMA nodes/LLCs.

It appears StickyDynamic is too conservative: when the local domain is full, it fails to "spill over" or migrate the workload to other empty nodes to satisfy the cpus_range minimum.

System Environment:

Hardware: Dual AMD EPYC 9745 (Zen5c)
Topology: 512 logical cores total (SMT enabled).
LLC Layout: 32 logical cores per LLC.
Kernel: 6.18.0-rc4
scx_layered version: 1.0.22-g460ed4d5

Configuration (config.json): no specific topology constraints are defined, only cpus_range.

[
  {
    "name": "test0",
    "matches": [
      [
        {
          "CgroupRegex": ".*test0.*vcpu.*"
        }
      ],
      [
        {
           "CgroupRegex": ".*test0.*emulator.*"
        }
      ]
    ],
    "kind": {
      "Confined": {
        "cpus_range": [20, 32],
        "util_range": [0.1, 1.0],
        "growth_algo": "StickyDynamic",
        "common": { "preempt": true, "slice_us": 2000 }
      }
    }
  },
  {
    "name": "test1",
    "matches": [
      [
        {
          "CgroupRegex": ".*test1.*vcpu.*"
        }
      ],
      [
        {
           "CgroupRegex": ".*test1.*emulator.*"
        }
      ]
    ],
    "kind": {
      "Confined": {
        "cpus_range": [20, 32],
        "util_range": [0.1, 1.0],
        "growth_algo": "StickyDynamic",
        "common": { "preempt": true, "slice_us": 2000 }
      }
    }
  },
  { "name": "default", "matches": [ [] ], "kind": { "Open": {} } }
]
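
For comparison, the collision can presumably be avoided by pinning the two layers to disjoint LLCs explicitly. The fragment below is only a sketch: it assumes the llcs constraint mentioned above accepts a list of LLC indices and can sit alongside growth_algo inside the Confined block, and the index values are chosen purely for illustration (matches sections omitted for brevity, test0 analogous with a different index):

{
  "name": "test1",
  "kind": {
    "Confined": {
      "cpus_range": [20, 32],
      "util_range": [0.1, 1.0],
      "growth_algo": "StickyDynamic",
      "llcs": [1],
      "common": { "preempt": true, "slice_us": 2000 }
    }
  }
}

Even if this works, it is a workaround rather than a fix: it hand-codes exactly the placement decision that StickyDynamic is expected to make on its own when the sticky LLC cannot satisfy the cpus_range minimum.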

Steps to Reproduce:

- Start two VMs (test0 and test1), each with 20 vCPUs.
- Apply a heavy load (e.g., 20 threads) inside both VMs.
- Observe that initially, both VMs are likely scheduled on the first few cores (Node 0) by the default kernel scheduler.
- Start scx_layered with the configuration above.

Observed Behavior:
Layer test0: Allocates 20 CPUs successfully (likely taking up most of LLC 0).
Layer test1: Fails to allocate any CPUs. Monitor shows cpus=0.

###### Thu, 15 Jan 2026 06:19:58 -0500 ######
tot= 349732 local_sel/enq=22.11/ 1.61 open_idle= 0.00 affn_viol= 0.00 hi/lo= 0.00/40.85
busy=  4.9 util/hi/lo= 2352.0/ 0.00/988.4 fallback_cpu/util= 10/ 0.0 proc=15ms sys_util_ewma=  2.7
excl_coll=0.00 excl_preempt=0.00 excl_idle=0.00 excl_wakeup=0.00
skip_preempt=0 antistall=0 fixup_vtime=0 preempting_mismatch=0
gpu_tasks_affinitized=0 gpu_task_affinitization_time=0
  test0: util/open/frac=1360.4/ 0.00/   57.8 prot/prot_preempt= 0.01/ 0.00 tasks=    40
             tot= 206443 local_sel/enq=37.28/ 2.72 enq_dsq=57.28 wake/exp/reenq=60.00/ 0.00/ 0.00 dsq_ewma=33.62
             keep/max/busy= 0.00/ 0.00/ 0.00 yield/ign= 0.00/    0
             open_idle= 0.00 mig=76.59 xnuma_mig= 0.00 xllc_mig/skip= 0.00/ 0.00 affn_viol= 0.00
             preempt/first/xllc/xnuma/idle/fail= 0.00/ 0.00/ 0.00/ 0.00/ 0.00/ 0.00
             xlayer_wake/re= 0.69/ 0.10 llc_drain/try= 0.00/ 0.00 skip_rnode= 0.00
             slice=20ms min_exec= 0.00/   0.00ms
             cpus= 20 [ 20, 20] 00000000,00000000,00000000,00000000,00000000,00000000,00000000,000003ff,00000000,00000000,00000000,00000000,00000000,00000000,00000000,000003ff
             [LLC] nr_cpus: sched% lat_ms
             [000] 20:100.0%   0.18 |  0: 0.00%   0.00 |  0: 0.00%   0.00 |  0: 0.00%   0.00
             [004]  0: 0.00%   0.00 |  0: 0.00%   0.00 |  0: 0.00%   0.00 |  0: 0.00%   0.00
             [008]  0: 0.00%   0.31 |  0: 0.00%   0.00 |  0: 0.00%   0.00 |  0: 0.00%   0.00
             [012]  0: 0.00%   0.00 |  0: 0.00%   0.00 |  0: 0.00%   0.00 |  0: 0.00%   0.00
  test1: util/open/frac= 988.5/100.0/   42.0 prot/prot_preempt= 0.00/ 0.00 tasks=    45
             tot= 142861 local_sel/enq= 0.00/ 0.00 enq_dsq= 0.00 wake/exp/reenq=99.99/ 0.01/ 0.00 dsq_ewma= 0.00
             keep/max/busy= 0.00/ 0.00/ 0.00 yield/ign= 0.00/    0
             open_idle= 0.00 mig= 9.95 xnuma_mig= 0.00 xllc_mig/skip= 0.00/ 0.00 affn_viol= 0.00
             preempt/first/xllc/xnuma/idle/fail= 0.00/ 0.00/ 0.00/ 0.00/ 0.00/ 0.00
             xlayer_wake/re= 8.12/ 0.35 llc_drain/try= 0.00/ 0.00 skip_rnode= 0.00
             slice=20ms min_exec= 0.00/   0.00ms
             cpus=  0 [  0,  0] 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
             [LLC] nr_cpus: sched% lat_ms
             [000]  0: 0.00%   0.00 |  0: 0.00%   0.00 |  0: 0.00%   0.00 |  0: 0.00%   0.00
             [004]  0: 0.00%   0.00 |  0: 0.00%   0.00 |  0: 0.00%   0.00 |  0: 0.00%   0.00
             [008]  0: 0.00%   0.00 |  0: 0.00%   0.00 |  0: 0.00%   0.00 |  0: 0.00%   0.00
             [012]  0: 0.00%   0.00 |  0: 0.00%   0.00 |  0: 0.00%   0.00 |  0: 0.00%   0.00
  default  : util/open/frac=   3.1/ 1.07/    0.1 prot/prot_preempt= 0.01/58.08 tasks=   289
             tot=    428 local_sel/enq=88.55/ 2.80 enq_dsq= 0.00 wake/exp/reenq= 8.64/ 0.00/ 0.00 dsq_ewma= 0.08
             keep/max/busy= 0.00/ 0.00/ 0.00 yield/ign= 0.23/    0
             open_idle= 0.00 mig= 6.31 xnuma_mig= 0.00 xllc_mig/skip= 0.00/ 0.00 affn_viol= 0.00
             preempt/first/xllc/xnuma/idle/fail= 0.00/ 0.00/ 0.00/ 0.00/ 0.00/ 0.00
             xlayer_wake/re= 4.67/ 0.47 llc_drain/try= 0.00/ 0.00 skip_rnode= 0.00
             slice=20ms min_exec= 0.00/   0.00ms
             cpus=492 [492,492] ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,fffffc00,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,fffffc00
             [LLC] nr_cpus: sched% lat_ms
             [000]  0: 0.00%   0.73 |  0: 0.00%   0.45 |  0: 0.00%   0.00 |  0: 0.00%   0.31
             [004]  0: 0.00%   0.00 |  0: 0.00%   0.16 |  0: 0.00%   0.45 |  0: 0.00%   0.00
             [008]  0: 0.00%   0.87 |  0: 0.00%   0.87 |  0: 0.00%   0.31 |  0: 0.00%   0.60
             [012]  0: 0.00%   0.31 |  0: 0.00%   0.00 |  0: 0.00%   0.31 |  0: 0.00%   0.31

Expected Behavior: Since cpus_range has a hard minimum of 20, and the machine has ~400 idle cores on other NUMA nodes/LLCs, StickyDynamic should detect that the current location is saturated and migrate the second layer to an empty LLC/node. Instead, it seems to strictly adhere to the "Sticky" principle, sees 0 available capacity locally, and gives up allocation entirely.
