Note: This is an exploratory proposal. This direction was @johnbelamaric's idea, which came up during a discussion about scalability concerns. Wanted to get general feedback on this direction before exploring it further.
Background
The DRA CPU driver currently has two modes, selected via the --cpu-device-mode flag:
- Grouped Mode (--cpu-device-mode=grouped): Exposes CPUs grouped by topology boundaries (e.g., NUMA nodes, sockets). The driver remains in control of selecting which specific CPU cores, within the boundaries of the selected device, are allocated to the container.
- Individual Mode (--cpu-device-mode=individual): Exposes every CPU core as a distinct device in the ResourceSlice. The main benefit is that it allows users or external schedulers to perform fine-grained selection of specific CPU cores at the claim level.
Although Individual mode technically works with the default kube-scheduler, its scalability challenges prevent it from being used broadly. For all practical purposes, it is primarily meant for external scheduler use cases where there is a need to select exact CPUs. This mode presents the following scalability challenges:
- The kube-scheduler allocation logic does not scale well with hundreds of devices per node/ResourceSlice. The bottleneck occurs when a claim requests a general count of devices (e.g., "give me any 4 CPUs") rather than specific device instances (e.g., "give me CPUs 0-3"). This forces the kube-scheduler's allocation algorithm to search for and choose among the available devices, which performs poorly at scale. This is the main reason why individual mode is recommended only with external schedulers.
- Requesting a large number of devices in a claim easily hits the DRA default claim limit of 32 devices (Issue #5717). We need to split up requests into multiple claims or increase the default value.
- Large machines easily hit limits on the maximum number of devices stored per ResourceSlice (Issue #5718). We need to split the devices across multiple ResourceSlices.
- More etcd storage is consumed, since every CPU is stored as a separate device entry in the ResourceSlice.
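To make the two limits above concrete, here is a rough back-of-the-envelope sketch in Python. The 32-device claim limit comes from the issue referenced above. The 128-devices-per-ResourceSlice figure and the function name are assumptions for illustration and should be checked against the Kubernetes version in use.

```python
import math


def individual_mode_footprint(
    num_cpus: int,
    requested_cpus: int,
    max_devices_per_claim: int = 32,   # default claim limit mentioned above
    max_devices_per_slice: int = 128,  # ASSUMPTION: verify for your cluster version
) -> tuple[int, int]:
    """Return (ResourceSlices needed per node, claims needed per request)
    when every CPU is exposed as its own device."""
    slices = math.ceil(num_cpus / max_devices_per_slice)
    claims = math.ceil(requested_cpus / max_devices_per_claim)
    return slices, claims


# A 256-CPU node and a workload asking for 64 exclusive CPUs.
slices, claims = individual_mode_footprint(num_cpus=256, requested_cpus=64)
print(f"ResourceSlices per node: {slices}, claims per request: {claims}")
```

With the assumed limits, both numbers exceed one, which is the splitting overhead the bullets above describe.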
Additionally, maintaining the two modes creates two separate code paths with very little overlap. #112 has more discussion, and it was agreed there that individual mode can be split out into a separate driver.
Alternate Proposal
We could use grouped mode with the changes below as an alternative to Individual mode to achieve fine-grained allocation:
- Along with the currently supported groupings (NUMA node, socket), we introduce a new configuration option: group-by=machine. In this mode, the DRA driver exposes all CPUs on the node as a single device with consumable capacity.
Example ResourceSlice:
devices:
- name: cpudevmachine
  allowMultipleAllocations: true
  capacity:
    dra.cpu/cpu:
      value: "256"
- When group-by=machine is configured, external schedulers can inject specific CPU requirements into the claim's status allocation configuration at bind/allocation time using status.allocation.devices.config[].opaque. The DRA driver parses this and enforces the CPUMask instead of running its own allocation logic.
Example claim:
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: claim-cpu-10
spec:
  devices:
    requests:
    - name: req-cpu
      exactly:
        deviceClassName: dra.cpu
        capacity:
          requests:
            dra.cpu/cpu: "10"
status:
  allocation:
    devices:
      . . .
      config: # Added by external scheduler during binding
      - source: FromClaim
        requests:
        - req-cpu
        opaque:
          driver: dra.cpu
          parameters:
            cpuset: "2-11"
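As an illustration of what parsing the opaque parameter could look like, here is a minimal Python sketch that expands a cpuset string such as "2-11" into individual CPU IDs. It assumes the Linux-style list format (comma-separated IDs and inclusive ranges). The proposal only shows a single range, so the comma handling and the function name parse_cpuset are assumptions, not the driver's actual code.

```python
def parse_cpuset(cpuset: str) -> set[int]:
    """Expand a cpuset list string (e.g. "0,2-3") into a set of CPU IDs.

    ASSUMPTION: Linux-style list format; ranges are inclusive.
    Raises ValueError on malformed input.
    """
    cpus: set[int] = set()
    for part in cpuset.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty element in cpuset {cpuset!r}")
        if "-" in part:
            start_s, end_s = part.split("-", 1)
            start, end = int(start_s), int(end_s)  # int() rejects non-numeric text
            if start > end:
                raise ValueError(f"descending range {part!r} in cpuset {cpuset!r}")
            cpus.update(range(start, end + 1))
        else:
            cpus.add(int(part))
    return cpus


# The cpuset from the example claim covers CPUs 2 through 11 inclusive,
# which matches the dra.cpu/cpu: "10" capacity request.
requested = parse_cpuset("2-11")
print(sorted(requested), len(requested))
```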
- DRA driver behavior in group-by=machine:
  - When opaque is not specified: The original idea was to fall back to the DRA driver allocating CPUs based on the claim request. [Edit]: With machine grouping, we always require the opaque config to be passed. More discussion in the implementation PR.
  - When opaque is specified: Validate and allocate the specified cores directly:
    - Validation: The driver verifies that the custom cpuset is valid for the host machine and is currently allocatable:
      - The number of CPUs in cpuset must exactly match what is requested in the claim (dra.cpu/cpu: "10").
      - cpuset is valid for the node.
      - cpuset is not already reserved using the driver's --reserved-cpus configuration flag.
      - cpuset is not already allocated to any other active claim.
    - Error Handling: If any of the above validations fails, the driver returns a failure immediately in Kubelet's PrepareResourceClaims hook, causing pod startup to fail.
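The four validation checks above can be sketched as simple set operations. This is only an illustrative Python sketch under stated assumptions: the function name validate_cpuset, its parameters, and the choice to collect all errors rather than stop at the first one are hypothetical, and the CPU sets are assumed to be already parsed into integers.

```python
def validate_cpuset(
    requested: set[int],      # CPUs named in the opaque config
    requested_count: int,     # dra.cpu/cpu value from the claim
    node_cpus: set[int],      # all CPUs present on the node
    reserved: set[int],       # CPUs excluded via --reserved-cpus
    allocated: set[int],      # CPUs held by other active claims
) -> list[str]:
    """Return a list of validation errors; an empty list means the cpuset is acceptable."""
    errors: list[str] = []
    if len(requested) != requested_count:
        errors.append(
            f"cpuset has {len(requested)} CPUs but the claim requests {requested_count}"
        )
    unknown = requested - node_cpus
    if unknown:
        errors.append(f"CPUs not present on this node: {sorted(unknown)}")
    clash = requested & reserved
    if clash:
        errors.append(f"CPUs are reserved: {sorted(clash)}")
    busy = requested & allocated
    if busy:
        errors.append(f"CPUs already allocated to another claim: {sorted(busy)}")
    return errors


# Hypothetical node state: 256 CPUs, CPUs 0-1 reserved, CPUs 12-15 in use.
node = set(range(256))
errs = validate_cpuset(set(range(2, 12)), 10, node, {0, 1}, set(range(12, 16)))
print("ok" if not errs else errs)
```

Any non-empty result would map to the failure returned during claim preparation. Reporting every failed check at once is a design choice in this sketch, intended to make a misbehaving external scheduler easier to debug.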
Proposal: Can we remove Individual mode and use the mechanism described above for fine-grained CPU allocation?
Pros
- No scalability concerns stemming from device count.
- Simplifies driver logic by avoiding having to maintain 2 code paths.
Cons
- Without individual mode, we cannot request specific CPUs through claims using the default kube-scheduler.
- If multiple claims have overlapping CPUs, the validation failure occurs on the node during the PrepareResources phase (after the pod is scheduled), leaving the pod stuck on the node. (This scenario likely represents a bug in the external scheduler code, though.)
- (Maybe more that I haven't thought through)