Proposal: Opaque config in ResourceClaim status instead of Individual Mode #164

@pravk03

Note: This is an exploratory proposal. The direction was @johnbelamaric's idea and came up during a discussion about scalability concerns. I wanted to get general feedback on it before exploring it further.

Background

The DRA CPU driver currently has 2 modes exposed via the --cpu-device-mode flag:

  1. Grouped Mode (--cpu-device-mode=grouped): Exposes CPUs grouped by topology boundaries (e.g., NUMA nodes, sockets). The driver remains in control of selecting which specific CPU cores within the boundaries of the selected device are allocated to the container.
  2. Individual Mode (--cpu-device-mode=individual): Exposes every CPU core as a distinct device in the ResourceSlice. The main benefit is that it allows users or external schedulers to perform fine-grained selection of specific CPU cores at the claim level.

Individual mode technically works with the default kube-scheduler, but its scalability challenges prevent it from being used broadly. For all practical purposes, it is primarily meant for external-scheduler use cases where exact CPUs must be selected. This mode presents the following scalability challenges:

  1. Kube-scheduler allocation logic does not scale well with hundreds of devices per node/ResourceSlice. The bottleneck occurs when a claim requests a general count of devices (e.g., "give me any 4 CPUs") rather than specific device instances (e.g., "give me CPUs 0-3"). This forces the kube-scheduler's allocation algorithm to search for and choose among the available devices, which performs poorly at scale. This is the main reason individual mode is recommended only with external schedulers.
  2. Requesting a large number of devices in a claim easily hits the DRA default claim limit of 32 devices (Issue #5717). We would need to split requests across multiple claims or increase the default value.
  3. Large machines easily hit the limit on the maximum number of devices stored per ResourceSlice (Issue #5718). We would need to split the devices across multiple ResourceSlices.
  4. More etcd storage, since every CPU is stored as a separate device.
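
To make points 2 and 3 concrete, here is a rough back-of-the-envelope sketch in Python. The function name and the per-ResourceSlice limit of 128 are assumptions for illustration (only the 32-device claim limit is stated above); plug in whatever limits apply to your cluster version.

```python
import math


def individual_mode_footprint(node_cpus, requested_cpus,
                              max_devices_per_claim=32,
                              max_devices_per_slice=128):
    """Estimate how many claims and ResourceSlices individual mode needs.

    max_devices_per_claim defaults to the 32-device limit mentioned above.
    max_devices_per_slice=128 is an assumed value, not taken from this issue.
    """
    return {
        # One device per CPU, so a request for N CPUs needs ceil(N / limit) claims.
        "claims_needed": math.ceil(requested_cpus / max_devices_per_claim),
        # Every CPU on the node is a device that has to fit in some slice.
        "slices_needed": math.ceil(node_cpus / max_devices_per_slice),
    }


# A 256-CPU node where a workload wants 64 exclusive CPUs.
print(individual_mode_footprint(node_cpus=256, requested_cpus=64))
```

Under these assumed limits, a 64-CPU request on a 256-CPU node needs two claims and the node needs two ResourceSlices.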

Additionally, maintaining the two modes creates two separate code paths with very little overlap. #112 has more discussion, and it was agreed there that individual mode can be split out as a separate driver.

Alternate Proposal

We could use grouped mode with the changes below as an alternative to individual mode to achieve fine-grained allocation:

  1. Along with the currently supported groupings (NUMA node, socket), we introduce a new configuration option: group-by=machine. In this mode, the DRA driver exposes all CPUs on the node as a single device with consumable capacity.

Example ResourceSlice:

devices:
    - name: cpudevmachine
      allowMultipleAllocations: true
      capacity:
        dra.cpu/cpu:
          value: "256"

  2. When group-by=machine is configured, external schedulers can inject specific CPU requirements into the claim's allocation configuration at bind/allocation time using the opaque entry under status.allocation.devices.config. The DRA driver parses this and enforces the given cpuset instead of running its own allocation logic.

Example ResourceClaim:

  apiVersion: resource.k8s.io/v1
  kind: ResourceClaim
  metadata:
    name: claim-cpu-10
  spec:
    devices:
      requests:
      - name: req-cpu
        exactly:
          deviceClassName: dra.cpu
          capacity:
            requests:
              dra.cpu/cpu: "10"
  status:
    allocation:
      devices:
        . . .
        config:  # Added by external scheduler during binding
        - source: FromClaim
          requests:
          - req-cpu
          opaque:
            driver: dra.cpu
            parameters:
              cpuset: "2-11"
  • DRA driver behavior in group-by=machine:
    • When opaque is not specified: Fall back to the DRA driver allocating CPUs based on the claim request. [Edit]: With machine grouping, the opaque config is always required. More discussion in the implementation PR.
    • When opaque is specified: Validate and allocate the specified cores directly:
      • Validation: The driver verifies that the custom cpuset is valid for the host machine and is currently allocatable:
        • The number of CPUs in cpuset must exactly match what is requested in the claim (dra.cpu/cpu: "10").
        • cpuset is valid for the node.
        • cpuset does not include CPUs reserved through the driver's --reserved-cpus configuration flag.
        • cpuset is not already allocated to any other active claim.
      • Error Handling: If any of the above validations fails, the driver returns a failure immediately in the kubelet's PrepareResourceClaims hook, causing pod startup to fail.
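
The validation rules above could be sketched roughly as follows. This is an illustrative Python sketch, not the driver's code: parse_cpuset and validate_cpuset are hypothetical names, and the cpuset string is assumed to use the Linux CPU-list format ("2-11", "0-3,8").

```python
def parse_cpuset(spec):
    """Parse a CPU list such as "0-3,8,10-11" into a set of CPU IDs."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty element in cpuset {spec!r}")
        if "-" in part:
            lo_s, hi_s = part.split("-", 1)
            lo, hi = int(lo_s), int(hi_s)  # int() raises ValueError on garbage
            if lo > hi:
                raise ValueError(f"descending range {part!r} in cpuset")
            cpus.update(range(lo, hi + 1))
        else:
            cpus.add(int(part))
    return cpus


def validate_cpuset(spec, requested, node_cpus, reserved, allocated):
    """Apply the four checks listed above; raise ValueError on the first failure.

    requested -- CPU count from the claim (dra.cpu/cpu)
    node_cpus -- set of CPU IDs present on the node
    reserved  -- set of CPU IDs held back via --reserved-cpus
    allocated -- set of CPU IDs already given to other active claims
    """
    cpus = parse_cpuset(spec)
    if len(cpus) != requested:
        raise ValueError(f"cpuset has {len(cpus)} CPUs, claim requests {requested}")
    if cpus - node_cpus:
        raise ValueError(f"CPUs not present on node: {sorted(cpus - node_cpus)}")
    if cpus & reserved:
        raise ValueError(f"CPUs are reserved: {sorted(cpus & reserved)}")
    if cpus & allocated:
        raise ValueError(f"CPUs already allocated: {sorted(cpus & allocated)}")
    return cpus


# Matches the example claim: 10 CPUs requested, cpuset "2-11".
cpus = validate_cpuset("2-11", requested=10, node_cpus=set(range(256)),
                       reserved={0, 1}, allocated=set())
```

In a real driver, any of these errors would be surfaced from the prepare hook so that pod startup fails, as described above.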

Proposal: Can we remove Individual mode and use the mechanism described above for fine-grained CPU allocation?

Pros

  • No scalability concerns stemming from device count.
  • Simplifies driver logic by avoiding having to maintain 2 code paths.

Cons

  • Without individual mode, we cannot request specific CPUs through claims using the default kube-scheduler.
  • If multiple claims have overlapping CPUs, the validation failure occurs on the node during the PrepareResources phase (after the pod is scheduled), leaving the pod stuck on the node. (This scenario likely represents a bug in the external scheduler code, though.)
  • (Maybe more that I haven't thought through)
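
Since overlapping cpusets are only caught on the node, an external scheduler would probably want its own bookkeeping before it writes the opaque config. Below is a minimal Python sketch of that idea, assuming a single-node, in-memory view; CpuLedger, format_cpuset and build_opaque_config are hypothetical helpers, and the returned dict only mirrors the config entry shown in the example claim.

```python
def format_cpuset(cpus):
    """Render a set of CPU IDs as a compact CPU list, e.g. {2..11} -> "2-11"."""
    ranges = []
    start = prev = None
    for cpu in sorted(cpus):
        if start is None:
            start = prev = cpu
        elif cpu == prev + 1:
            prev = cpu
        else:
            ranges.append((start, prev))
            start = prev = cpu
    if start is not None:
        ranges.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


class CpuLedger:
    """Scheduler-side record of which CPUs each claim holds on one node."""

    def __init__(self, node_cpus):
        self.node_cpus = set(node_cpus)
        self.in_use = {}  # claim name -> set of CPU IDs

    def reserve(self, claim_name, cpus):
        cpus = set(cpus)
        if not cpus <= self.node_cpus:
            raise ValueError(f"CPUs not on node: {sorted(cpus - self.node_cpus)}")
        for other, taken in self.in_use.items():
            if cpus & taken:
                raise ValueError(f"{claim_name} overlaps {other}: {sorted(cpus & taken)}")
        self.in_use[claim_name] = cpus


def build_opaque_config(request_name, cpus, driver="dra.cpu"):
    """Build the config entry from the example claim (illustrative shape only)."""
    return {
        "source": "FromClaim",
        "requests": [request_name],
        "opaque": {"driver": driver, "parameters": {"cpuset": format_cpuset(cpus)}},
    }


ledger = CpuLedger(range(256))
ledger.reserve("claim-cpu-10", range(2, 12))
config = build_opaque_config("req-cpu", set(range(2, 12)))
```

A second reserve call that reuses any of CPUs 2-11 raises before anything is written to the claim, which moves the failure from the node back into the scheduler.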
