Proposal: Opaque config in ResourceClaim status instead of Individual Mode

**Note:** This is an exploratory proposal. This direction was @johnbelamaric's idea, which came up during a discussion about scalability concerns. Wanted to get general feedback on this direction before exploring it further.

# Background

The DRA CPU driver currently has 2 modes exposed via the `--cpu-device-mode` flag:
1. Grouped Mode (`--cpu-device-mode=grouped`): Exposes CPUs grouped by topology boundaries (e.g., NUMA nodes, sockets). The driver remains in control of selecting which specific CPU cores within the boundaries of the selected device are allocated to the container.
2. Individual Mode (`--cpu-device-mode=individual`): Exposes every CPU core as a distinct device in the `ResourceSlice`. The main benefit is that it allows users or external schedulers to perform fine-grained selection of specific CPU cores at the claim level.

Individual mode technically works with the default kube-scheduler, but scalability challenges prevent it from being used broadly. For all practical purposes, it is primarily meant for external-scheduler use cases where there is a need to select exact CPUs. This mode presents the following scalability challenges:
1. Kube-scheduler allocation logic does not scale well with hundreds of devices per node/ResourceSlice. The bottleneck occurs when a claim requests a general count of devices (e.g., "give me any 4 CPUs") rather than specific device instances (e.g., "give me CPUs 0-3"). A count-based request forces the kube-scheduler's allocation algorithm to search for and choose among the available devices, which performs poorly at scale. This is the main reason why `individual` mode is recommended only with external schedulers.
2. Requesting a large number of devices in a claim easily hits the DRA default claim limit of 32 devices (Issue [#5717](https://github.com/kubernetes/enhancements/issues/5717)). We would need to split requests into multiple claims or increase the default value.
3. Large machines easily hit the limit on the maximum number of devices stored per ResourceSlice (Issue [#5718](https://github.com/kubernetes/enhancements/issues/5718)). We would need to split the devices across multiple ResourceSlices.
4. Publishing one device per CPU requires more etcd storage.
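
To make challenges 2 and 3 concrete, here is a back-of-the-envelope sketch of how many API objects individual mode needs. The 32-device claim limit is the figure cited above; the per-ResourceSlice limit of 128 and the machine sizes are assumptions chosen only for illustration.

```python
import math


def objects_needed(num_cpus: int, per_object_limit: int) -> int:
    """Number of API objects needed when each one holds at most per_object_limit devices."""
    return math.ceil(num_cpus / per_object_limit)


CLAIM_DEVICE_LIMIT = 32   # default per-claim limit cited in the list above
SLICE_DEVICE_LIMIT = 128  # assumed per-ResourceSlice limit, for illustration only

# ResourceSlices needed to publish a 256-CPU machine, one device per CPU.
print(objects_needed(256, SLICE_DEVICE_LIMIT))  # 2

# Claims needed by a workload that wants 100 specific CPUs.
print(objects_needed(100, CLAIM_DEVICE_LIMIT))  # 4
```

With machine grouping the same node is published as one device, so neither count grows with the number of CPUs.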

Additionally, maintaining the two modes creates two separate code paths with very little overlap. https://github.com/kubernetes-sigs/dra-driver-cpu/issues/112 has more discussion, and it was agreed there that individual mode can be split out as a separate driver.

# Alternate Proposal

We could use `grouped` mode with the changes below as an alternative to individual mode to achieve fine-grained allocation:

1.  Along with the currently supported groupings (NUMA node, socket), we introduce a new configuration option: `group-by=machine`. In this mode, the DRA driver exposes all CPUs on the node as a single device with consumable capacity.

Example ResourceSlice:
```yaml
devices:
    - name: cpudevmachine
      allowMultipleAllocations: true
      capacity:
        dra.cpu/cpu:
          value: "256"
```

2. When `group-by=machine` is configured, external schedulers can inject specific CPU requirements into the claim's allocation result at bind/allocation time using an `opaque` entry under `status.allocation.devices.config`. The DRA driver parses this entry and enforces the given CPU mask (`cpuset`) instead of running its own allocation logic.


Example claim:
```yaml
  apiVersion: resource.k8s.io/v1
  kind: ResourceClaim
  metadata:
    name: claim-cpu-10
  spec:
    devices:
      requests:
      - name: req-cpu
        exactly:
          deviceClassName: dra.cpu
          capacity:
            requests:
              dra.cpu/cpu: "10"
  status:
    allocation:
      devices:
        . . .
        config:  # Added by external scheduler during binding
        - source: FromClaim
          requests:
          - req-cpu
          opaque:
            driver: dra.cpu
            parameters:
              cpuset: "2-11"
  ```
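
As a rough sketch of the scheduler side, the config entry in the example claim could be assembled like this. The field names mirror the YAML above; the helper name `build_opaque_config` is hypothetical, and how the scheduler actually writes the entry into the claim status (client library, update or patch call) is deliberately left out.

```python
import json


def build_opaque_config(request_name: str, cpuset: str, driver: str = "dra.cpu") -> dict:
    """Return one entry for status.allocation.devices.config (hypothetical helper)."""
    return {
        "source": "FromClaim",
        "requests": [request_name],
        "opaque": {
            "driver": driver,
            # Driver-specific payload; the driver is expected to parse "cpuset".
            "parameters": {"cpuset": cpuset},
        },
    }


entry = build_opaque_config("req-cpu", "2-11")
print(json.dumps(entry, indent=2))
```

Keeping the payload to a single `cpuset` string means the driver only has to parse one well-known format.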

* DRA driver behavior in `group-by=machine`:
  * ~~**When opaque is not specified**: Fall back to DRA driver allocating CPUs based on claim request~~ [Edit]: With machine grouping, we always require opaque config to be passed. More discussion in the implementation PR.
  * **When opaque is specified**: Validate and allocate the specified cores directly:
    * **Validation**: The driver verifies that the custom cpuset is valid for the host machine and is currently allocatable:
      * The number of CPUs in `cpuset` must exactly match what is requested in the claim (`dra.cpu/cpu: "10"`).
      * `cpuset` is valid for the node.
      * `cpuset` does not include any CPU reserved through the driver's `--reserved-cpus` configuration flag.
      * `cpuset` does not include any CPU already allocated to another active claim.
    * **Error Handling**: If any of the above validations fails, the driver returns a failure immediately in the kubelet's `PrepareResourceClaims` hook, causing pod startup to fail.
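
The validation steps above can be sketched as follows. This is a minimal illustration, not the driver's implementation: it assumes `cpuset` uses the Linux list format (for example `"0-3,8"`), and the function names and the way node, reserved, and allocated CPUs are passed in are all hypothetical.

```python
from typing import Set


def parse_cpuset(spec: str) -> Set[int]:
    """Parse a Linux-style CPU list such as "0-3,8,10-11" into a set of CPU IDs."""
    cpus: Set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty element in cpuset {spec!r}")
        if "-" in part:
            lo_s, hi_s = part.split("-", 1)
            lo, hi = int(lo_s), int(hi_s)  # int() raises ValueError on bad input
            if lo > hi:
                raise ValueError(f"descending range {part!r} in cpuset")
            cpus.update(range(lo, hi + 1))
        else:
            cpus.add(int(part))
    return cpus


def validate_cpuset(spec: str, requested: int, node_cpus: Set[int],
                    reserved: Set[int], allocated: Set[int]) -> Set[int]:
    """Return the parsed CPUs, or raise ValueError naming the first failed check."""
    cpus = parse_cpuset(spec)
    if len(cpus) != requested:
        raise ValueError(f"cpuset has {len(cpus)} CPUs, claim requests {requested}")
    if not cpus <= node_cpus:
        raise ValueError(f"CPUs not present on node: {sorted(cpus - node_cpus)}")
    if cpus & reserved:
        raise ValueError(f"CPUs are reserved: {sorted(cpus & reserved)}")
    if cpus & allocated:
        raise ValueError(f"CPUs already allocated: {sorted(cpus & allocated)}")
    return cpus


# Example: a 256-CPU node with CPUs 0-1 reserved and nothing allocated yet.
ok = validate_cpuset("2-11", 10, set(range(256)), {0, 1}, set())
print(sorted(ok))
```

In this sketch every failure surfaces as a single error, which maps onto returning a failure for the claim from the prepare hook.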

**Proposal: Can we remove `Individual` mode and use the mechanism described above for fine-grained CPU allocation?**

## Pros
* No scalability concerns stemming from device count.
* Simplifies driver logic by avoiding having to maintain 2 code paths.

## Cons
* Without individual mode, we cannot request specific CPUs through claims using the default kube-scheduler.
* If multiple claims have overlapping CPUs, validation fails on the node during the `PrepareResources` phase (after the pod is scheduled), leaving the pod stuck on the node. (This scenario likely represents a bug in the external scheduler code, though.)
* (Maybe more that I haven't thought through)
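
For the overlapping-CPUs case, one possible mitigation on the scheduler side is to keep a per-node record of the CPUs already handed out and refuse to bind a claim that overlaps. The sketch below is purely illustrative; `NodeCPULedger` and its methods are hypothetical and not part of any existing scheduler.

```python
from typing import Dict, Set


class NodeCPULedger:
    """Hypothetical scheduler-side record of CPUs handed out, per node and claim."""

    def __init__(self) -> None:
        # node name -> claim UID -> CPUs assigned to that claim
        self._by_node: Dict[str, Dict[str, Set[int]]] = {}

    def in_use(self, node: str) -> Set[int]:
        """All CPUs currently assigned to any claim on the node."""
        used: Set[int] = set()
        for cpus in self._by_node.get(node, {}).values():
            used |= cpus
        return used

    def reserve(self, node: str, claim_uid: str, cpus: Set[int]) -> None:
        """Record cpus for the claim, or raise ValueError if any are taken."""
        overlap = self.in_use(node) & cpus
        if overlap:
            raise ValueError(f"CPUs {sorted(overlap)} already assigned on {node}")
        self._by_node.setdefault(node, {})[claim_uid] = set(cpus)

    def release(self, node: str, claim_uid: str) -> None:
        """Forget the claim's CPUs, for example once the claim is deallocated."""
        self._by_node.get(node, {}).pop(claim_uid, None)


ledger = NodeCPULedger()
ledger.reserve("node-a", "claim-1", set(range(2, 12)))
try:
    ledger.reserve("node-a", "claim-2", {11, 12})
except ValueError as err:
    print("rejected before binding:", err)
```

A check like this catches the conflict before the pod is bound; the driver-side validation would still be needed as the final guard.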

