
Commit 3867a9b

made changes to reflect feedback from comments.

1 parent 5f074f6 commit 3867a9b

File tree

1 file changed: +175 −42 lines

  • keps/sig-scheduling/5004-dra-extended-resource

keps/sig-scheduling/5004-dra-extended-resource/README.md (+175 −42)
@@ -9,9 +9,15 @@
 - [Proposal](#proposal)
 - [Design Details](#design-details)
   - [Resource Slice API](#resource-slice-api)
+  - [Device Class API](#device-class-api)
   - [Resource Claim API](#resource-claim-api)
   - [Pod API](#pod-api)
   - [Scheduling for Dynamic Extended Resource](#scheduling-for-dynamic-extended-resource)
+    - [EventsToRegister](#eventstoregister)
+    - [Score](#score)
+    - [Reserve](#reserve)
+    - [Prebind](#prebind)
+    - [Unreserve](#unreserve)
   - [Actuation for Dynamic Extended Resource](#actuation-for-dynamic-extended-resource)
 - [Test Plan](#test-plan)
   - [Prerequisite testing updates](#prerequisite-testing-updates)
@@ -97,8 +103,8 @@ the node is dynamically allocated to the pod, with the remaining 7 GPUs left for
 allocation for future requests from either extended resource, or DRA resource claim.
 
 Note that another node in the same cluster has installed device plugin, which
-may have advertised e.g. 'nvidia.com/gpu: 2' in its `Node`'s Capacity. The same
-`Deployment` can possibly be scheduled and run on this node too.
+may have advertised e.g. 'example.com/gpu: 2' in its `Node`'s Capacity. The same
+`Deployment` can possibly be scheduled and run on that node too.
 
 ```yaml
 apiVersion: apps/v1
@@ -122,14 +128,14 @@ spec:
         args: ["nvidia-smi && tail -f /dev/null"]
         resources:
           limits:
-            nvidia.com/gpu: 1
+            example.com/gpu: 1
 ```
 
 ```yaml
 apiVersion: resource.k8s.io/v1beta1
 kind: ResourceSlice
 metadata:
-  name: gke-drabeta-n1-standard-4-2xt4-346fe653-zrw2-gpu.nvidia.coqj92d
+  name: gke-drabeta-n1-standard-4-2xt4-346fe653-zrw2-gpu.coqj92d
 spec:
   devices:
   - basic:
@@ -148,9 +154,9 @@ spec:
       name: gpu-6
   - basic:
       name: gpu-7
-  driver: gpu.nvidia.com
+  driver: gpu.example.com
   nodeName: gke-drabeta-n1-standard-4-2xt4-346fe653-zrw2
-  extendedResourceName: nvidia.com/gpu
+  extendedResourceName: example.com/gpu
 ```
 
 ```yaml
@@ -181,7 +187,7 @@ status:
     hugepages-2Mi: "0"
     memory: 15335536Ki
     pods: "110"
-    nvidia.com/gpu: 2
+    example.com/gpu: 2
 ```
 
@@ -190,8 +196,8 @@ non-goals of this KEP.
 
 ### Goals
 
-* Introduce the ability for DRA to advertise extended resources listed in a
-  ResourceSlice, and for the scheduler to consider them for allocation.
+* Introduce the ability for DRA to advertise extended resources, and for the
+  scheduler to consider them for allocation.
 
 * Enable application operators to use the existing extended resource request in
   pod spec to request for DRA resources.
@@ -200,10 +206,17 @@ non-goals of this KEP.
 for the short term. Its ease of use is one big advantage to keep it remaining
 useful for the long term.
 
+* The device plugin API must not change. Existing device plugin drivers must
+  continue working without change.
+
+* DRA driver API changes must be minimal, if any. Core Kubernetes
+  (kube-scheduler, kubelet) is preferred over the DRA driver for any change
+  needed to support the feature.
+
 ### Non-Goals
 
-* Simplify DRA driver developement. The DRA driver needs to support both DRA
-  and extended resource API. This KEP adds complexity and cost to the driver.
+* Minimize kubelet or kube-scheduler changes. The feature requires necessary
+  changes in both scheduling and actuation.
 
 ## Proposal
 
@@ -241,7 +254,7 @@ resource, and dynamic extended resource.
 `ResourceSlice` to provide resource capacity. A pod asks for resources through
 resource claim requests in pod's spec.resources.claims. Dynamic resource type
 is described in resource slice, simply speaking, it is a list of devices, with
-each device being described as structured paramaters.
+each device being described as structured parameters.
 * dynamic extended resource is a combination of the two above. It uses pods'
   spec.containers[].resources.requests to request for resources, and uses
   `ResourceSlice` to provide resource capacity. Hence, it is of type: string,
@@ -312,24 +325,123 @@ advertised as the given extended resource name. If a device has a different
 extended resource name than that given in the `ResourceSlice`, the device's
 extended resource name is used for that device.
 
+### Device Class API
+The extended resource name to device mapping can be specified at
+`DeviceClassSpec`, instead of at the `Device` in the `ResourceSlice` API as shown
+in the section above. The same extended resource name can be given to different
+device classes, and one device class can have at most one extended resource name.
+
+```go
+// DeviceClassSpec is used in a [DeviceClass] to define what can be allocated
+// and how to configure it.
+type DeviceClassSpec struct {
+	// Each selector must be satisfied by a device which is claimed via this class.
+	//
+	// +optional
+	// +listType=atomic
+	Selectors []DeviceSelector `json:"selectors,omitempty" protobuf:"bytes,1,opt,name=selectors"`
+
+	// Config defines configuration parameters that apply to each device that is claimed via this class.
+	// Some classes may potentially be satisfied by multiple drivers, so each instance of a vendor
+	// configuration applies to exactly one driver.
+	//
+	// They are passed to the driver, but are not considered while allocating the claim.
+	//
+	// +optional
+	// +listType=atomic
+	Config []DeviceClassConfiguration `json:"config,omitempty" protobuf:"bytes,2,opt,name=config"`
+
+	// ExtendedResourceName is the extended resource name
+	// the device class is advertised as. It must be a DNS label.
+	// All devices matched by the device class can be used to satisfy the
+	// extended resource requests in pod's spec.
+	//
+	// +optional
+	ExtendedResourceName string
+}
+```
+
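To make the mapping concrete, here is a hedged sketch of a `DeviceClass` using the proposed field. The class name, CEL selector, and driver name are illustrative assumptions, not taken from the KEP:

```yaml
# Illustrative sketch only: the class name, selector, and driver are made up.
apiVersion: resource.k8s.io/v1beta1
kind: DeviceClass
metadata:
  name: gpu.example.com
spec:
  selectors:
  - cel:
      expression: device.driver == "gpu.example.com"
  # All devices matching this class satisfy example.com/gpu requests.
  extendedResourceName: example.com/gpu
```

With this in place, a pod requesting `example.com/gpu` in its container resources could be satisfied by any device matched by this class, without the ResourceSlice itself naming the extended resource.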
 ### Resource Claim API
-There is no API change on `ResourceClaim`, i.e. no new API type. However, a special
-resource claim object is created to keep track of device allocations for dyanmic
-extended resource. The special resource claim object has following properties:
+A new field `ExtendedResults` of type
+`DeviceExtendedResourceRequestAllocationResult` is added to hold the allocated
+devices for the extended resources. The existing field `Results` cannot be
+reused directly without breaking downgrade.
+
+```go
+// DeviceAllocationResult is the result of allocating devices.
+type DeviceAllocationResult struct {
+	// Results lists all allocated devices.
+	//
+	// +optional
+	// +listType=atomic
+	Results []DeviceRequestAllocationResult
+	// ExtendedResults lists all allocated devices for extended resource
+	// requests.
+	//
+	// +optional
+	// +listType=atomic
+	ExtendedResults []DeviceExtendedResourceRequestAllocationResult
+}
+
+// DeviceExtendedResourceRequestAllocationResult contains the allocation result
+// for an extended resource request.
+type DeviceExtendedResourceRequestAllocationResult struct {
+	// ExtendedResourceName is the extended resource name the devices are
+	// allocated for.
+	//
+	// +required
+	ExtendedResourceName string `json:"extendedResourceName" protobuf:"bytes,1,name=extendedResourceName"`
+
+	// Driver specifies the name of the DRA driver whose kubelet
+	// plugin should be invoked to process the allocation once the claim is
+	// needed on a node.
+	//
+	// Must be a DNS subdomain and should end with a DNS domain owned by the
+	// vendor of the driver.
+	//
+	// +required
+	Driver string `json:"driver" protobuf:"bytes,2,name=driver"`
+
+	// This name together with the driver name and the device name field
+	// identify which device was allocated (`<driver name>/<pool name>/<device name>`).
+	//
+	// Must not be longer than 253 characters and may contain one or more
+	// DNS sub-domains separated by slashes.
+	//
+	// +required
+	Pool string `json:"pool" protobuf:"bytes,3,name=pool"`
+
+	// Device references one device instance via its name in the driver's
+	// resource pool. It must be a DNS label.
+	//
+	// +required
+	Device string `json:"device" protobuf:"bytes,4,name=device"`
+
+	// AdminAccess indicates that this device was allocated for
+	// administrative access. See the corresponding request field
+	// for a definition of mode.
+	//
+	// This is an alpha field and requires enabling the DRAAdminAccess
+	// feature gate. Admin access is disabled if this field is unset or
+	// set to false, otherwise it is enabled.
+	//
+	// +optional
+	// +featureGate=DRAAdminAccess
+	AdminAccess *bool `json:"adminAccess" protobuf:"bytes,5,name=adminAccess"`
+}
+```
+
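As a sketch of where the new field would surface, a claim status with devices allocated for an extended resource might look like the following. The device, pool, and driver names are illustrative assumptions:

```yaml
# Illustrative sketch only: device, pool, and driver names are made up.
status:
  allocation:
    devices:
      extendedResults:
      - extendedResourceName: example.com/gpu
        driver: gpu.example.com
        pool: gke-drabeta-n1-standard-4-2xt4-346fe653-zrw2
        device: gpu-0
```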
+A special resource claim object is created to keep track of device allocations for
+extended resource. The resource claim object has the following properties:
 
 * It is namespace scoped, like other resource claim objects.
 * It is owned by a pod, like other resource claim objects.
 * It has null `spec`.
 * Its `status.allocation.devices` and `status.allocation.reservedFor` are
   used.
-* It has annotation `resource.kubernetes.io/extended-resource-name:`, and it
-  does not have annotation `resource.kubernetes.io/pod-claim-name:`
-
-```yaml
-metadata:
-  annotations:
-    resource.kubernetes.io/extended-resource-name: foo.domain/bar
-```
+* It does not have annotation `resource.kubernetes.io/pod-claim-name:` as
+  it is created for the extended resource request in a pod spec, not for a
+  claim in the pod spec.
 
 The special resource claim object lifecycle is managed by the scheduler and
 garbage collector.
@@ -338,36 +450,48 @@ garbage collector.
 request, and the extended resource is advertised by `ResourceSlice` and
 scheduler has fit the pod to a node with the `ResourceSlice`.
 * It is *created* by the scheduler dynamic extended resource plugin during
-  pre-bind phase. The in-memory one in the assumed cache is created earlier
+  preBind phase. The in-memory one in the assumed cache is created earlier
   during reserve phase.
 * It is *deleted* together with the owning pod's deletion.
+* It is *deleted* by the scheduler dynamic extended resource plugin during
+  unReserve phase.
 * It is *read* by the scheduler dynamic resource plugin for the devices allocated,
   so that the scheduler removes these devices from consideration when allocating
   for other DRA resource claim requests in the 'dynamic resource plugin'.
 * It is *read* by the kubelet DRA device driver to prepare the devices listed
   therein when preparing to run the pod.
 
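Pulling the properties above together, the special claim object might look roughly like this. This is a sketch under the stated properties; the object names are illustrative and the exact shape is not specified by this diff:

```yaml
# Illustrative sketch only: names are made up; not a normative object.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: ccc-gpu-57999b9c4c-vpq68-gpu-8s27z
  namespace: default
  ownerReferences:             # owned by the pod, like other claims
  - apiVersion: v1
    kind: Pod
    name: ccc-gpu-57999b9c4c-vpq68
    controller: true
  # note: no resource.kubernetes.io/pod-claim-name annotation
spec: {}                       # null spec
status:
  allocation:
    devices: {}                # allocated devices are recorded here
    reservedFor: []            # reservation for the owning pod
```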
### Pod API
351-
There is no API change on `Pod`. Pod's status.resourceClaimStatuses tracks the
352-
special resouceclaim object created for the dynamic extended resource requests
353-
in the pod. The dynamic extended resource name is used in the status. For
354-
example, if a pod has requested for foo.domain/bar, and it is scheduled to run
355-
on a node that has advertised foo.domain/bar in `ResourceSlice`, then the pod's
356-
status is like below:
465+
466+
A new field `extendedResourceClaimStatuses` is added to Pod's status to track
467+
the special resouceclaim object created for the dynamic extended resource requests
468+
in the pod. The dynamic extended resource name is used in the status. For example,
469+
if a pod has requested for foo.domain/bar, and it is scheduled to run on a node
470+
that has advertised foo.domain/bar in `ResourceSlice`, then the pod's status is
471+
like below:
357472

358473

359474
```yaml
360475
status:
361-
resourceClaimStatuses:
476+
extendedResourceClaimStatuses:
362477
- name: foo.domain/bar
363478
resourceClaimName: ccc-gpu-57999b9c4c-vpq68-gpu-8s27z
364479
```
480+
Note the validations for extendedResourceClaimStatuses are different from the
481+
validations for resourceClaimStatuses.
482+
483+
1. resourceClaimStatuses requires `name` must be DNS label,
484+
extendedResourceClaimStatuses's name does not need to be DNS label.
485+
1. resourceClaimStatuses requires `name` must be one of the claim's name in the
486+
pod spec. extendedResourceClaimStatuses requires `name` must be one of the
487+
extended resource name in the pod spec.
365488

 ### Scheduling for Dynamic Extended Resource
 
-A new field `DynamicResources` is added to `Resource`, it works similar to
-ScalarResources. It is used to keep track of the dynamic extended resources on a
-node, i.e. those that are advertised by `ResourceSlice`.
+A new field `DynamicResources` is added to
+[`Resource`](https://github.com/kubernetes/kubernetes/blob/c81431de59a3bf516489317433a165b050322339/pkg/scheduler/framework/types.go#L798).
+It works similarly to ScalarResources, and is used to keep track of the dynamic
+extended resources on a node, i.e. those that are advertised by `ResourceSlice`.
 
 ```go
 type Resource struct {
@@ -380,19 +504,28 @@ type Resource struct {
 	// ScalarResources
 	ScalarResources map[v1.ResourceName]int64
 
-// NEW!
-// DynamicResources: keep track of dynamic extended resources
+	// NEW!
+	// DynamicResources: keep track of dynamic extended resources
 	DynamicResources map[v1.ResourceName]int64
 }
 ```
 
 type `NodeInfo` is used by scheduler to keep track of the information for each
 node in memory. Its `Allocatable` field is used to keep track of the allocatable
-resources in memory. For a node with extended resources, its NodeInfo's
-Allocatable.ScalarResources is updated with the `Node`'s informer, minus the
-used. For a node with dynamic extended resources, its NodeInfo's
-Allocatable.DynamicResources is updated with the `ResourceSlice`'s informer,
-minus used by either dynamic extended resource or resource claims.
+resources in memory. At the beginning of each scheduling cycle, the scheduler
+takes a snapshot of all the nodes in the cluster, and updates their
+corresponding `NodeInfo`.
+
+For the scheduler with DRA enabled, right after taking the node snapshot, the
+scheduler also takes a snapshot of `ResourceClaims` and `ResourceSlices`, and
+updates `NodeInfo.DynamicResources` if the node has resources backed by DRA
+`ResourceSlice`.
+
+For a node with extended resources, its NodeInfo's
+Allocatable.ScalarResources is updated from the k8s `Node` object.
+For a node with dynamic extended resources, its NodeInfo's
+Allocatable.DynamicResources is updated based on DRA `ResourceSlice` and
+`ResourceClaim` objects.
 
 The existing 'noderesources' plugin needs to be modified, such that a pod's
 extended resource request is checked against a NodeInfo's ScalarResources if the
@@ -439,7 +572,7 @@ is scheduled to run, the following are particularly important:
 1. Kubelet tries to admit the pod; the pod's dynamic extended resource requests
    should not be checked against the `Node`'s allocatable, as the resources are
    in `ResourceSlice`, not in `Node`. Instead, kubelet needs to follow the admit
-   process for the speical `ResourceClaim`.
+   process for the special `ResourceClaim`.
 
 1. Kubelet passes the special `ResourceClaim` to the DRA driver to prepare the
    devices, in the same way as that for a normal `ResourceClaim`.
