
Commit 3867a9b

made changes to reflect feedback from comments.

1 parent 5f074f6 commit 3867a9b

File tree

1 file changed: +175 −42 lines

  • keps/sig-scheduling/5004-dra-extended-resource

keps/sig-scheduling/5004-dra-extended-resource/README.md (+175 −42)
@@ -9,9 +9,15 @@
 - [Proposal](#proposal)
 - [Design Details](#design-details)
   - [Resource Slice API](#resource-slice-api)
+  - [Device Class API](#device-class-api)
   - [Resource Claim API](#resource-claim-api)
   - [Pod API](#pod-api)
   - [Scheduling for Dynamic Extended Resource](#scheduling-for-dynamic-extended-resource)
+    - [EventsToRegister](#eventstoregister)
+    - [Score](#score)
+    - [Reserve](#reserve)
+    - [Prebind](#prebind)
+    - [Unreserve](#unreserve)
   - [Actuation for Dynamic Extended Resource](#actuation-for-dynamic-extended-resource)
 - [Test Plan](#test-plan)
   - [Prerequisite testing updates](#prerequisite-testing-updates)
@@ -97,8 +103,8 @@ the node is dynamically allocated to the pod, with the remaining 7 GPUs left for
 allocation for future requests from either extended resource, or DRA resource claim.
 
 Note that another node in the same cluster has installed device plugin, which
-may have advertised e.g. 'nvidia.com/gpu: 2' in its `Node`'s Capacity. The same
-`Deployment` can possibly be scheduled and run on this node too.
+may have advertised e.g. 'example.com/gpu: 2' in its `Node`'s Capacity. The same
+`Deployment` can possibly be scheduled and run on that node too.
 
 ```yaml
 apiVersion: apps/v1
@@ -122,14 +128,14 @@ spec:
         args: ["nvidia-smi && tail -f /dev/null"]
         resources:
           limits:
-            nvidia.com/gpu: 1
+            example.com/gpu: 1
 ```
 
 ```yaml
 apiVersion: resource.k8s.io/v1beta1
 kind: ResourceSlice
 metadata:
-  name: gke-drabeta-n1-standard-4-2xt4-346fe653-zrw2-gpu.nvidia.coqj92d
+  name: gke-drabeta-n1-standard-4-2xt4-346fe653-zrw2-gpu.coqj92d
 spec:
   devices:
   - basic:
@@ -148,9 +154,9 @@ spec:
       name: gpu-6
   - basic:
       name: gpu-7
-  driver: gpu.nvidia.com
+  driver: gpu.example.com
   nodeName: gke-drabeta-n1-standard-4-2xt4-346fe653-zrw2
-  extendedResourceName: nvidia.com/gpu
+  extendedResourceName: example.com/gpu
 ```
 
 ```yaml
@@ -181,7 +187,7 @@ status:
     hugepages-2Mi: "0"
     memory: 15335536Ki
     pods: "110"
-    nvidia.com/gpu: 2
+    example.com/gpu: 2
 ```
 
@@ -190,8 +196,8 @@ non-goals of this KEP.
 
 ### Goals
 
-* Introduce the ability for DRA to advertise extended resources listed in a
-  ResourceSlice, and for the scheduler to consider them for allocation.
+* Introduce the ability for DRA to advertise extended resources, and for the
+  scheduler to consider them for allocation.
 
 * Enable application operators to use the existing extended resource request in
   pod spec to request for DRA resources.
@@ -200,10 +206,17 @@ non-goals of this KEP.
 for the short term. Its ease of use is one big advantage to keep it remaining
 useful for the long term.
 
+* The device plugin API must not change. Existing device plugin drivers must
+  continue working without change.
+
+* DRA driver API changes must be minimal, if any. Core Kubernetes
+  (kube-scheduler, kubelet) is preferred over the DRA driver for any change
+  needed to support the feature.
+
 ### Non-Goals
 
-* Simplify DRA driver developement. The DRA driver needs to support both DRA
-  and extended resource API. This KEP adds complexity and cost to the driver.
+* Minimize kubelet or kube-scheduler changes. The feature requires necessary
+  changes in both scheduling and actuation.
 
 ## Proposal
 
@@ -241,7 +254,7 @@ resource, and dynamic extended resource.
 `ResourceSlice` to provide resource capacity. A pod asks for resources through
 resource claim requests in pod's spec.resources.claims. Dynamic resource type
 is described in resource slice, simply speaking, it is a list of devices, with
-each device being described as structured paramaters.
+each device being described as structured parameters.
 * dynamic extended resource is a combination of the two above. It uses pods'
   spec.containers[].resources.requests to request for resources, and uses
   `ResourceSlice` to provide resource capacity. Hence, it is of type: string,
@@ -312,24 +325,123 @@ advertised as the given extended resource name. If a device has a different
 extended resource name than that given in the `ResourceSlice`, the device's
 extended resource name is used for that device.
 
+### Device Class API
+The extended resource name to device mapping can be specified at
+`DeviceClassSpec`, instead of at the `Device` in the `ResourceSlice` API as shown
+in the section above. The same extended resource name can be given to different
+device classes, and one device class can have at most one extended resource name.
+
+```go
+// DeviceClassSpec is used in a [DeviceClass] to define what can be allocated
+// and how to configure it.
+type DeviceClassSpec struct {
+	// Each selector must be satisfied by a device which is claimed via this class.
+	//
+	// +optional
+	// +listType=atomic
+	Selectors []DeviceSelector `json:"selectors,omitempty" protobuf:"bytes,1,opt,name=selectors"`
+
+	// Config defines configuration parameters that apply to each device that is claimed via this class.
+	// Some classes may potentially be satisfied by multiple drivers, so each instance of a vendor
+	// configuration applies to exactly one driver.
+	//
+	// They are passed to the driver, but are not considered while allocating the claim.
+	//
+	// +optional
+	// +listType=atomic
+	Config []DeviceClassConfiguration `json:"config,omitempty" protobuf:"bytes,2,opt,name=config"`
+
+	// ExtendedResourceName is the extended resource name
+	// the device class is advertised as. It must be a DNS label.
+	// All devices matched by the device class can be used to satisfy the
+	// extended resource requests in pod's spec.
+	//
+	// +optional
+	ExtendedResourceName string
+}
+```
+
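To make the mapping concrete, here is a hedged sketch of a `DeviceClass` using the proposed field. The class name, CEL selector, and driver name are illustrative assumptions, not taken from the KEP:

```yaml
# Illustrative sketch only: the class name, selector, and driver are made up.
apiVersion: resource.k8s.io/v1beta1
kind: DeviceClass
metadata:
  name: gpu.example.com
spec:
  selectors:
  - cel:
      expression: device.driver == "gpu.example.com"
  # All devices matching this class satisfy example.com/gpu requests.
  extendedResourceName: example.com/gpu
```

With this in place, a pod requesting `example.com/gpu` in its container resources could be satisfied by any device matched by this class, without the ResourceSlice itself naming the extended resource.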
 ### Resource Claim API
-There is no API change on `ResourceClaim`, i.e. no new API type. However, a special
-resource claim object is created to keep track of device allocations for dyanmic
-extended resource. The special resource claim object has following properties:
+A new field `ExtendedResults` of type
+`DeviceExtendedResourceRequestAllocationResult` is added to hold the allocated
+devices for the extended resources. The existing field `Results` cannot be
+reused directly without breaking downgrade.
+
+```go
+// DeviceAllocationResult is the result of allocating devices.
+type DeviceAllocationResult struct {
+	// Results lists all allocated devices.
+	//
+	// +optional
+	// +listType=atomic
+	Results []DeviceRequestAllocationResult
+	// ExtendedResults lists all allocated devices for extended resource
+	// requests.
+	//
+	// +optional
+	// +listType=atomic
+	ExtendedResults []DeviceExtendedResourceRequestAllocationResult
+}
+
+// DeviceExtendedResourceRequestAllocationResult contains the allocation result
+// for an extended resource request.
+type DeviceExtendedResourceRequestAllocationResult struct {
+	// ExtendedResourceName is the extended resource name the devices are
+	// allocated for.
+	//
+	// +required
+	ExtendedResourceName string `json:"extendedResourceName" protobuf:"bytes,1,name=extendedResourceName"`
+
+	// Driver specifies the name of the DRA driver whose kubelet
+	// plugin should be invoked to process the allocation once the claim is
+	// needed on a node.
+	//
+	// Must be a DNS subdomain and should end with a DNS domain owned by the
+	// vendor of the driver.
+	//
+	// +required
+	Driver string `json:"driver" protobuf:"bytes,2,name=driver"`
+
+	// This name together with the driver name and the device name field
+	// identify which device was allocated (`<driver name>/<pool name>/<device name>`).
+	//
+	// Must not be longer than 253 characters and may contain one or more
+	// DNS sub-domains separated by slashes.
+	//
+	// +required
+	Pool string `json:"pool" protobuf:"bytes,3,name=pool"`
+
+	// Device references one device instance via its name in the driver's
+	// resource pool. It must be a DNS label.
+	//
+	// +required
+	Device string `json:"device" protobuf:"bytes,4,name=device"`
+
+	// AdminAccess indicates that this device was allocated for
+	// administrative access. See the corresponding request field
+	// for a definition of mode.
+	//
+	// This is an alpha field and requires enabling the DRAAdminAccess
+	// feature gate. Admin access is disabled if this field is unset or
+	// set to false, otherwise it is enabled.
+	//
+	// +optional
+	// +featureGate=DRAAdminAccess
+	AdminAccess *bool `json:"adminAccess" protobuf:"bytes,5,name=adminAccess"`
+}
+```
+
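As a sketch of where the new field would surface, a claim status with devices allocated for an extended resource might look like the following. The device, pool, and driver names are illustrative assumptions:

```yaml
# Illustrative sketch only: device, pool, and driver names are made up.
status:
  allocation:
    devices:
      extendedResults:
      - extendedResourceName: example.com/gpu
        driver: gpu.example.com
        pool: gke-drabeta-n1-standard-4-2xt4-346fe653-zrw2
        device: gpu-0
```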
+A special resource claim object is created to keep track of device allocations for
+extended resource. The resource claim object has the following properties:
 
 * It is namespace scoped, like other resource claim objects.
 * It is owned by a pod, like other resource claim objects.
 * It has null `spec`.
 * Its `status.allocation.devices` and `status.allocation.reservedFor` are
   used.
-* It has annotation `resource.kubernetes.io/extended-resource-name:`, and it
-  does not have annotation `resource.kubernetes.io/pod-claim-name:`
-
-```yaml
-metadata:
-  annotations:
-    resource.kubernetes.io/extended-resource-name: foo.domain/bar
-```
+* It does not have annotation `resource.kubernetes.io/pod-claim-name:` as
+  it is created for the extended resource request in a pod spec, not for a
+  claim in the pod spec.
 
 The special resource claim object lifecycle is managed by the scheduler and
 garbage collector.
@@ -338,36 +450,48 @@ garbage collector.
 request, and the extended resource is advertised by `ResourceSlice` and
 scheduler has fit the pod to a node with the `ResourceSlice`.
 * It is *created* by the scheduler dynamic extended resource plugin during
-  pre-bind phase. The in-memory one in the assumed cache is created earlier
+  preBind phase. The in-memory one in the assumed cache is created earlier
   during reserve phase.
 * It is *deleted* together with the owning pod's deletion.
+* It is *deleted* by the scheduler dynamic extended resource plugin during
+  unReserve phase.
 * It is *read* by the scheduler dynamic resource plugin for the devices allocated,
   so that the scheduler removes these devices from consideration when allocating
   for other DRA resource claim requests in the 'dynamic resource plugin'.
 * It is *read* by the kubelet DRA device driver to prepare the devices listed
   therein when preparing to run the pod.
 
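Pulling the properties above together, the special claim object might look roughly like this. This is a sketch under the stated properties; the object names are illustrative and the exact shape is not specified by this diff:

```yaml
# Illustrative sketch only: names are made up; not a normative object.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: ccc-gpu-57999b9c4c-vpq68-gpu-8s27z
  namespace: default
  ownerReferences:             # owned by the pod, like other claims
  - apiVersion: v1
    kind: Pod
    name: ccc-gpu-57999b9c4c-vpq68
    controller: true
  # note: no resource.kubernetes.io/pod-claim-name annotation
spec: {}                       # null spec
status:
  allocation:
    devices: {}                # allocated devices are recorded here
    reservedFor: []            # reservation for the owning pod
```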
### Pod API
351-
There is no API change on `Pod`. Pod's status.resourceClaimStatuses tracks the
352-
special resouceclaim object created for the dynamic extended resource requests
353-
in the pod. The dynamic extended resource name is used in the status. For
354-
example, if a pod has requested for foo.domain/bar, and it is scheduled to run
355-
on a node that has advertised foo.domain/bar in `ResourceSlice`, then the pod's
356-
status is like below:
465+
466+
A new field `extendedResourceClaimStatuses` is added to Pod's status to track
467+
the special resouceclaim object created for the dynamic extended resource requests
468+
in the pod. The dynamic extended resource name is used in the status. For example,
469+
if a pod has requested for foo.domain/bar, and it is scheduled to run on a node
470+
that has advertised foo.domain/bar in `ResourceSlice`, then the pod's status is
471+
like below:
357472

358473

359474
```yaml
360475
status:
361-
resourceClaimStatuses:
476+
extendedResourceClaimStatuses:
362477
- name: foo.domain/bar
363478
resourceClaimName: ccc-gpu-57999b9c4c-vpq68-gpu-8s27z
364479
```
480+
Note the validations for extendedResourceClaimStatuses are different from the
481+
validations for resourceClaimStatuses.
482+
483+
1. resourceClaimStatuses requires `name` must be DNS label,
484+
extendedResourceClaimStatuses's name does not need to be DNS label.
485+
1. resourceClaimStatuses requires `name` must be one of the claim's name in the
486+
pod spec. extendedResourceClaimStatuses requires `name` must be one of the
487+
extended resource name in the pod spec.
365488

 ### Scheduling for Dynamic Extended Resource
 
-A new field `DynamicResources` is added to `Resource`, it works similar to
-ScalarResources. It is used to keep track of the dynamic extended resources on a
-node, i.e. those that are advertised by `ResourceSlice`.
+A new field `DynamicResources` is added to
+[`Resource`](https://github.com/kubernetes/kubernetes/blob/c81431de59a3bf516489317433a165b050322339/pkg/scheduler/framework/types.go#L798).
+It works similarly to ScalarResources, and is used to keep track of the dynamic
+extended resources on a node, i.e. those that are advertised by `ResourceSlice`.
 
 ```go
 type Resource struct {
@@ -380,19 +504,28 @@ type Resource struct {
 	// ScalarResources
 	ScalarResources map[v1.ResourceName]int64
 
-// NEW!
-// DynamicResources: keep track of dynamic extended resources
+	// NEW!
+	// DynamicResources: keep track of dynamic extended resources
 	DynamicResources map[v1.ResourceName]int64
 }
 ```
 
 type `NodeInfo` is used by scheduler to keep track of the information for each
 node in memory. Its `Allocatable` field is used to keep track of the allocatable
-resources in memory. For a node with extended resources, its NodeInfo's
-Allocatable.ScalarResources is updated with the `Node`'s informer, minus the
-used. For a node with dynamic extended resources, its NodeInfo's
-Allocatable.DynamicResources is updated with the `ResourceSlice`'s informer,
-minus used by either dynamic extended resource or resource claims.
+resources in memory. At the beginning of each scheduling cycle, the scheduler
+takes a snapshot of all the nodes in the cluster, and updates their
+corresponding `NodeInfo`.
+
+For the scheduler with DRA enabled, right after taking the node snapshot, the
+scheduler also takes a snapshot of `ResourceClaims` and `ResourceSlices`, and
+updates `NodeInfo.DynamicResources` if the node has resources backed by DRA
+`ResourceSlice`.
+
+For a node with extended resources, its NodeInfo's
+Allocatable.ScalarResources is updated from the k8s `Node` object.
+For a node with dynamic extended resources, its NodeInfo's
+Allocatable.DynamicResources is updated based on DRA `ResourceSlice` and
+`ResourceClaim` objects.
 
 The existing 'noderesources' plugin needs to be modified, such that a pod's
 extended resource request is checked against a NodeInfo's ScalarResources if the
@@ -439,7 +572,7 @@ is scheduled to run, the following are particularly important:
 1. Kubelet tries to admit the pod; the pod's dynamic extended resource requests
    should not be checked against the `Node`'s allocatable, as the resources are
    in `ResourceSlice`, not in `Node`. Instead, kubelet needs to follow the admit
-   process for the speical `ResourceClaim`.
+   process for the special `ResourceClaim`.
 
 1. Kubelet passes the special `ResourceClaim` to the DRA driver to prepare the
    devices, in the same way as that for a normal `ResourceClaim`.
