
Commit 38a2e05

made changes to reflect feedbacks from comments.
1 parent 5f074f6 commit 38a2e05

File tree

1 file changed: +174 −42
  • keps/sig-scheduling/5004-dra-extended-resource


keps/sig-scheduling/5004-dra-extended-resource/README.md

@@ -9,9 +9,15 @@
 - [Proposal](#proposal)
 - [Design Details](#design-details)
   - [Resource Slice API](#resource-slice-api)
+  - [Device Class API](#device-class-api)
   - [Resource Claim API](#resource-claim-api)
   - [Pod API](#pod-api)
   - [Scheduling for Dynamic Extended Resource](#scheduling-for-dynamic-extended-resource)
+    - [EventsToRegister](#eventstoregister)
+    - [Score](#score)
+    - [Reserve](#reserve)
+    - [Prebind](#prebind)
+    - [Unreserve](#unreserve)
   - [Actuation for Dynamic Extended Resource](#actuation-for-dynamic-extended-resource)
 - [Test Plan](#test-plan)
   - [Prerequisite testing updates](#prerequisite-testing-updates)
@@ -97,8 +103,8 @@ the node is dynamically allocated to the pod, with the remaining 7 GPUs left for
 allocation for future requests from either extended resource, or DRA resource claim.
 
 Note that another node in the same cluster has installed device plugin, which
-may have advertised e.g. 'nvidia.com/gpu: 2' in its `Node`'s Capacity. The same
-`Deployment` can possibly be scheduled and run on this node too.
+may have advertised e.g. 'example.com/gpu: 2' in its `Node`'s Capacity. The same
+`Deployment` can possibly be scheduled and run on that node too.
 
 ```yaml
 apiVersion: apps/v1
@@ -122,14 +128,14 @@ spec:
         args: ["nvidia-smi && tail -f /dev/null"]
         resources:
           limits:
-            nvidia.com/gpu: 1
+            example.com/gpu: 1
 ```
 
 ```yaml
 apiVersion: resource.k8s.io/v1beta1
 kind: ResourceSlice
 metadata:
-  name: gke-drabeta-n1-standard-4-2xt4-346fe653-zrw2-gpu.nvidia.coqj92d
+  name: gke-drabeta-n1-standard-4-2xt4-346fe653-zrw2-gpu.coqj92d
 spec:
   devices:
   - basic:
@@ -148,9 +154,9 @@ spec:
       name: gpu-6
   - basic:
       name: gpu-7
-  driver: gpu.nvidia.com
+  driver: gpu.example.com
   nodeName: gke-drabeta-n1-standard-4-2xt4-346fe653-zrw2
-  extendedResourceName: nvidia.com/gpu
+  extendedResourceName: example.com/gpu
 ```
 
 ```yaml
@@ -181,7 +187,7 @@ status:
     hugepages-2Mi: "0"
     memory: 15335536Ki
     pods: "110"
-    nvidia.com/gpu: 2
+    example.com/gpu: 2
 ```
 
 
@@ -190,8 +196,8 @@ non-goals of this KEP.
 
 ### Goals
 
-* Introduce the ability for DRA to advertise extended resources listed in a
-  ResourceSlice, and for the scheduler to consider them for allocation.
+* Introduce the ability for DRA to advertise extended resources, and for the
+  scheduler to consider them for allocation.
 
 * Enable application operators to use the existing extended resource request in
   pod spec to request for DRA resources.
@@ -200,10 +206,17 @@ non-goals of this KEP.
 for the short term. Its ease of use is one big advantage to keep it remaining
 useful for the long term.
 
+* Device plugin API must not change. The existing device plugin drivers must
+  continue working without change.
+
+* DRA driver API change must be minimal, if there is any. Core Kubernetes
+  (kube-scheduler, kubelet) is preferred over the DRA driver for any change needed
+  to support the feature.
+
 ### Non-Goals
 
-* Simplify DRA driver developement. The DRA driver needs to support both DRA
-  and extended resource API. This KEP adds complexity and cost to the driver.
+* Minimize kubelet or kube-scheduler changes. The feature requires necessary
+  changes in both scheduling and actuation.
 
 ## Proposal
 
@@ -241,7 +254,7 @@ resource, and dynamic extended resource.
   `ResourceSlice` to provide resource capacity. A pod asks for resources through
   resource claim requests in pod's spec.resources.claims. Dynamic resource type
   is described in resource slice, simply speaking, it is a list of devices, with
-  each device being described as structured paramaters.
+  each device being described as structured parameters.
 * dynamic extended resource is a combination of the two above. It uses pods'
   spec.containers[].resources.requests to request for resources, and uses
   `ResourceSlice` to provide resource capacity. Hence, it is of type: string,
@@ -312,24 +325,122 @@ advertised as the given extended resource name. If a device has a different
 extended resource name than that given in the `ResoureSlice`, the device's
 extended resource name is used for that device.
 
+### Device Class API
+The extended resource name to device mapping can be specified in the device class
+spec, instead of at the `Device` in the `ResourceSlice` API as shown in the section
+above.
+
+```go
+// DeviceClassSpec is used in a [DeviceClass] to define what can be allocated
+// and how to configure it.
+type DeviceClassSpec struct {
+	// Each selector must be satisfied by a device which is claimed via this class.
+	//
+	// +optional
+	// +listType=atomic
+	Selectors []DeviceSelector `json:"selectors,omitempty" protobuf:"bytes,1,opt,name=selectors"`
+
+	// Config defines configuration parameters that apply to each device that is claimed via this class.
+	// Some classes may potentially be satisfied by multiple drivers, so each instance of a vendor
+	// configuration applies to exactly one driver.
+	//
+	// They are passed to the driver, but are not considered while allocating the claim.
+	//
+	// +optional
+	// +listType=atomic
+	Config []DeviceClassConfiguration `json:"config,omitempty" protobuf:"bytes,2,opt,name=config"`
+
+	// ExtendedResourceName is the extended resource name
+	// the device class is advertised as. It must be a DNS label.
+	// All devices matched by the device class can be used to satisfy the
+	// extended resource requests in pod's spec.
+	//
+	// +optional
+	ExtendedResourceName string
+}
+```
+
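As an illustration of the field above, a `DeviceClass` using it might look like the following. This is an editor's sketch, not part of the diff: the class name, driver name, and selector are assumed for the example, and only `extendedResourceName` comes from this KEP.

```yaml
apiVersion: resource.k8s.io/v1beta1
kind: DeviceClass
metadata:
  name: gpu.example.com
spec:
  selectors:
  - cel:
      expression: device.driver == "gpu.example.com"
  # Devices matched by this class can satisfy pod requests for example.com/gpu.
  extendedResourceName: example.com/gpu
```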
 ### Resource Claim API
-There is no API change on `ResourceClaim`, i.e. no new API type. However, a special
-resource claim object is created to keep track of device allocations for dyanmic
-extended resource. The special resource claim object has following properties:
+A new field `ExtendedResults` of type
+`DeviceExtendedResourceRequestAllocationResult` is added to hold the allocated
+devices for the extended resources. The existing field `Results` cannot be
+reused directly without breaking downgrade.
+
+```go
+// DeviceAllocationResult is the result of allocating devices.
+type DeviceAllocationResult struct {
+	// Results lists all allocated devices.
+	//
+	// +optional
+	// +listType=atomic
+	Results []DeviceRequestAllocationResult
+	// ExtendedResults lists all allocated devices for extended resource
+	// requests.
+	//
+	// +optional
+	// +listType=atomic
+	ExtendedResults []DeviceExtendedResourceRequestAllocationResult
+}
+
+// DeviceExtendedResourceRequestAllocationResult contains the allocation result
+// for extended resource request.
+type DeviceExtendedResourceRequestAllocationResult struct {
+	// ExtendedResourceName is the extended resource name the devices are
+	// allocated for.
+	//
+	// +required
+	ExtendedResourceName string `json:"extendedResourceName" protobuf:"bytes,1,name=extendedResourceName"`
+
+	// Driver specifies the name of the DRA driver whose kubelet
+	// plugin should be invoked to process the allocation once the claim is
+	// needed on a node.
+	//
+	// Must be a DNS subdomain and should end with a DNS domain owned by the
+	// vendor of the driver.
+	//
+	// +required
+	Driver string `json:"driver" protobuf:"bytes,2,name=driver"`
+
+	// This name together with the driver name and the device name field
+	// identify which device was allocated (`<driver name>/<pool name>/<device name>`).
+	//
+	// Must not be longer than 253 characters and may contain one or more
+	// DNS sub-domains separated by slashes.
+	//
+	// +required
+	Pool string `json:"pool" protobuf:"bytes,3,name=pool"`
+
+	// Device references one device instance via its name in the driver's
+	// resource pool. It must be a DNS label.
+	//
+	// +required
+	Device string `json:"device" protobuf:"bytes,4,name=device"`
+
+	// AdminAccess indicates that this device was allocated for
+	// administrative access. See the corresponding request field
+	// for a definition of mode.
+	//
+	// This is an alpha field and requires enabling the DRAAdminAccess
+	// feature gate. Admin access is disabled if this field is unset or
+	// set to false, otherwise it is enabled.
+	//
+	// +optional
+	// +featureGate=DRAAdminAccess
+	AdminAccess *bool `json:"adminAccess" protobuf:"bytes,5,name=adminAccess"`
+}
+```
+
+A special resource claim object is created to keep track of device allocations for
+extended resource. The resource claim object has the following properties:
 
 * It is namespace scoped, like other resource claim objects.
 * It is owned by a pod, like other resource claim objects.
 * It has null `spec`.
 * Its `status.allocation.devices` and `status.allocation.reservedFor` are
   used.
-* It has annotation `resource.kubernetes.io/extended-resource-name:`, and it
-  does not have annotation `resource.kubernetes.io/pod-claim-name:`
-
-```yaml
-metadata:
-  annotations:
-    resource.kubernetes.io/extended-resource-name: foo.domain/bar
-```
+* It does not have annotation `resource.kubernetes.io/pod-claim-name:` as
+  it is created for the extended resource request in a pod spec, not for a
+  claim in the pod spec.
 
 The special resource claim object lifecycle is managed by the scheduler and
 garbage collector.
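Putting the properties above together, the special claim object might look roughly like this. This is an editor's sketch only: the pod name is hypothetical (the claim name is taken from the pod status example below), `uid` fields are omitted, and the exact status layout is whatever the resource.k8s.io allocation types define.

```yaml
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: ccc-gpu-57999b9c4c-vpq68-gpu-8s27z
  namespace: default
  ownerReferences:
  - apiVersion: v1
    kind: Pod
    name: ccc-gpu-57999b9c4c-vpq68   # hypothetical owning pod
    controller: true
# spec is null: the object exists only to record the allocation.
status:
  allocation:
    devices:
      extendedResults:
      - extendedResourceName: example.com/gpu
        driver: gpu.example.com
        pool: gke-drabeta-n1-standard-4-2xt4-346fe653-zrw2
        device: gpu-0
    reservedFor:
    - resource: pods
      name: ccc-gpu-57999b9c4c-vpq68
```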
@@ -338,36 +449,48 @@ garbage collector.
   request, and the extended resource is advertised by `ResourceSlice` and
   scheduler has fit the pod to a node with the `ResourceSlice`.
 * It is *created* by the scheduler dynamic extended resource plugin during
-  pre-bind phase. The in-memory one in the assumed cache is created earlier
+  preBind phase. The in-memory one in the assumed cache is created earlier
   during reserve phase.
 * It is *deleted* together with the owning pod's deletion.
+* It is *deleted* by the scheduler dynamic extended resource plugin during
+  unReserve phase.
 * It is *read* by scheduler dynamic resource plugin for the devices allocated,
   so that the scheduler remove considerations for allocation of these devices for
   other DRA resource claim requests in 'dynamic resource plugin'.
 * It is *read* by the kubelet DRA device driver to prepare the devices listed
   therein when preparing to run the pod.
 
 ### Pod API
-There is no API change on `Pod`. Pod's status.resourceClaimStatuses tracks the
-special resouceclaim object created for the dynamic extended resource requests
-in the pod. The dynamic extended resource name is used in the status. For
-example, if a pod has requested for foo.domain/bar, and it is scheduled to run
-on a node that has advertised foo.domain/bar in `ResourceSlice`, then the pod's
-status is like below:
+
+A new field `extendedResourceClaimStatuses` is added to Pod's status to track
+the special resource claim object created for the dynamic extended resource requests
+in the pod. The dynamic extended resource name is used in the status. For example,
+if a pod has requested for foo.domain/bar, and it is scheduled to run on a node
+that has advertised foo.domain/bar in `ResourceSlice`, then the pod's status is
+like below:
 
 
 ```yaml
 status:
-  resourceClaimStatuses:
+  extendedResourceClaimStatuses:
   - name: foo.domain/bar
     resourceClaimName: ccc-gpu-57999b9c4c-vpq68-gpu-8s27z
 ```
+Note the validations for extendedResourceClaimStatuses are different from the
+validations for resourceClaimStatuses.
+
+1. resourceClaimStatuses requires `name` must be a DNS label;
+   extendedResourceClaimStatuses's `name` does not need to be a DNS label.
+1. resourceClaimStatuses requires `name` must be one of the claim names in the
+   pod spec; extendedResourceClaimStatuses requires `name` must be one of the
+   extended resource names in the pod spec.
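The two validation rules can be sketched as follows. This is an editor's illustration under stated assumptions: `isDNSLabel` and `validExtendedStatusName` are hypothetical helper names, not the actual apiserver validation code.

```go
package main

import (
	"fmt"
	"regexp"
)

// dnsLabel matches an RFC 1123 DNS label: lowercase alphanumerics and '-',
// starting and ending with an alphanumeric.
var dnsLabel = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)

// isDNSLabel is the shape of the check resourceClaimStatuses applies to `name`.
func isDNSLabel(s string) bool {
	return len(s) <= 63 && dnsLabel.MatchString(s)
}

// validExtendedStatusName is the shape of the check for
// extendedResourceClaimStatuses: the name need not be a DNS label, but it
// must be one of the extended resource names requested in the pod spec.
func validExtendedStatusName(name string, requested map[string]bool) bool {
	return requested[name]
}

func main() {
	requested := map[string]bool{"foo.domain/bar": true}
	// "foo.domain/bar" could never be a resourceClaimStatuses name, but it
	// is a valid extendedResourceClaimStatuses name for this pod.
	fmt.Println(isDNSLabel("foo.domain/bar"))                         // false
	fmt.Println(validExtendedStatusName("foo.domain/bar", requested)) // true
}
```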

 ### Scheduling for Dynamic Extended Resource
 
-A new field `DynamicResources` is added to `Resource`, it works similar to
-ScalarResources. It is used to keep track of the dynamic extended resources on a
-node, i.e. those that are advertised by `ResourceSlice`.
+A new field `DynamicResources` is added to
+[`Resource`](https://github.com/kubernetes/kubernetes/blob/c81431de59a3bf516489317433a165b050322339/pkg/scheduler/framework/types.go#L798);
+it works similarly to ScalarResources. It is used to keep track of the dynamic extended
+resources on a node, i.e. those that are advertised by `ResourceSlice`.
 
 ```go
 type Resource struct {
@@ -380,19 +503,28 @@ type Resource struct {
	// ScalarResources
	ScalarResources map[v1.ResourceName]int64
 
-// NEW!
-// DynamicResources: keep track of dynamic extended resources
+	// NEW!
+	// DynamicResources: keep track of dynamic extended resources
	DynamicResources map[v1.ResourceName]int64
 }
 ```
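To illustrate how the new map might be consulted during a fit check, here is a self-contained sketch. It is an editor's illustration with trimmed-down stand-in types; `fits` is a hypothetical helper, not the actual kube-scheduler code.

```go
package main

import "fmt"

// ResourceName stands in for v1.ResourceName in this self-contained sketch.
type ResourceName string

// Resource is a trimmed-down stand-in for the scheduler framework type,
// keeping only the two maps discussed above.
type Resource struct {
	ScalarResources  map[ResourceName]int64
	DynamicResources map[ResourceName]int64
}

// fits reports whether a request for n units of name can be satisfied,
// checking device-plugin capacity (ScalarResources) first and falling back
// to DRA-backed capacity (DynamicResources).
func fits(alloc Resource, name ResourceName, n int64) bool {
	if c, ok := alloc.ScalarResources[name]; ok {
		return n <= c
	}
	if c, ok := alloc.DynamicResources[name]; ok {
		return n <= c
	}
	return false
}

func main() {
	// This node advertises example.com/gpu through ResourceSlice, so its
	// capacity lives in DynamicResources rather than ScalarResources.
	alloc := Resource{DynamicResources: map[ResourceName]int64{"example.com/gpu": 8}}
	fmt.Println(fits(alloc, "example.com/gpu", 1)) // true
	fmt.Println(fits(alloc, "example.com/gpu", 9)) // false
}
```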

 type `NodeInfo` is used by scheduler to keep track of the information for each
 node in memory. Its `Allocatable` field is used to keep track of the allocatable
-resources in memory. For a node with extended resources, its NodeInfo's
-Allocatable.ScalarResources is updated with the `Node`'s informer, minus the
-used. For a node with dynamic extended resources, its NodeInfo's
-Allocatable.DynamicResources is updated with the `ResourceSlice`'s informer,
-minus used by either dynamic extended resource or resource claims.
+resources in memory. At the beginning of each scheduling cycle, the scheduler takes
+a snapshot of all the nodes in the cluster, and updates their corresponding
+`NodeInfo`.
+
+For the scheduler with DRA enabled, right after taking the node snapshot, the
+scheduler also takes a snapshot of `ResourceClaims` and `ResourceSlice`, and
+updates `NodeInfo.DynamicResources` if the node has resources backed by DRA
+`ResourceSlice`.
+
+For a node with extended resources, its NodeInfo's
+Allocatable.ScalarResources is updated from the k8s `Node` object.
+For a node with dynamic extended resources, its NodeInfo's
+Allocatable.DynamicResources is updated based on DRA `ResourceSlice` and
+`ResourceClaim` objects.
 
 The existing 'noderesources' plugin needs to be modified, such that a pod's
 extended resource request is checked against a NodeInfo's ScalarResources if the
@@ -439,7 +571,7 @@ is scheduled to run, the following are particularly important:
 1. Kubelet tries to admit the pod, the pod's dynamic extended resources requests
    should not be checked against the `Node`'s allocatable, as the resources are
    in `ResourceSlice`, not in `Node`. Instead, kubelet needs to follow the admit
-   process for the speical `ResourceClaim`.
+   process for the special `ResourceClaim`.
 
 1. Kubelet passes the special `ResoureClaim` to DRA driver to prepare the
    devices, in the same way as that for normal `ResourceClaim`.
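The admission rule above can be sketched as follows. This is an editor's illustration only; the function and map names are hypothetical and not the actual kubelet code.

```go
package main

import "fmt"

// admitPath returns which capacity a pod-level extended resource request is
// checked against at kubelet admission time: resources backed by a DRA
// ResourceSlice go through the special ResourceClaim prepared for the pod,
// everything else through the Node's allocatable.
func admitPath(resource string, backedByResourceSlice map[string]bool) string {
	if backedByResourceSlice[resource] {
		return "special ResourceClaim"
	}
	return "Node allocatable"
}

func main() {
	backed := map[string]bool{"example.com/gpu": true}
	fmt.Println(admitPath("example.com/gpu", backed)) // special ResourceClaim
	fmt.Println(admitPath("cpu", backed))             // Node allocatable
}
```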
