KEP-5027 + 5055: DRA: admin-controlled device attributes + device taints #5034

Open
wants to merge 13 commits into
base: master
from

Conversation

Contributor

@pohly pohly commented Jan 10, 2025

/cc @johnbelamaric

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 10, 2025
Contributor Author

pohly commented Jan 12, 2025

/cc @KobayashiD27

For the "device priority" use case.

/cc @byako

For device health.

@k8s-ci-robot k8s-ci-robot requested a review from byako January 12, 2025 12:31
@k8s-ci-robot
Contributor

@pohly: GitHub didn't allow me to request PR reviews from the following users: KobayashiD27.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.


Contributor Author

pohly commented Jan 13, 2025

/wg device-management
/sig node

@k8s-ci-robot k8s-ci-robot added the wg/device-management Categorizes an issue or PR as relevant to WG Device Management. label Jan 13, 2025
@pohly pohly mentioned this pull request Jan 20, 2025
4 tasks
@pohly pohly force-pushed the dra-device-attribute-overrides branch from 531a905 to cddc84f Compare January 20, 2025 14:31
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 20, 2025
@pohly pohly force-pushed the dra-device-attribute-overrides branch from cddc84f to c4a6f66 Compare January 20, 2025 14:40
These are two different KEPs that provide two features that can be enabled and
disabled independently. However, both use the same new ResourceSliceOverride
type and thus get described and implemented together.
@pohly pohly force-pushed the dra-device-attribute-overrides branch from c4a6f66 to 41cdbf5 Compare January 20, 2025 14:44
@eero-t eero-t left a comment

There was earlier discussion of common (driver independent) tool(ing) for listing, adding and removing device taints. Would it make sense to mention something about that in the tainting KEP?

@pohly pohly changed the title KEP-5027: DRA: admin-controlled device attributes KEP-5027: DRA: admin-controlled device attributes + device taints Jan 20, 2025
Contributor

@everpeace everpeace left a comment

I deeply appreciate your quick action on the device taints/tolerations KEP! I left some comments. PTAL.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pohly
Once this PR has been reviewed and has the lgtm label, please assign huang-wei, soltysh for approval. For more information see the Code Review Process.


Contributor Author

pohly commented Jan 21, 2025

As discussed today during the WG Device Management meeting, these two KEPs have no impact on the kubelet and thus are better owned by SIG Scheduling alone.

// and/or CEL selectors. All of these criteria must be satisfied by a device, otherwise
// it is ignored by the override. A DeviceOverride with no selection criteria is
// valid and matches all devices.
type DeviceOverride struct {

I wonder if the API might be more human-readable by moving the set of filters/selectors into a separate parent struct, so that the actual override data (Attributes/Capacity) is distinctly organized, and more easily disambiguated.

Contributor Author

@pohly pohly Jan 22, 2025

I moved them into a DeviceOverrideSelector struct which is stored in a Selector field. The actual YAML then would look like this:

apiVersion: resource.k8s.io/v1alpha3
kind: ResourceSliceOverride
metadata:
  ...
spec:
  devices:
    selector:
      driver: dra.example.com
      pool: work-node
      device: gpu-0
    attributes:
      my-additional-attribute-foo:
        string: bar

Contributor Author

The word "selector" for the field becomes a bit more problematic when considering CEL selectors inside it:

apiVersion: resource.k8s.io/v1alpha3
kind: ResourceSliceOverride
metadata:
  ...
spec:
  devices:
    selector:
      deviceClass: dra.example.com
      selectors:
      - cel:
          expression: device.attributes["dra.example.com"].uid == "ABCD-1234"
    taints:
    - key: dra.example.com/unhealthy
      value: memory checksum error
      effect: NoSchedule

Selectors inside a selector? Hmm...

How about

spec:
  devices:
    filters:
      deviceClass: dra.example.com
      selectors:
      - cel:
          expression: device.attributes["dra.example.com"].uid == "ABCD-1234"

Contributor Author

Works for me.

Contributor Author

Should it be "filters" or "filter"? It's not a list.

Probably "filter" is the best compromise. A single "filter" can be a composition of a set of selectors, right? Even if we're playing fast and loose with grammar, I agree that "filters" suggests a list/array/slice instead of a dictionary.

Contributor Author

Now it's "filter" and DevicePatchFilter.
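
To make the matching rule from this thread concrete ("all of these criteria must be satisfied by a device" and "no selection criteria ... matches all devices"), here is a minimal Go sketch. It is not the KEP's implementation: the `Device` type and the `Matches` method are hypothetical, and the device class and CEL selector criteria are left out to keep it short.

```go
package main

import "fmt"

// Device is a simplified, hypothetical stand-in for a device published in a
// ResourceSlice. Only the identifying names are modeled here.
type Device struct {
	Driver, Pool, Name string
}

// DevicePatchFilter loosely mirrors the filter discussed in this thread.
// A nil field means "not set" and does not restrict the match.
// Device class and CEL selectors are intentionally omitted from this sketch.
type DevicePatchFilter struct {
	Driver *string
	Pool   *string
	Device *string
}

// Matches reports whether every criterion that is set is satisfied by d.
// A filter without any criteria matches all devices.
func (f DevicePatchFilter) Matches(d Device) bool {
	if f.Driver != nil && *f.Driver != d.Driver {
		return false
	}
	if f.Pool != nil && *f.Pool != d.Pool {
		return false
	}
	if f.Device != nil && *f.Device != d.Name {
		return false
	}
	return true
}

// ptr is a small helper for building filters in examples.
func ptr(s string) *string { return &s }

func main() {
	gpu0 := Device{Driver: "dra.example.com", Pool: "work-node", Name: "gpu-0"}

	exact := DevicePatchFilter{Driver: ptr("dra.example.com"), Pool: ptr("work-node"), Device: ptr("gpu-0")}
	otherPool := DevicePatchFilter{Driver: ptr("dra.example.com"), Pool: ptr("other-node")}
	empty := DevicePatchFilter{}

	fmt.Println(exact.Matches(gpu0))     // true: all set criteria are satisfied
	fmt.Println(otherPool.Matches(gpu0)) // false: the pool criterion fails
	fmt.Println(empty.Matches(gpu0))     // true: no criteria, matches everything
}
```

Using pointers for the optional criteria keeps "not set" distinct from "set to the empty string", which is the same concern raised elsewhere in this review about giving meaning to zero values.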


The intent to override device attributes must be recorded persistently so that
it is preserved even when a ResourceSlice gets removed or updated. To achieve
this, a new cluster-scoped ResourceSliceOverride type gets added. A single


The fact that this new API enables partial (or no) overrides across the possible sets of Attributes + Capacity key-values, and that it also enables adding brand new key-values as an extension of the existing ResourceSlice data... does that suggest that ResourceSlicePatch is a more semantically expressive name for this new API? The term "override" doesn't entirely capture the tolerant outcome that our proposed merging strategy will yield.

Not trying to bike shed too much! wdyt?

(Patch is also expressive of our canonical use case: cluster admins updating a device driver as a part of node maintenance.)

Contributor Author

That's a good suggestion. I was struggling with "override" myself when considering cases where the actual merge strategy isn't a strict "one value wins", for example for taints.

While making the change, I noticed one complication: the plural of "Patch" is "Patches". This non-standard plural form makes some of the API implementation icky. But good naming is worth that inconvenience, so let's go with it unless someone has a better suggestion.

Thanks for considering this.

I think this is more expressive, and also makes the KEP itself a bit more readable as all the concepts come together more easily.

(Gentle reminder to propagate the final name throughout the taints/tolerations KEP once that's decided)

Contributor Author

Done.

Contributor Author

Speaking of patches...

As described right now, a ResourceSlicePatch cannot remove attributes. I don't have a specific use case in mind for it, I'm just seeing the gap.

One way of supporting it would be to add a Remove *bool field in DeviceAttribute which can only be set in a ResourceSlicePatch:

apiVersion: resource.k8s.io/v1alpha3
kind: ResourceSlicePatch
metadata:
  ...
spec:
  devices:
    selector:
      driver: dra.example.com
      pool: work-node
      device: gpu-0
    attributes:
      some-existing-attribute:
        remove: true

Would this be useful?

Note that this can break users' CEL expressions: if a vendor defines "some-existing-attribute is always set for our devices", then users don't need to check for existence. An admin removing it then causes attribute lookup errors for those users.

I don't see any immediate use case for this, but agree it seems worth enabling if it doesn't add too much overhead.

I suppose one other alternative would be something like defining an attribute with no value:

apiVersion: resource.k8s.io/v1alpha3
kind: ResourceSlicePatch
metadata:
  ...
spec:
  devices:
    selector:
      driver: dra.example.com
      pool: work-node
      device: gpu-0
    attributes:
      some-existing-attribute: {}

If an empty attribute can also be defined in a regular ResourceSlice like that and it's functionally equivalent to the attribute not being defined at all, then the semantics might be simpler than a distinct remove toggle. On the other hand, it's leaning a little into "magic" territory where it's not obvious what an empty value means just by looking at the API.

One tweak to that to make it a little more explicit might be to have an explicit null value for an attribute.

apiVersion: resource.k8s.io/v1alpha3
kind: ResourceSlicePatch
metadata:
  ...
spec:
  devices:
    selector:
      driver: dra.example.com
      pool: work-node
      device: gpu-0
    attributes:
      some-existing-attribute:
        null: {}

This also looks a little weird though and will likely require some nonsense like this:

if attr.NullValue != nil {
	// attribute is null
}

This high-level approach would be more similar to something like a plain JSON merge patch, where having a separate remove field feels more like a RFC 6902 JSON patch. If we add a remove field, maybe we could instead add an op field like in the RFC and make remove one possible value for that alongside others like add or replace. That might make remove less of a special case at the expense of making the more common add/replace case a little more verbose.

I think the zero value approach is sustainable and equivalent to delete in terms of expressing "I want to override any actionable outcomes that the existing attribute's value may initiate".

  • an attribute with a string "" zero value is reliably equivalent to the attribute not existing (Go idiom would suggest that if a "" empty string value is significant, it should be implemented as a *string to disambiguate between an explicit "" and "no user-provided value")
  • any bool value can be equivalently "deleted" by setting to false (if false is an explicit, non-default value, then it should be implemented as a *bool)
  • struct values can be "deleted" via {}
  • numeric values for which 0 is equivalent should be implemented as a pointer
  • any pointer to a type can be equivalently "deleted" by setting the value to nil

Maybe I'm overthinking the above and the set of attribute types is more strictly constrained?

tl;dr I think the zero value approach is more elegant if it is reliably deterministic in the ways I've described

Contributor Author

some-existing-attribute: {}

The problem is that a null DeviceAttribute where all fields were left unset by the user cannot be distinguished from a future DeviceAttribute where some new field was set which the client doesn't know about yet. All fields in DeviceAttribute are part of a "one of": exactly one must be set for it to be valid. Receiving no fields from the apiserver tells clients that they are out-dated and cannot handle the DeviceAttribute.

We use this in several places in the Kubernetes API to prevent clients from doing something that they shouldn't be doing because they don't know better. In this case, a client would remove an attribute instead of overriding it with some unknown value type. The explicit remove: true avoids that. So does null: {}. I like null: {} a little better.

On the other hand, it's leaning a little into "magic" territory where it's not obvious what an empty value means just by looking at the API.

That's also true.

I don't see any immediate use case for this, but agree it seems worth enabling if it doesn't add too much overhead.

I doubt that it adds overhead. It's mostly just extra work for the design (see discussion above...) and review.

Contributor Author

I've added it for attributes. I left it out for capacities because setting a capacity to zero seems sufficient and DeviceCapacity cannot be extended as easily as a DeviceAttribute to a "null capacity" (it is not a one-of, and the Value field is required).
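
The outcome of this thread can be sketched in Go as follows. This is only an illustration under assumptions: the attribute types are reduced to three value kinds, `applyAttributePatch` is a hypothetical helper, and the sketch does not verify that exactly one field of the one-of is set. It shows the three cases discussed above: an explicit null removes an attribute, a known value overrides or adds one, and an entry with no known field set is rejected rather than treated as "remove".

```go
package main

import "fmt"

// DeviceAttribute is a simplified one-of: exactly one field is expected to be set.
type DeviceAttribute struct {
	String *string
	Int    *int64
	Bool   *bool
}

// NullValue marks an intentionally empty attribute.
type NullValue struct{}

// NullableDeviceAttribute embeds DeviceAttribute and adds one more one-of
// alternative that is only meaningful inside a patch.
type NullableDeviceAttribute struct {
	DeviceAttribute
	NullValue *NullValue
}

// applyAttributePatch (hypothetical) merges patch into a copy of base.
// An explicit null removes the attribute, a known value overrides or adds it,
// and an entry with no known field set is an error: the absence of a value
// must mean "don't know", never "remove".
func applyAttributePatch(base map[string]DeviceAttribute, patch map[string]NullableDeviceAttribute) (map[string]DeviceAttribute, error) {
	out := make(map[string]DeviceAttribute, len(base))
	for name, attr := range base {
		out[name] = attr
	}
	for name, p := range patch {
		switch {
		case p.NullValue != nil:
			delete(out, name)
		case p.String != nil || p.Int != nil || p.Bool != nil:
			out[name] = p.DeviceAttribute
		default:
			return nil, fmt.Errorf("attribute %q: no known value field set", name)
		}
	}
	return out, nil
}

func main() {
	model, foo := "model-x", "bar"
	base := map[string]DeviceAttribute{"some-existing-attribute": {String: &model}}

	patched, err := applyAttributePatch(base, map[string]NullableDeviceAttribute{
		"some-existing-attribute":     {NullValue: &NullValue{}},
		"my-additional-attribute-foo": {DeviceAttribute: DeviceAttribute{String: &foo}},
	})
	fmt.Println(len(patched), err)

	// An empty entry could come from a newer API version with an unknown
	// value type, so it is rejected instead of being read as a removal.
	_, err = applyAttributePatch(base, map[string]NullableDeviceAttribute{"some-existing-attribute": {}})
	fmt.Println(err != nil)
}
```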

Comment on lines 363 to 365
If a CEL expression fails for a device, the override does not apply and an
event will be generated for the ResourceSlicePatch with the faulty CEL
expression.

Does "fail" in this context mean an invalid CEL expression caused by something like a syntax error, and not that it cleanly evaluates to false?

Contributor Author

"fails to evaluate to a boolean (runtime error, wrong result type)".

Syntax errors are caught during validation, but the attribute lookup is not type safe (device.attributes[...].someField may or may not be a bool) and can cause key lookup exceptions (in this case, if someField doesn't match any attribute).

Contributor Author

I updated the paragraph.
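
The distinction drawn here (a clean `false` versus a failure to produce a boolean) can be sketched in Go. This deliberately avoids a real CEL library: `evalFunc` is a hypothetical stand-in for a compiled expression, and `lookup` simulates an unchecked attribute access.

```go
package main

import (
	"errors"
	"fmt"
)

// evalFunc is a hypothetical stand-in for a compiled CEL expression that is
// evaluated against the attributes of one device.
type evalFunc func(attrs map[string]any) (any, error)

// selects distinguishes the outcomes described above: a clean true or false
// result, versus a failure (runtime error or non-boolean result type).
// On failure the patch would not apply to the device and the error would be
// reported, for example through an event.
func selects(eval evalFunc, attrs map[string]any) (bool, error) {
	result, err := eval(attrs)
	if err != nil {
		return false, fmt.Errorf("runtime error: %w", err)
	}
	matched, ok := result.(bool)
	if !ok {
		return false, fmt.Errorf("expression returned %T, want bool", result)
	}
	return matched, nil
}

// lookup simulates an expression that reads a single attribute without
// checking for its existence first.
func lookup(key string) evalFunc {
	return func(attrs map[string]any) (any, error) {
		v, ok := attrs[key]
		if !ok {
			return nil, errors.New("no such key: " + key)
		}
		return v, nil
	}
}

func main() {
	expr := lookup("healthy")
	fmt.Println(selects(expr, map[string]any{"healthy": false})) // clean false, no error
	fmt.Println(selects(expr, map[string]any{}))                 // key lookup error
	fmt.Println(selects(expr, map[string]any{"healthy": "yes"})) // wrong result type
}
```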

@pohly pohly changed the title KEP-5027: DRA: admin-controlled device attributes + device taints KEP-5027 + 5055: DRA: admin-controlled device attributes + device taints Jan 24, 2025
// satisfied by a device to be patched.
//
// +optional
DeviceClass *string

Is it worth naming this DeviceClassName to be consistent with that field in ResourceClaims?

https://github.com/kubernetes/kubernetes/blob/03bf94bac074ce43228ee906a8cadea6176873c0/staging/src/k8s.io/api/resource/v1beta1/types.go#L411-L424

Also perhaps Pool -> PoolName?

Contributor Author

Yes, it should be consistent, and thus DeviceClassName.

But driver, pool, and device are referred to without the Name suffix (https://github.com/kubernetes/kubernetes/blob/03bf94bac074ce43228ee906a8cadea6176873c0/staging/src/k8s.io/api/resource/v1beta1/types.go#L1012-L1035).

This is based on an API guideline which says "use *Name only for API objects". DeviceClass is an API object, "pool" isn't. I personally would have preferred "DriverName" instead of "Driver" because there is a difference between a "driver" (the thing, perhaps described by a struct) and a "driver name" (one particular attribute of it). I had that in initial revisions of the API, but was told to remove the suffix for the sake of consistency with other APIs.

Contributor Author

Hrrm. "DeviceClassName" next to "Driver/Pool/Device" looks odd. Not sure what a good solution is here. Also, suppose we do add a "resource.Driver" type similar to "storage.CSIDriver". Then "DriverName" suddenly would become more suitable than "Driver". Still not a fan of this API convention.... 🤷

I'm going with "consistent with other fields" for now, but we may have to revisit as part of the final API review.

Contributor Author

I've also added comments that explain what those other fields are.


The feature is following the approach and APIs taken for node taints and
applies them to devices. A new controller watches tainted devices and deletes
pods using them unless they tolerate the device taint, similar to the
Member

Operators like training operator (https://github.com/kubeflow/training-operator) can spawn more pods when pods are deleted, will pod deletion cause pending pods in the cluster?

Member

Pods can also have finalizers set by different operators

Contributor Author

A finalizer doesn't prevent stopping the containers. As soon as the kubelet sees a pod with DeletionTimestamp, it will stop it. That's sufficient for this KEP.

The pod object then remains in the API server with a final "stopped and deleted" state. That in turn is sufficient for DRA to deallocate claims, so this also doesn't block devices.

Member

Thanks. If an operator pod spawns the next pod referencing the same claim, will such a claim be deallocated?

Contributor

Hi, I'm one of this KEP's authors.

In this design, taints will be defined per device in ResourceSlice(Override). Tolerations will be declared per device in ResourceClaim.

If an operator pod spawns the next pod referencing the same claim, will such a claim be deallocated?

Thus, if multiple pods share a ResourceClaim, some devices allocated to it are tainted, and the claim cannot tolerate the taints, then yes, all the pods will be deleted and the ResourceClaim will be deallocated.

Contributor Author

A finalizer doesn't prevent stopping the containers. ...

Added also as text in the KEP.

Member

Thanks @everpeace. This may be good for workloads that have all-or-nothing semantics (e.g., AI workloads); on the downside, independent workloads consuming a single claim will have a hit even if one of the independent pods crashes. Is my understanding correct?

Contributor

@everpeace everpeace Jan 30, 2025

independent workloads consuming a single claim will have a hit even if one of the independent pods crashes. is my understanding correct?

Yes, you're correct. But I don't think it's a downside.
No. Please see the comment below from pohly 🙇

If multiple pods share the same ResourceClaim and a device allocated to the claim is tainted (effect=NoExecute), I think it's natural to delete all the pods using the ResourceClaim because the claim cannot tolerate the taint (effect=NoExecute).

Contributor Author

independent workloads consuming a single claim will have a hit even if one of the independent pods crashes

A pod crashing seems unrelated to me. It doesn't cause the claim to get tainted, so other pods sharing the claim are not affected.
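
The eviction rule discussed in this thread can be sketched in Go. This is a hedged illustration, not the controller from the KEP: the type names, the `Exists`/`Equal` operators, and the `shouldEvict` helper are assumptions modeled on node taints and tolerations, which the KEP says it follows. The decision is made per claim, which is why all pods sharing a claim are affected together, while a crashing pod changes nothing because it adds no taint.

```go
package main

import "fmt"

// DeviceTaint and DeviceToleration are simplified, hypothetical types modeled
// on node taints and tolerations.
type DeviceTaint struct {
	Key, Value, Effect string
}

type DeviceToleration struct {
	Key, Operator, Value, Effect string
}

// tolerates reports whether a single toleration covers a single taint.
// An empty Effect in the toleration matches any effect. "Exists" ignores the
// value and, with an empty key, matches any key. An empty operator is treated
// as "Equal". Unknown operators never match.
func tolerates(tol DeviceToleration, taint DeviceTaint) bool {
	if tol.Effect != "" && tol.Effect != taint.Effect {
		return false
	}
	switch tol.Operator {
	case "Exists":
		return tol.Key == "" || tol.Key == taint.Key
	case "", "Equal":
		return tol.Key == taint.Key && tol.Value == taint.Value
	default:
		return false
	}
}

// shouldEvict reports whether pods using a claim must be deleted: true when
// any allocated device carries a NoExecute taint that no toleration covers.
func shouldEvict(taints []DeviceTaint, tolerations []DeviceToleration) bool {
	for _, taint := range taints {
		if taint.Effect != "NoExecute" {
			continue // NoSchedule only affects new allocations in this sketch
		}
		tolerated := false
		for _, tol := range tolerations {
			if tolerates(tol, taint) {
				tolerated = true
				break
			}
		}
		if !tolerated {
			return true
		}
	}
	return false
}

func main() {
	unhealthy := []DeviceTaint{{Key: "dra.example.com/unhealthy", Effect: "NoExecute"}}
	tolerant := []DeviceToleration{{Key: "dra.example.com/unhealthy", Operator: "Exists"}}

	fmt.Println(shouldEvict(unhealthy, nil))      // true: every pod sharing the claim is deleted
	fmt.Println(shouldEvict(unhealthy, tolerant)) // false: the claim tolerates the taint
	fmt.Println(shouldEvict(nil, nil))            // false: no taint, a crashing pod alone changes nothing
}
```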


The other usage is to influence which devices are picked when there are
multiple viable alternatives. This is a first step towards providing a more
comprehensive [scoring](https://github.com/kubernetes/enhancements/issues/4970)
Member

Can we add more information on how scoring can be achieved?

Contributor Author

I need to remove this and the preceding paragraph. Device priority is no longer part of this KEP and health is a separate one now.

Contributor Author

Done.

-->

One E2E test scenario is to mark all devices as offline and then verify that
pods don't get scheduled. Another is to set different priorities and check that
Member

Based on this comment, the E2E tests also need to be changed.

Contributor Author

Thanks for the reminder. Fixed.

pohly added 2 commits January 29, 2025 15:17
This gets added for the sake of completeness.
// Must be a label value.
//
// +optional
Value string `json:"value,omitempty" protobuf:"bytes,3,opt,name=value"`

What kind of values are expected here? E.g. true / false are longer than 3 bytes...

Contributor Author

3 is the protobuf index number, not the maximum length. Which reminds me, I should remove the struct tags because they don't add value here in the KEP...

The value is whatever the creator of a taint sets as value. The API itself doesn't care as long as it has the form of a Kubernetes label value. What that is exactly isn't defined anywhere for node taints, so presumably other documentation covers that or it's considered common knowledge. It's the same as for labels.

Contributor Author

Which reminds me, I should remove the struct tags because they don't add value here in the KEP...

Fixed, together with tab vs. space indentation.


### Non-Goals

- Not part of the plan for alpha: developing a kubectl command for tainting devices.

Suggested change
- Not part of the plan for alpha: developing a kubectl command for tainting devices.
- Not part of the plan for alpha: developing a kubectl command for managing device taints.

(Managing; listing, adding, removing taints.)

Contributor Author

Thanks, changed.

Comment on lines +274 to +277
// In contrast to attributes in a ResourceSlice, entries here are allowed to
// be marked as empty by setting their null field. Such entries remove the
// corresponding attribute in a ResourceSlice, if there is one, instead of
// overriding it.

Is it worth enforcing that null cannot be set in a ResourceSlice? If it's allowed, that would leave the option open for drivers to do so in case they want to communicate some nuance like "I think users may expect this attribute to exist, but it intentionally does not."

Contributor Author

I think that's too nuanced and adds one more thing that users writing CEL expressions would have to be prepared for, in addition to the "can I access this attribute without getting a lookup error".

Comment on lines 372 to 379
// NullValue, if set, marks an intentionally empty attribute.
//
// May be used inside a ResourceSlicePatch to remove attributes,
// but not in a ResourceSlice.
//
// +optional
// +oneOf=ValueType
NullValue *NullValue `json:"null,omitempty"`

In the CEL environment, will null values be omitted entirely from the device.attributes map if we don't handle them specially? If not, enabling that might be worthwhile since I can see that being a little more ergonomic for authors of CEL expressions than if they have to handle "explicit null" and "actually undefined" differently.

I guess my question here boils down to "how does a Go map with a value of nil manifest in the CEL environment, and is that different from a Go map without that key/value pair at all?"

Contributor Author

Because ResourceSlices contain no null values and null values in a ResourceSlicePatch cause the attribute to be removed, there's never a situation where a CEL expression gets a null value when looking up an attribute. That is deliberate: we already have "attribute not set in map", we don't need "attribute set with null value". That is semantically so close that I don't see the need.

@@ -166,20 +166,8 @@ scheduler when selecting devices for user requests in ResourceClaims.
This KEP adds a Kubernetes API that privileged users, typically cluster
administrators, can use to override or extend that information. This can be
Member

@asm582 asm582 Jan 29, 2025

Can the admin be a maintenance controller? If so, can we add that to the KEP documentation?

Contributor Author

Yes, will add.

// ^^^^
// The assumption here is that all device types will have attributes and capacities,
// similar to the current BasicDevice type. Therefore the overrides are not made
// specific to certain device types.
}

// DevicePatchFilter defines which device(s) a [DevicePatch] applies to.
type DevicePatchFilter struct {
Member

This may be out of scope for this KEP, but can an external controller add a selector that is later consumed by the machinery described in the KEP?

Contributor Author

This touches on the question whether DevicePatchFilter is immutable: it isn't, so whenever a selector gets added or removed, it changes how the scheduler evaluates the patch.

The machinery in this KEP doesn't care who does the updating, so it could be a controller.

Comment on lines +211 to +212
Creating a ResourceSlicePatch is racing with on-going scheduling attempts,
which is unavoidable.

Just for my own understanding, is the worst-case scenario here something like this?

  1. A Pod comes into the scheduler and a ResourceSlicePatch is created at the same time.
  2. The scheduler successfully schedules the Pod, having not yet observed the new ResourceSlicePatch.
  3. The ResourceSlicePatch makes modifications such that the Pod's ResourceClaims no longer match the devices it was allocated (e.g. changing an attribute referenced in a selector).
  4. The scheduled Pod continues to run with the unsuitable allocated device.

And does this same race condition already exist today when updating ResourceSlices since the scheduler's view of ResourceSlices is driven by an informer?

Is the "correct" answer to this to use only taints instead of attributes/capacity for anything that should cause a Pod to be evicted at runtime?

Contributor Author

Your understanding is correct, on all points.

// ^^^
// `NullableDeviceAttribute` as an extension ensures that the OpenAPI
// for ResourceSlice remains unchanged. Using the same type with
// a `NullValue` that can be set only in one type is less clear.
Contributor Author

@aaron-prindle: will a NullableDeviceAttribute with some "oneOf" alternatives inside the embedded DeviceAttribute and one more outside of it work for declarative validation?

It should work right now (OpenAPI flattens embedded structs) and it is more natural in Go (can use a NullableDeviceAttribute to initialize a DeviceAttribute without manually written copy code).

But if this then poses a problem for declarative validation, then it will be difficult to switch because the embedding leads to different protobuf encoding.

Member

Given how early declarative validation is, I feel confident saying "we can make it work". I don't think it is something we handle in the dev-branch prototype, but it seems reasonably well defined.

What you're doing here isn't obvious at first blush (even I started writing an alternative), but this comes back to a hard-learned lesson: Don't make "nothing" mean something. The absence of a value in a patch cannot mean "remove", it has to mean "don't know".

Contributor Author

Yes, exactly, hence the explicit "null". One alternative in a prior comment thread in this PR was an explicit remove: true (but then what does remove: false mean?) or remove: {}.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.
Projects
Status: 👀 In review
Status: Needs Triage
8 participants