Pass down resources to CRI #4113
base: master
Conversation
marquiz commented on Jun 28, 2023
- One-line PR description: KEP for extending the CRI API to pass down unmodified resource information from the kubelet to the CRI runtime.
- Issue link: Pass down resources to CRI #4112
- Other comments:
Co-authored-by: Antti Kervinen <[email protected]>
@marquiz: GitHub didn't allow me to request PR reviews from the following users: fidencio. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
```diff
+ map<string, k8s.io.apimachinery.pkg.api.resource.Quantity> requests = 2;
+ map<string, k8s.io.apimachinery.pkg.api.resource.Quantity> limits = 3;
```
should the keys here be a special type instead of unstructured?
I don't think it's possible to have something like `type ResourceName string` in protobuf. Please correct me if I'm wrong.
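For illustration (a sketch, not the actual generated API): protobuf restricts map keys to integral and string scalar types, so in generated Go code the resource name would surface as a plain string key rather than a named `ResourceName` type.

```go
package cri

import "k8s.io/apimachinery/pkg/api/resource"

// Sketch of what generated Go code for the proposed message could look
// like. Protobuf map keys cannot be a named type, so resource names
// appear as plain strings here.
type KubernetesResources struct {
	Requests map[string]*resource.Quantity
	Limits   map[string]*resource.Quantity
}
```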
ping @haircommander, are you satisfied with the reply (close as resolved)?
/retitle Pass down resources to CRI
@marquiz We need to check how this will work with DRA and CDI devices, i.e. whether we have enough information to know which devices need to be added to the sandbox just from the resource claim name.
@marquiz There is already some code for sandbox sizing (accumulation of CPU and memory resources), for reference: kubernetes/kubernetes#104886, which we leverage in Kata. What are the plans for this interface: deprecate it or keep it?
Thanks @marquiz.
It would be good to get more concrete details on the use cases that this would enable.
There is also the question of complex devices that are managed by device plugins, where there isn't a clear mapping from the `resources` entry (e.g. `vendor.com/xpu: 1`) to the resources added to the container, or DRA, where the associated `resources.requests.claims` entry is not mentioned.
> #### Story 3
>
> As a cluster administrator, I want to install an OCI hook/runc wrapper/NRI
Could you expand on this use case? How does extending the CRI translate to modifications in the OCI runtime specification which is interpreted by runc (or wrappers)?
The CRI changes (in this KEP) would not directly translate to anything in the OCI config. It's just "informational" data that a possible hook/wrapper/plugin can then use to tweak the OCI config. Say you want to do customized CPU pinning in your plugin. I'll add some more flesh to this section...
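As a rough illustration of the kind of tweak such a plugin could make (a sketch with hypothetical types and a stubbed allocator, not the actual NRI API):

```go
package main

import "fmt"

// Hypothetical view of the per-container resource information a
// hook/wrapper/NRI plugin would receive via the extended CRI.
type ContainerResources struct {
	Requests map[string]string // e.g. "cpu": "2"
	Limits   map[string]string
}

// pinExclusiveCPUs sketches a custom pinning policy: a container whose CPU
// request equals its limit gets an exclusive cpuset, which the plugin would
// then write into the OCI config (whole-CPU checks elided).
func pinExclusiveCPUs(name string, res ContainerResources) string {
	cpu := res.Requests["cpu"]
	if cpu != "" && cpu == res.Limits["cpu"] {
		cpuset := "0-1" // stub; a real plugin would run an allocation algorithm
		fmt.Printf("pinning container %s to cpuset %s\n", name, cpuset)
		return cpuset
	}
	return "" // leave the default cpuset untouched
}

func main() {
	pinExclusiveCPUs("app", ContainerResources{
		Requests: map[string]string{"cpu": "2"},
		Limits:   map[string]string{"cpu": "2"},
	})
}
```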
@elezar I updated Story 3, PTAL
Thanks.
Resolved?
ping @elezar may I close this as resolved?
```yaml
requests:
  cpu: 100m
  memory: 100M
  vendor.com/xpu: 1
```
For clarification: This does not indicate the properties of the resource that was actually allocated for a container requesting one of these devices?
That's very much true. I think I'll add a note about this in the KEP somewhere
@elezar I added a note about device plugin resources after this example. WDYT?
ping @elezar may I close this as resolved?
```diff
  WindowsPodSandboxConfig windows = 9;
+
+ // Kubernetes resource spec of the containers in the pod.
+ PodResourceConfig pod_resources = 10;
```
@MikeZappa87 since you recently shared something along these lines for the networking capabilities: this KEP is also meant to interface with NRI.
Another point to consider is whether and how we're going to integrate these enhancements with the new containerd Sandbox API.
@zvonkok that one covers just the native resources and gives them in "obfuscated" form, i.e. not telling the actual requests/limits (plus it's for Linux resources only). I think we wouldn't, or even couldn't, touch this, i.e. we'd keep it.
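For context, an abridged sketch of those native Linux resources (field names abridged from the CRI's LinuxContainerResources; the derivation notes are assumptions based on the kubelet's usual cgroup math):

```go
package cri

// Abridged sketch of the cgroup-level resources the runtime sees today;
// the original requests/limits cannot be recovered from these values.
type LinuxContainerResources struct {
	CpuShares          int64 // derived from the CPU request
	CpuQuota           int64 // derived from the CPU limit
	CpuPeriod          int64
	MemoryLimitInBytes int64 // derived from the memory limit
}

// For example, requests.cpu=500m becomes CpuShares=512 (500*1024/1000);
// the declarative "500m" is lost, as are extended resources entirely.
```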
Just for reference, linking this old KEP here: #3080. It looks like there were some comments that CDI may not be the complete solution for some use cases. We do not want to break anything by relying only on CDI when passing down the devices.
> The UpdatePodSandboxResources CRI message is also updated when/if that is
> introduced by the [KEP-1287][kep-1287] Beta ([PR][kep-1287-beta-pr]).
>
> Information about the Pod-level resources are added when/if the Pod-level
@ndixita can you please confirm it is tracked in your KEP?
Re-phrased this paragraph to:

> [KEP-2837][kep-2837] Alpha ([PR][kep-2837-alpha-pr]) proposes to add a new
> Pod-level resource requirements field to the PodSpec. This information will
> be added to the PodResourceConfig message, similar to the container resource
> information, if/when KEP-2837 is implemented as proposed.

The intention is to sync with the other KEPs.
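A sketch of where that could land (all names illustrative; the final message layout is up to the KEPs to sync on):

```go
package cri

import "k8s.io/apimachinery/pkg/api/resource"

// Illustrative Go shape of the proposed PodResourceConfig message.
type KubernetesResources struct {
	Requests map[string]*resource.Quantity
	Limits   map[string]*resource.Quantity
}

type ContainerResourceConfig struct {
	Name      string               // container name
	Resources *KubernetesResources // unmodified requests and limits
}

type PodResourceConfig struct {
	Containers []ContainerResourceConfig // one entry per container
	// PodResources *KubernetesResources // possible pod-level addition
	// if/when KEP-2837 lands as proposed.
}
```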
This is not tracked in the Pod Level Resources KEP. While I am trying to understand this KEP, I have a question: will the VM-based runtimes benefit from knowing the overall requirements of all the containers in a pod rather than the requirements of individual containers? Again, this question is just for my understanding...
> request.
>
> ```diff
> message PodSandboxConfig {
> ```
How big will this structure be? What would be the max number of containers we can report in this structure? (What is the message size limit today?)
By a quick search, the gRPC message size limit is 4 MB. This is a lot more than the maximum size of a Pod object in Kubernetes, so I believe we should not hit the limit in any practical scenario. WDYT?
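For reference, 4 MiB is gRPC's default maximum receive-message size, and a server can raise it if that ever became a concern; a minimal sketch using google.golang.org/grpc:

```go
package main

import "google.golang.org/grpc"

func newCRIServer() *grpc.Server {
	// gRPC's default max receive-message size is 4 MiB; a runtime could
	// raise it if PodSandboxConfig payloads ever approached that limit.
	return grpc.NewServer(grpc.MaxRecvMsgSize(16 * 1024 * 1024))
}
```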
4 MB may be OK.
- rename enum REGULAR_CONTAINER -> CONTAINER
- update references to other related keps under Design Details
```diff
  WindowsPodSandboxConfig windows = 9;
+
+ // Kubernetes resource spec of the containers in the pod.
+ PodResourceConfig pod_resources = 10;
```
What is the runtime supposed to do when this does not match the values provided in `resources` of `LinuxPodSandboxConfig`?
There should be no overlapping information between these two structures, thus a conflict should not be possible.
I am not sure I understand. Maybe I am missing something here, but you should be able to calculate what is in `resources` based on the pod spec. Is that correct? If so, some implementation may take a dependency on the specific way one is calculated. Hence the question: what if they stop matching at some point? Also, if they are inter-dependent, perhaps some of this could be deprecated?
The runtime is not calculating the resources of `LinuxPodSandboxConfig`. They are set by the kubelet.
> The resources (mounts, devices, CDI devices, Kubernetes resources) in the
> CreateContainer request should be identical to what was (pre-)informed in the
> RunPodSandbox request. If they are different, the CRI runtime may fail the
How would we handle cases where the VM was restarted and we no longer really know what was originally passed at sandbox creation? I do not understand this consistency-check argument.
Yes, this paragraph seems outdated, especially in light of the in-place pod updates. The paragraph was added as a response to some review feedback (I just couldn't find that comment now).

How about re-phrasing it, for example:

> In some usage scenarios the container creation may fail if the resources
> (mounts, devices, CDI devices, Kubernetes resources) in the CreateContainer
> request do not match what was (pre-)informed in the RunPodSandbox request.
> This may be the case e.g. when changes are not allowed after a VM-based Pod
> has been created.

WDYT @SergeyKanzhelev ?
I believe he means the original pod metadata plus any pod updates must be applied to the pod before the container is created.
> ### kubelet
>
> Kubelet code is refactored/modified so that all container resources are known
> before sandbox creation. This mainly consists of preparing all mounts (of all
Is there a possibility that today some mounts can ONLY be created when the container is being created (something that an init container needs to initialize)? Will this break that behavior?
This was discussed/investigated in #4113 (comment). I think there should be no breakages because of this.
WDYT @SergeyKanzhelev ?
When appropriate, image mounts included in the container config are added at create-container time, when we build the rootfs. Additionally, OCI-level plugins, NRI plugins, CDI, and CRI proxy tools may add mounts. But these are not the "extra" mounts passed in from the host-level CRI client (the kubelet); also, the OCI image volume mount is added by the container runtime during create container. In the OCI volume case, this is why I'm asking for the OCI volume image reference.
The `Mount` structure is exactly the same (with the same information) as what will be passed as part of the CreateContainerRequest/ContainerConfig message.
that works..
It actually goes back to this comment: #4113 (comment).
Would it be expected that runtimes will need to mount everything at pod sandbox creation time now? Or is the behavior not defined? Will we require that the old way (without passing all this extra context at sandbox creation) continue working?
The runtimes do not need to change the way they operate. They don't need to mount everything at sandbox creation.
All of the data is something that the runtime CAN use if it needs it. E.g. this paragraph of the proposal tries to state that. Maybe we need to state it more clearly in more places(?), and certainly in the comments in the CRI proto file.
The CAN part worries me. If something CAN be used, the only assumption we can make is that it IS BEING USED. This is why maybe we should have two modes of operation.
@SergeyKanzhelev as was mentioned above, this information is really needed in some scenarios, and OCI runtimes that depend on it will use it to fix the current problem. Other runtimes that don't have problems now are expected to ignore it.
Besides the comments above, I think it is important to clarify, and even test as part of critest, the requirements for the container runtime. It goes back to this question: #4113 (comment) Some critest ideas might be:
I understand the challenge with the VM-based runtimes, but I am also worried that we are going against the direction of making the Pod more dynamic. All the duplication, and the mismatch between what we thought Pods and containers were before and what they will be, is scary to me. Another alternative would be to have two modes of runtime: one mode with all information available, where every dynamic change recreates the Pod, and another mode, which is what we have today. The runtime would need to tell on startup which mode it wants to work in.
Can you elaborate on the direction of "making Pod more dynamic"?
What are you scared of, or what are you imagining where this is going?
We will have more and more sandboxed environments that will need such information. The use case with Confidential Containers and GPUs (or any other accelerator) is not realizable with the current architecture. Additionally, there is hardware that you cannot hot-plug at runtime, where you need a static configuration at sandbox creation time.
This thing is targeted at solving the problem that CoCo VMs have (which is impossible to fix without additional info being passed down); it is not trying to deprecate or change the meaning of any other Pod lifecycle messages or events.
If there are no bugs in the kubelet code, there should be no mismatches between the raw data derived from the pod spec and the fields calculated from it.
Device and mount paths are provided at pod sandbox creation time as additional information. At the time when a container is created, the same information is passed again as part of the container-level messages.
In fact, we are not going against the direction of pods and containers becoming dynamic. It is the opposite: with every piece of additional data passed, we are reflecting the current state of the object, e.g. "at pod creation we know this much info about the workload", and "on container creation, something already got changed, so here is the current state of that container". This allows us to do rightsizing of the VM during creation, including all possible heuristics based on the information known at that time, as well as failing creation of a container on (re-)start if something changed since sandbox creation.
We don't really need two modes. We just need enough information to make decisions at existing pod lifecycle events, as well as proper error handling for scenarios where dynamic resize of the pod or container is not feasible for some reason.
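To make the error-handling point concrete, a minimal sketch (a hypothetical helper, not from the KEP) of a VM-based runtime rejecting a container whose resources drifted from what was pre-announced at sandbox creation:

```go
package main

import "fmt"

// checkAgainstSandbox fails container creation when a resource no longer
// matches what was pre-announced in the RunPodSandbox request; relevant
// for runtimes that cannot hot-plug the affected hardware.
func checkAgainstSandbox(announced, actual map[string]string) error {
	for name, want := range actual {
		if got := announced[name]; got != want {
			return fmt.Errorf("resource %q changed after sandbox creation: %q -> %q",
				name, got, want)
		}
	}
	return nil
}
```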
LGTM for moving forward with the KEP, under the proviso that a follow-up KEP update is required to sync with 1287 and 2837, to cover message/API details not yet discovered, and to resolve other dangling issues still in open discussion. I also expect us to find needed additions when we have a cc prototype.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: haircommander, marquiz, mikebrow. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
Interesting points. Mode works to a certain degree, and we already have the handler as just such a mode selection. I think the KEP will definitely need to be updated to handle the dynamic pod resource changes (new networks, different resource claims/devices) and the case when a handler is found to not be dynamic: we'll just have to cycle the pod by returning an error that the handler does not support this update type, and/or by tainting the handler type ahead of time.
@SergeyKanzhelev @mrunalp How to move this forward?
Hi folks, any updates on this KEP? Any ideas to help move it forward?
- update kep.yaml to target v1.33
- update references to kep-1287 and kep-2837
Updated: the related changes in the other referenced KEPs (#1287, #2837) are now merged, which hopefully makes this a slightly easier read. Those changes/KEPs (for Kubernetes v1.32) are still mentioned in the corresponding sections.
> Including the KubernetesResources in the ContainerConfig message serves
> multiple purposes:
>
> 1. Catch changes that happen between pod sandbox creation and container
In-place resize should be writing a message to the runtime that a pod or container has changed resources, right @tallclair ? I think that'd be better than having the runtime interpret diffs in container values...
Does it do so when the container hasn't yet been created? IIRC the thinking behind this case was the Kata/CoCo use cases. The runtime or NRI plugin CAN detect changes if they need/want to.
@SergeyKanzhelev Are all your questions answered, and are you happy with the current shape of the KEP? If yes, would you mind removing the
Previous PRR still looks good.
@SergeyKanzhelev PTAL
> any information about the resources (such as native resources, host devices,
> mounts, CDI devices etc) that will be required by the application. The CRI
> runtime only becomes aware of the resources piece by piece when containers of
> the pod are created (one-by-one).
Which of this information must come from the Kubelet, and which can be read from the k8s API? Resource requests & limits and volumes can all be read directly from the pod object. Do they need to be plumbed through the CRI, or can you just read from the k8s API instead?
The idea is to have a clear separation between layers, with a normal, clear top-to-bottom information flow between components. The kubelet knows the current state of the object, so it provides the whole state of that object down to the lower layers. This allows the lower layers to use the information immediately, in a consistent, transactional way, and eliminates the need to delay the reply to a CRI gRPC call in order to make another gRPC call to outside services. We want to eliminate scenarios where different projects use backdoors or out-of-band communication to fetch the information they need. Examples are CNI providers that, based on the sandbox ID, try to connect to the k8s API server to get the whole pod object, or similarly, CNI providers connecting back to the kubelet in a "hey, kubelet, you sent me a CRI message but didn't provide enough information; I know your PodResources API has that information for this pod, let me fetch it from there" manner.
So, the whole idea is to have a clean separation between layers:
- The kubelet is responsible for getting the higher-level object from the API server, preparing it to run, and giving it to the runtime in a fully descriptive way.
- Runtimes should not be making calls from the bottom layers back up to the upper layers (kubelet, API server) to get missing pieces.
Adding to that, for mounts we want more/different information than what is available at the Kubernetes API level; at least the `host_path` that will be mounted to the container.
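For illustration, an abridged sketch of the Mount fields in play (the real CRI message carries more, e.g. the propagation mode):

```go
package cri

// Abridged sketch of the CRI Mount message in Go form. The host path is
// resolved by the kubelet (volume plugins, subPath handling, etc.) and
// is not recoverable from the pod spec alone.
type Mount struct {
	ContainerPath string // path inside the container, from the pod spec
	HostPath      string // resolved path on the host, known only to the kubelet
	Readonly      bool
}
```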
In addition to the other points: in Kata/Confidential Containers we cannot handle `subPath` in `volumeMounts`, because subPath is handled completely by the kubelet. https://github.com/kubernetes/kubernetes/blob/d7774fce9a7fcec890d7c0beffacd6ae34152b01/pkg/volume/util/subpath/subpath_linux.go#L175