KEP-5972: Dynamic Containers#6169
Conversation
|
|
||
| **Mitigation:** This change corrects a race condition/atomicity violation, ensuring Pod updates are | ||
| treated as proper transactions (we shouldn't run a new image if we can't secure the resources for | ||
| its accompanying spec). We will explicitly document this behavioral shift. |
There was a problem hiding this comment.
we should be able to catch this situation with metrics too, right? may be worth exploring to make clearer to admins when this has happened
There was a problem hiding this comment.
We could, I'm not really sure it would be worth it though. Resize via pod update is a new capability so the only case we're really changing is the following:
Pod is resized (via /resize), and the resize is deferred, and THEN the container image is updated.
If the image is updated before the resize, we would expect that to go through in both cases (unless the Kubelet missed the first update and received both at the same time).
Updating container images is already a pretty niche use.
A dedicated metric feels like overkill. A log line feels more appropriate. WDYT?
| ups) significantly increases the API surface of the mutable / allocatable portion of the Pod spec, | ||
| and simply mirroring these fields into the pod status will drastically inflate the size of the pod | ||
| object. Similarly, duplicating every pod into an `AllocatedPod` object would heavily inflate the API | ||
| server object count and data size. |
There was a problem hiding this comment.
is there then follow-up work with sig-cli and other clients to reflect pod status differently, now that we're not writing back updates? historically users used pod spec to find the state, but it sounds like there won't be spec reflection ?
There was a problem hiding this comment.
I'm not sure I understand the question.
... now that we're not writing back updates?
Are you referring to the proposal to have a local API that bypasses the KCP? If so, that is out-of-scope for this KEP.
The only way pod spec and allocated spec drift is very briefly after the pod is updated but before the Kubelet has admitted the update, or if a resize (including adding a new container with resources) is deferred (or infeasible).
|
|
||
| **Admission Configuration:** If custom admission webhooks are configured to intercept only `CREATE` | ||
| operations for Pods, they can be bypassed by adding a violating container via update. This risk will | ||
| be managed with user communications, but there won't be technical safeguards preventing it. |
There was a problem hiding this comment.
maybe a bit meta, but we could create an admission configuration that mandates pod webhooks also run on UPDATE? (I think, I actually am not positive that's possible but it seems like it could be)
There was a problem hiding this comment.
I can't think of any way to make this work. The issue is that webhooks can have side effects, so this would be a breaking change.
Or did you mean just provide something that users could optionally run to audit their clusters?
| authors: | ||
| - "@tallclair" | ||
| owning-sig: sig-node | ||
| participating-sigs: [] |
There was a problem hiding this comment.
I was chating with @elmiko about this KEP offline.
Have you floated this idea with sig-scheduling or sig-autoscaling?
It seems a pretty large change to go without involving other SIGs.
There was a problem hiding this comment.
I don't think this impacts much sig-scheduling since from day one, sig-scheduling treats pods as the smallest scheduling units, and they are doing the same with this proposal. Same / similar with today's autoscaling. But I do agree after the feature introduced to K8s, we can do better for autoscaling.
There was a problem hiding this comment.
From other hand, I agreed that this movement has concerns on the potential security implications and API exhaustion. We did talk to both teams.
There was a problem hiding this comment.
I don't think this impacts much sig-scheduling since from day one, sig-scheduling treats pods as the smallest scheduling units, and they are doing the same with this proposal.
my main scenario I want an answer on:
Say you have a node with 8 cores and someone requests a pod with 1 core.
- User resizes by adding containers and adds a container that uses 8 cores (pretend we can use entire node).
Will the scheduler be notified that the pod requests don't match the actual admitted value?
And then future pod requests will correctly detect that the pod is fully occupied?
Or will we get cases where pods see 1 core and then try to place more pods on that node. Only to be rejected at kubelet admission time.
Either way I think a scenario like this is worth exploring / testing on.
There was a problem hiding this comment.
Say you have a node with 8 cores and someone requests a pod with 1 core.
- User resizes by adding containers and adds a container that uses 8 cores (pretend we can use entire node).
Will the scheduler be notified that the pod requests don't match the actual admitted value?
And then future pod requests will correctly detect that the pod is fully occupied?
Yes, the scheduler gets the update immediately, potentially even before the Kubelet has actuated the resize. We call the resources described by the pod spec the "desired" resources, and the pod status also includes the allocated (what the Kubelet has admitted or "agreed to"), and actual. The scheduler uses the maximum of those three different resource values for each resource. So, as soon as the pod is resized (by adding the container), it's "desired" resources change. The scheduler gets the update through it's pod informer, and updates it's cached node info to reflect the new resource total for the node. New pods that go to be scheduled will do so based on the updated total. This is exactly how it works for pod resizes today, and this KEP doesn't really change anything here.
Or will we get cases where pods see 1 core and then try to place more pods on that node. Only to be rejected at kubelet admission time.
There is a race condition, where a pod is resized at the same time as a new pod is scheduled to the node, which can result in the scheduled pod being rejected by the Kubelet. We've discussed this at length and have a few potential solutions, but empirically found that the race window is so small that we decided it isn't a priority right now. See https://docs.google.com/document/d/1VdE-yCre69q1hEFt-yxL4PBKt9qOjVtasOmN-XK12XU/edit?resourcekey=0-KJc-YvU5zheMz92uUOWm4w&tab=t.0 for the latest discussion.
/cc @natasha41575
Either way I think a scenario like this is worth exploring / testing on.
Yes, but we've already done a lot of that. We can definitely continue to do so, but I think that should all be considered outside the scope of Dynamic Containers.
We should think of Dynamic Containers as a use case for pod resize, rather than something else that impacts scheduling and autoscaling.
There was a problem hiding this comment.
this KEP
This is a discussion for resize, not this KEP. This KEP doesn't propose any changes to the resize functionality.
Will there be a doc/KEP update with these findings?
There are many docs, but @tallclair linked the most relevant one above: https://docs.google.com/document/d/1VdE-yCre69q1hEFt-yxL4PBKt9qOjVtasOmN-XK12XU/edit?resourcekey=0-KJc-YvU5zheMz92uUOWm4w&tab=t.0#heading=h.clxvs733rwyx
There was a problem hiding this comment.
FWIW, we discussed with the sig-scheduling leads, the sig-autoscaling leads, and the kueue maintainers prior to IPPR GA, and had broad agreement that these issues don't need to be resolved now.
There was a problem hiding this comment.
Yes, the proposed changes does not impact scheduling directly, so we treat is as FYI.
There are still a few points worth to bring up
- The race between kubelet and scheduler becomes more prominent, so we should prioritize addressing it as part of WAS, since IPPR is already GA and wasn't considering WAS integration as a blocker.
- Obviously the change allows to bypass the scheduling, so the scheduling constraints we develop as part of WAS won't be useful when pods are resized.
- We don't have that yet, but if it happens that WAS introduces runtime constraints (constraints which have to be met not only during initial scheduling, but also runtime), then the problem of making arbitrary decisions outside of scheduler may become much more problematic.
There was a problem hiding this comment.
This is a discussion for resize, not this KEP. This KEP doesn't propose any changes to the resize functionality.
I guess to me the main thing I wonder is will this feature make the race condition worse?
We are accepting this race condition it seems and accepting this a problem that will be addressed in separate work?
There was a problem hiding this comment.
The only reason this would make the race condition worse is that it's potentially adding more use cases for resize. This doesn't change anything to make the race more likely, it just potentially adds more resize QPS.
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: tallclair The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
| ecosystem that pods are fundamentally immutable execution envelopes. This will require updates to | ||
| third-party controllers, service meshes, and logging tools. | ||
|
|
||
| ## Alternatives |
There was a problem hiding this comment.
The vision for this KEP seems to share some common ground with the new project @thockin is working on: https://github.com/agent-substrate/substrate. Could using a third-party project like substrate be considered as an alternative?
There was a problem hiding this comment.
Substrate is a potential client of this proposal. I think the alternative here is scheduling tasks at the application layer, totally outside of Kubernetes. That's the approach that substrate is currently taking, and the same for Ray. Part of the idea of Dynamic Containers is offering a way to pull those tasks down to the Kubernetes layer (as containers) so that they can gain all the benefits that come with that.
There was a problem hiding this comment.
The use cases seem very strongly in the application layer to me ...
There was a problem hiding this comment.
I had read the majority of the use cases in the KEP as being critical to support projects like substrate, but I was trying to see if we could make the capability an opt-in behavior at the Pod level so not all other workload controllers need to reason around it similar to how we had to handle performance sensitive workloads and/or telco scenarios.
In addition, it would be ideal in my opinion if expression of this new behavior could be restricted via RBAC via a dedicated handler so the workload controllers that depend on this capability can acquire this power rather than all other workload controllers/actors that can edit a Pod today do not inherit it by surprise.
I do worry there may be unintended side effects not fully reasoned through otherwise. Both ephemeral containers and abuse of existing debug handlers (exec/attach) have caught us by surprise, so being able to limit that potential surprise is my primary concern.
| - Device attachment & initialization | ||
| - Generic application-level initialization (via InitContainers) | ||
|
|
||
| Decoupling these elements from workload startup will be critical in achieving the sub-100ms latency required for interactive agentic workloads. |
There was a problem hiding this comment.
achieving the sub-100ms latency - for startup? all operations?
There was a problem hiding this comment.
Request to starting code execution is the metric we care about. This is not yet achievable without changes at the runtime layer, but with v1.36 changes we're close to this target for container resize operations (assuming available resources).
|
|
||
| #### The Agentic Sandbox (AI-Native) | ||
|
|
||
| An "Agent" container spawns ephemeral tool-execution sandboxes on-the-fly within preallocated pods. |
There was a problem hiding this comment.
should be:
An "Agent" orchestrator
instead of:
An "Agent" container
Right?
There was a problem hiding this comment.
I think it's a bit fuzzy. There are a few different architectural models for this, and I think what "agent" refers to is a bit fuzzy... I just added an or here.
| Kubelet will not make any further allocations once the pod has entered a terminated (or | ||
| terminating) phase. | ||
| - The pod must be in an initialized state (i.e. all init containers have completed) before any dynamic container changes can be made. | ||
| - This proposal does not allow for general container mutation. A corollary is that a container |
There was a problem hiding this comment.
can new containers with new unique names be added? if so what is our upper limit as to how many?
what happens when there aren't enough cpu/mem, do we keep the static stability of what was there before and only the new call to /dynamic fails (and pod stays Running)?
There was a problem hiding this comment.
can new containers with new unique names be added?
Yes. That is part of the core proposal: add new containers with unique names, or remove existing containers, but it excludes arbitrary updates to existing containers.
if so what is our upper limit as to how many?
I didn't include one since we don't have an upper limit no create. We could use this as an opportunity to add a limit, but I don't know what a reasonable upper limit would be without benchmarking.
what happens when there aren't enough cpu/mem, do we keep the static stability of what was there before and only the new call to /dynamic fails (and pod stays
Running)?
Are you asking about requested resources? If so the answer is that we treat this as an in-place resize. If the new requests can't be allocated, the changes are put into a deferred state. See the section titled `### Allocation.
If you're asking about runtime resource contention, then there aren't any changes to the existing resource management mechanisms, including OOM killing, CPU scheduling & throttling, PSI and eviction.
There was a problem hiding this comment.
if so what is our upper limit as to how many?
We discussed this frequently on several topics (overcommitted node, pod density, etc. ), not limited to this one, and agreed to define a max containers per running pod at a given time. @tallclair Maybe we can initially start with 10 containers for mutable pod? No change to existing inmutable ones?
|
|
||
| #### Alpha | ||
|
|
||
| - Feature implemented behind a feature flag (`DynamicContainers`). |
There was a problem hiding this comment.
We expect this flag on apiserver + kubelet right? (anything else?)
There was a problem hiding this comment.
Yeah, just those 2. This is covered in the KEP metadata.
|
|
||
| This proposal does not introduce any new fields. The API changes are limited to two new subresources | ||
|
|
||
| 1. `/dynamic`: Enables the new update functionality, in addition to updates allowed on the main |
There was a problem hiding this comment.
any alternate names were considered? (for /dynamic)
There was a problem hiding this comment.
Yes, lots. I don't like dynamic, but I didn't like anything else better. If you have any suggestions I'm all ears...
/active: Dangling adjective; too generic/additions: Restrictive; ignores container removals./adjustments: Candidate for consideration./allocate: Verb (bad for GET); typo-prone next to/allocated./allocated: Updates are on the desired spec, not the allocated spec./allocatedPod: Seeallocated./allocation: Seeallocate/attachments: Works for volumes, but weird for modifying existing containers./augment: Weird for removels./augmentations: See augment./blueprint: Abstract; lacks K8s precedent./budget: Implies resource bounds only; ignores structural components./components: Too vague (pods have immutable components too)./composition: Abstract./config: Collides with ConfigMaps and component config./configuration: See config./desired: Candidate for consideration, but lacks precedent./dynamicAllocation/dynamicAllocations/dynamicConfiguration/dynamicEnvelope: Introduces non-standard vocabulary to core K8s./dynamicManifest: Overloaded term./dynamicPod: Path redundancy (stuttering); falsely implies an API Kind./dynamicSandbox: Doesn't capture non-sandbox changes./dynamicSpec: Inaccurate; metadata is also mutable, breaking thespecboundary./dynamicState: Too similar to.status./envelope: Introduces new vocabulary to K8s./execution: Colides with exec/extensions: Implies third-party plugins/flex: Colides with flex volumes./fluid: Unprecedented term, sort of weird./fluidAllocation: see fluid./intent: Abstract control-theory jargon./live: Colides withlivenessand other unrelated concepts./livespec: Seelive. Splits spec across sub-paths; inaccurate due to metadata changes./managed: Overloaded term./manifest: Overloaded term./modifications: Generic; lacks standard K8s API precedent./mutable: Candidate for consideration, but doesn't capture the difference from other mutable endpoints/mutate: Verb; semantically broken for GET requests./mutations: Implies history of mutations./objective/overrides: Similar issues tomutations, and implies layering./pending: Collides with thePendingpod phase/proposal/proposed/reallocate: Verb; semantically broken for GET requests./reallocation/reallocations/reconfigure: Verb/request(s): Collides with pod resourcerequests./revisions: Generic, similar issues tomutations/runtime: Collides with Container Runtime Interface (CRI)./runtimeAllocation: Collides with DRA and CRI./sandbox: Non-standard vocabulary for core API paths, implies pod sandbox changes./staged: Considered./target: Abstract./workload: Misses the storage/metadata aspect.
There was a problem hiding this comment.
what about hotswap ? or hotplug or something with the hot * concept or just hot ?
There was a problem hiding this comment.
my $0.02, something like /spec seems clearer than /dynamic to me as an indicator that it's allowing updating spec, even if it includes mutable metadata
There was a problem hiding this comment.
Maybe /runningSpec? That distinguishes it from updates to the base resource, although it's still confusing that you can modify container images through that... I'd also consider /proposedSpec or /desiredSpec, or maybe even /dynamicSpec.
There was a problem hiding this comment.
From the sig-auth discussion, there was an emphasis on wanting to solve this only once, and declaring that ALL mutations are possible in the future through this new endpoint. From that perspective, leaning into the mutability aspect on the name seems better. Maybe something like /mutable or even /fullymutable.
There was a problem hiding this comment.
I'm not sure we need to go overboard in making the subresource sound scary. Since mutations would still be constrained to spec and metadata, I'd still try to include spec somewhere in the name.
/spec, /dynamicspec, and /mutablespec all seem plausible to me (we don't have any mixed case subresources today, and lowercase ephemeralcontainers and portforward)
There was a problem hiding this comment.
Hoisting from another thread, if there's a field in the pod that toggles whether this is allowed, e.g. mutable: true, coordinating the subresource name with the name of that field could make sense.
For example, spec.mutable: true paired with /mutablespec, and calling update on /mutablespec on any pod without that field set to true errors.
(that leads to a meta question about whether you can mutate that field to false via an update on /mutablespec, which is fun 🤯 )
|
|
||
| The API server will resume accepting additions and removals to `.spec.containers`, and the Kubelet will resume reconciling the changes. | ||
|
|
||
| There are no storage changes in the API. |
There was a problem hiding this comment.
Note that @derekwaynecarr is proposing a field addition to designate pods as dynamic/mutable at creation time.
There was a problem hiding this comment.
@tallclair i defer on this point to others as well, but i think as we facilitate broader and broader mutability, i think some users would still like to preserve an intent for immutability on some of these items. similar to how we have resizePolicy we may want to explore mutationPolicy and let users say yes/no to particular types of runtime mutation (containers, volumes, etc.).
There was a problem hiding this comment.
It would certainly be easier to reason about a cluster if the pods which are subject to mutation this way are flagged via a field.
| Future extensions that enable additional pod dynamism and mutability (such as dynamic volume | ||
| provisioning) will be added to this same subresource. | ||
|
|
||
| #### Limitations |
There was a problem hiding this comment.
How are we enforcing this limitations?
I'd also expect some kind of description for this in our test plan.
There was a problem hiding this comment.
These are all enforced through API validation, and covered by the pkg/apis/core/validation unit tests (already listed)
I don't really see the parallel to writable cgroups. Is the connection that these are both sig-node projects with implications for admission control?
This is a much bigger discussion, but many of us are worried about Kubernetes being left behind and becoming irrelevant as the tech stack shifts if Kubernetes doesn't evolve to meet the new demands.
There seems to be some fundamental disagreement about the scope of SIG-Node and SIG-Architecture. Independent of what direction this KEP goes in, I think that's a bigger conversation that we need to have. |
The thread started in sig-architecture as a rally point we all pay attention to as a courtesy to you. The concerns of this KEP extend significantly beyond sig-node and should have been recognized as such and left open to community comment for a reasonable period. |
|
/assign @derekwaynecarr |
|
|
||
| This proposal does not introduce any new fields. The API changes are limited to two new subresources | ||
|
|
||
| 1. `/dynamic`: Enables the new update functionality, in addition to updates allowed on the main |
There was a problem hiding this comment.
my $0.02, something like /spec seems clearer than /dynamic to me as an indicator that it's allowing updating spec, even if it includes mutable metadata
| authors: | ||
| - "@tallclair" | ||
| owning-sig: sig-node | ||
| participating-sigs: [] |
There was a problem hiding this comment.
The impact of the direction this is proposing is really significant both within sig-node (in kubelet) and outside sig-node (potentially ~every pod-admitting policy, ~every pod-creating controller, ~every pod-observing controller). I'm surprised to not see any other sigs here.
I think folks outside sig-node (like myself) are trying to wrap their heads around:
- what changes are required for non-kubelet actors, are those changes feasible, and can we avoid doing that work until we're sure sig-node is committed to this path?
- do we have to ripple changes for this out to all responsibly-maintained pod-involved components, or are there ways to bound this (ideally by default) so that we only have to ripple out where it makes sense?
- what happens for all the pod-involved things that never update to consider this?
There was a problem hiding this comment.
For sig-node components, I suspect the effort to make kubelet behave well when ~anything about the pod can change is far bigger than currently described. We have a long tail of kubelet issues around pod lifecycle handling in weird edges (kubernetes/kubernetes#124701, kubernetes/kubernetes#116481, kubernetes/kubernetes#122161 are just a few off the top of my head), and this multiples the weird states it's possible to get into.
Is there a way to front-load the work with kubelet improvements that both make the current state better and more reliable (ideally tied to resolving long-lived lifecycle bugs) and would happen to make the kubelet handle scenarios like mutable pods better, but are not specifically tied to dynamic containers? I suspect there are probably some ~rewrites of kubelet subloops needed (for example, I'm not sure how many of it items identified in Kubelet state of the state are still outstanding)
If we merge an alpha where the pod can mutate, I expect to immediately start seeing merges of fixes targeted to compensating for weird scenarios we see in the kubelet specifically around pod mutation. Those will be difficult to back out if we decide not to proceed with the mutable pod feature.
Improving kubelet in master to correctly handle the transitions this introduces, so the ~only changes needed would be the more permissive spec-mutating subresource on the API side, would build confidence that things added under alpha weren't going to be difficult to back out if they didn't work out.
There was a problem hiding this comment.
@liggitt thank you for the comments, very constructive.
The impact of the direction this is proposing is really significant both within sig-node (in kubelet) and outside sig-node (potentially ~every pod-admitting policy, ~every pod-creating controller, ~every pod-observing controller). I'm surprised to not see any other sigs here.
Agreed with you on this and Derek also made the similar suggestion on bounding the blast radius for non-kubelet actors. Tim and I completely agree and to avoid a massive ripple effect across the ecosystem, as we noted on the SIG Arch mailing list, KEP was updated completely isolate this capability behind a dedicated API subresource rather than expanding the standard Pod UPDATE validation.
I will leave @tallclair on the design details. But I believe with that change, existing controllers, webhooks, and observing tools would experience zero behavior change.
Is there a way to front-load the work with kubelet improvements that both make the current state better and more reliable (ideally tied to resolving long-lived lifecycle bugs) and would happen to make the kubelet handle scenarios like mutable pods better, but are not specifically tied to dynamic containers? I suspect there are probably some ~rewrites of kubelet subloops needed (for example, I'm not sure how many of it items identified in Kubelet state of the state are still outstanding)
I will leave @yujuhong and @SergeyKanzhelev to answer your overall concerns on pod lifecycle fragility and reliability. But just want to point out that during developing in-place pod resizing and pod level resource, some those old bugs in pod lifecycle were identified and addressed.
On another hand, I completely agree that if we were introducing dynamic volumes or mutable DRA resource claims today, the Kubelet lifecycle complexity would be overwhelming. That is exactly why we explicitly made those non-goals for this initial alpha.
In fact, compared to recent features like Sidecar Containers or in-place pod restart or EvictionRequest API, this alpha proposal is intentionally minimized. By leveraging the exact same atomic checkpointing framework established by In-Place Pod Resizing, adding or removing a container at the Kubelet layer translates directly to existing CreateContainer and KillContainer primitives inside computePodActions. @tallclair please correct me if I misunderstood your proposal.
We are intentionally starting with this tightly bounded scope precisely so we can isolate and test the foundational Kubelet state transitions cleanly, without the added complexity of dynamic storage or shifting QoS boundaries.
There was a problem hiding this comment.
Thanks for your input. There's a lot to respond to here, so I'll take it one piece at a time.
The impact of the direction this is proposing is really significant both within sig-node (in kubelet) and outside sig-node (potentially ~every pod-admitting policy, ~every pod-creating controller, ~every pod-observing controller). I'm surprised to not see any other sigs here.
Yeah, after this discussion it's clear that I need to involve sig-auth. I'm just not sure what the order of operations are at this point. Does sig-auth input need to be alpha-blocking?
I think folks outside sig-node (like myself) are trying to wrap their heads around:
- what changes are required for non-kubelet actors, are those changes feasible, and can we avoid doing that work until we're sure sig-node is committed to this path?
How many of those actors are weighing in on KEPs though? Isn't beginning alpha development a stronger signal to start soliciting feedback?
- do we have to ripple changes for this out to all responsibly-maintained pod-involved components, or are there ways to bound this (ideally by default) so that we only have to ripple out where it makes sense?
I don't see any way around it. Either the component cares about things at the container level, in which case it will need to do work no matter where we put the dynamic containers, or how we control their availability, or the component doesn't care about containers, in which case there shouldn't be any issues with the proposal as is? Maybe I'm missing something.
- what happens for all the pod-involved things that never update to consider this?
How is this different form any other pod change from this perspective? In some ways, this seems like it would be less impcting since it isn't introducing any new API fields.
For sig-node components, I suspect the effort to make kubelet behave well when ~anything about the pod can change is far bigger than currently described. We have a long tail of kubelet issues around pod lifecycle handling in weird edges (kubernetes/kubernetes#124701, kubernetes/kubernetes#116481, kubernetes/kubernetes#122161 are just a few off the top of my head), and this multiples the weird states it's possible to get into.
Notably, those are all issues around pod startup and termination, neither of which are really affected by this KEP. We should take care around termination handling, but I don't anticipate issues there. As I said elsewhere, things like new pod phases or dynamic volumes changes would be a more complex change. The container interface is already largely driven as by an imperative state reconciliation process that already handles most container changes.
Is there a way to front-load the work with kubelet improvements that both make the current state better and more reliable (ideally tied to resolving long-lived lifecycle bugs) and would happen to make the kubelet handle scenarios like mutable pods better, but are not specifically tied to dynamic containers? I suspect there are probably some ~rewrites of kubelet subloops needed (for example, I'm not sure how many of it items identified in Kubelet state of the state are still outstanding)
We did a lot of that work with in-place resize, and this proposal is building on the machinery that was already put in place for that. There is ongoing work to move more components to drawing from the AllocationManager, and it will make more sense to prioritize that work with this KEP.
If we merge an alpha where the pod can mutate, I expect to immediately start seeing merges of fixes targeted to compensating for weird scenarios we see in the kubelet specifically around pod mutation. Those will be difficult to back out if we decide not to proceed with the mutable pod feature.
Improving kubelet in master to correctly handle the transitions this introduces, so the ~only changes needed would be the more permissive spec-mutating subresource on the API side, would build confidence that things added under alpha weren't going to be difficult to back out if they didn't work out.
This sounds like a good backup plan if we can't merge this KEP, but I'm still not convinced that we can't do both in parallel.
There was a problem hiding this comment.
I will leave @yujuhong and @SergeyKanzhelev to answer your overall concerns on pod lifecycle fragility and reliability.
While Kubelet still has some edge cases during state transitions, the issues mentioned above mostly relate to pod setup and teardown. Dynamic containers should have minimal impact on these phases, as container mutations are mostly bounded within the pod's running state. As Dawn pointed out, while this KEP adds complexity to the container lifecycle, previous features have already laid some groundwork and also fixed related bugs. Therefore, the overall risk to state transitions is not as extensive.
|
|
||
| 1. `/dynamic`: Enables the new update functionality, in addition to updates allowed on the main | ||
| resource, and `/resize`. | ||
| 2. `/allocated`: Returns the allocated version of the pod, served directly from the Kubelet. |
There was a problem hiding this comment.
Will controllers need to maintain this across all pods? Since this is a subresource, and I guess is pulled from the kubelet, it won't be able to be listed or watched.
There was a problem hiding this comment.
No. This is explicitly intended as a debug endpoint. If we find that controllers are needing to rely on it, we should build it in as a proper (sub)resource. See the section titled #### Allocated Subresource.
|
|
||
| Dynamic container mutation is allowed only through the new `/dynamic` subresource. Update | ||
| permission on this subresource is NOT granted to the default `edit` Role, and must be explicitly | ||
| granted by a `cluster-admin`. |
There was a problem hiding this comment.
Will any pod-managing controllers need to use this (the In-Place Upgrades (Sidecars, Daemons) user story makes it seem like they will)? If so, the permission will have to be granted to the controller (e.g. the replicaset or job or daemonset controller), effectively making it available to any user who has control over one of the higher-order API types.
There was a problem hiding this comment.
That's out-of-scope. I could imagine adding it as an update strategy on some pod-managing controllers (maybe DaemonSets?), but I think that would be its own KEP. The use case I was considering was a sidecar-injecting admission controller. If you want to upgrade those injected sidecars today, you need to recreate the workload (although maybe you could make something work with in-place image updates). This only applies to 3rd-party use cases though.
|
|
||
| ### Risks and Mitigations | ||
|
|
||
| #### Third-Party Controllers that assume containers are static |
There was a problem hiding this comment.
This should be updated to reflect the latest design right?
There was a problem hiding this comment.
I don't think anything changed here, but I added an explicit note for admission controllers.
There was a problem hiding this comment.
I think @derekwaynecarr, @liggitt, @deads2k concerns are that we need a way to roll this out without basically tell the ecosystem to deal.
I am a hard no on this being solved with "extensive documentation and release notes". Very few people read these things and we can't expect the entire ecosystem to be aware that we are now saying pods are fully mutable.
I like the idea of a subresource but I think we need something stronger.
I see that this is a beta criteria so maybe update this section to reflect.
|
|
||
| ## Motivation | ||
|
|
||
| Dynamic Container mutation is the next step in an ongoing effort to transform Kubernetes into a more dynamic execution environment that is capable of meeting the evolving demands of modern AI workloads, including high-performance batch workloads leveraging special-purpose frameworks, and low-latency execution environments for agentic workloads. |
There was a problem hiding this comment.
Any more here on motivations? I'd be interested in seeing more on on why this is necessary
There was a problem hiding this comment.
Are the user stories what you're looking for? I don't really understand why the KEP template doesn't put user stories with motivation. Also note that the hierarchical scheduling and elimination of initialization overhead are the main drivers for this (those are meant to be part of the motivation section). Can you elaborate more on what additional detail you're looking for?
You might also want to check out @dchen1107's doc: https://docs.google.com/document/d/1J8Aq0XzN8BiNdWHXSEGA1Xw2nXcZRSKTMoi-tNh7FTc/edit?pli=1&tab=t.0#heading=h.apbhfrpbxmw0
|
My takeaways from the SIG-Auth discussion:
See SIG-Auth Meeting Agenda for detailed notes. @deads2k @liggitt @enj @micahhausler anything you want to add? |
One-line PR description: Introducing Dynamic Containers: making containers mutable.
Issue link: Dynamic Pod Mutation: Optimistic Execution & Hierarchical Resource Delegation #5972
Adapted to KEP format from https://docs.google.com/document/d/1LUttwty4xIU9gcLGSvnu1qVqAztxT3kl53ZdpM8q7o0/edit?tab=t.0#heading=h.px3kmst2742n
/sig node
/assign @dchen1107 @haircommander
/label api-review