KEP-5986: Per-container memory pressure eviction #6141
sohankunkerkar commented on Jun 2, 2026
- One-line PR description: Add KEP for per-container memory pressure detection using memory.events counters and PSI with two-stage remediation (relax memory.high, then evict)
- Issue link: Per-container memory pressure eviction #5986
- Other comments: This KEP depends on KEP-2570 (MemoryQoS) which sets memory.high on container cgroups
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds KEP-5986 documentation for a new kubelet feature that detects per-container memory pressure (via memory.events + PSI) and performs a two-stage remediation (relax memory.high, then evict) behind the MemoryHighEviction feature gate.
Changes:
- Introduces KEP metadata (`kep.yaml`) for KEP-5986 under SIG Node.
- Adds the full KEP design doc (`README.md`) detailing detection signals, remediation flow, config, metrics, and PRR answers.
- Adds the production readiness stub (`keps/prod-readiness/sig-node/5986.yaml`) for PRR tracking.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| keps/sig-node/5986-memory-high-eviction/kep.yaml | Registers KEP metadata and the MemoryHighEviction feature gate milestone/stage. |
| keps/sig-node/5986-memory-high-eviction/README.md | Provides the complete enhancement proposal and PRR questionnaire responses. |
| keps/prod-readiness/sig-node/5986.yaml | Adds the PRR tracking YAML entry for the KEP. |
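
The two detection signals named in the overview (`memory.events` counters and PSI) can be illustrated with a small parser. This is a minimal sketch, not the KEP's implementation: it assumes the standard cgroup v2 text formats for `memory.events` (one `key value` pair per line) and `memory.pressure` (PSI `some`/`full` lines), and all function names are hypothetical.

```python
from typing import Dict


def parse_memory_events(text: str) -> Dict[str, int]:
    """Parse cgroup v2 memory.events content: one "key value" pair per line."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            counters[parts[0]] = int(parts[1])
    return counters


def parse_psi(text: str) -> Dict[str, Dict[str, float]]:
    """Parse cgroup v2 memory.pressure content (PSI "some" and "full" lines)."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, pairs = fields[0], fields[1:]
        # Each remaining field looks like "avg10=1.50" or "total=12345".
        result[kind] = {k: float(v) for k, v in (p.split("=", 1) for p in pairs)}
    return result


def high_events_per_second(prev: Dict[str, int], curr: Dict[str, int],
                           interval_s: float) -> float:
    """Growth rate of the `high` counter between two samples (hypothetical helper)."""
    delta = max(curr.get("high", 0) - prev.get("high", 0), 0)
    return delta / interval_s
```

A caller would sample both files on each sync cycle and compare the resulting rate and PSI averages against configured thresholds; the thresholds themselves are defined by the KEP and are not reproduced here.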
PRR Alpha Review Summary for KEP-5986: Per-container memory pressure eviction
Feature Gate:
Alpha Requirements: All Pass
Strengths
Minor Items to Address
Known Limitations (Acceptable for Alpha)
Verdict
The KEP meets all alpha PRR requirements. The design is well-reasoned with strong safety guarantees for rollback and incremental adoption.
Really nice PRR work for alpha. Very minor nit. /approve
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: kannon92, sohankunkerkar.
> - Stage 1 (relaxing `memory.high`) modifies container cgroup state, unlike
>   existing eviction which is read-only until the kill.
>
> ## Alternatives

An alternative @kannon92 hinted at offline could be an external eviction controller that triggers an API-level eviction. That may be too slow, so another alternative could be an eviction endpoint in the kubelet that the controller would hook into. I think structurally eviction code is a bit brittle and I do think it's worth considering whether we want to continue to extend in-tree. Can you put an agenda item in sig node next week to discuss @sohankunkerkar ?
Agree, worth discussing at sig-node. No kubelet eviction endpoint exists today, and building one needs its own design. An external controller still needs a node-local actor for stage 1 CRI writes, so it's a multi-KEP architecture split. This KEP follows the existing localStorageEviction parallel check pattern. It can proceed at alpha while the broader architecture discussion happens, since the two are orthogonal.
If we draw parallels to localStorageEviction, should there be a pod limit (like ephemeral-storage)?
Actually, memory.high already is the per-container limit. MemoryQoS computes it from the pod spec's memory request and limit and sets it on the cgroup. localStorageEviction needs ephemeral-storage limits because there's no kernel enforcement. Here the kernel enforcement (memory.high throttling) already exists. This KEP adds detection and remediation for when that throttling becomes sustained.
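
How MemoryQoS derives `memory.high` from the request and limit can be sketched as follows. This is an illustration under stated assumptions, not the kubelet's code: it uses the formula published for KEP-2570 as of Kubernetes 1.27 (request plus a throttling factor times the gap to the limit, or to node allocatable when no limit is set, rounded down to a page boundary) with the default `memoryThrottlingFactor` of 0.9. The function name and parameters are hypothetical, and the current KEP-2570 text should be treated as authoritative.

```python
from typing import Optional


def memory_high_bytes(request: int,
                      limit: Optional[int],
                      node_allocatable: int,
                      throttling_factor: float = 0.9,
                      page_size: int = 4096) -> int:
    """Approximate the memory.high value MemoryQoS would set for a container.

    Assumed formula (KEP-2570, Kubernetes 1.27):
      floor((request + factor * (ceiling - request)) / page_size) * page_size
    where ceiling is the memory limit, or node allocatable if no limit is set.
    """
    ceiling = limit if limit is not None else node_allocatable
    raw = request + throttling_factor * (ceiling - request)
    # Round down to a whole number of pages.
    return int(raw // page_size) * page_size


# Example: 512 MiB request, 1 GiB limit, default factor 0.9.
example = memory_high_bytes(512 * 1024**2, 1024**3, node_allocatable=8 * 1024**3)
```

Under these assumptions the throttle boundary sits below `memory.max`, which is why sustained throttling can occur well before the kernel would OOM-kill the container.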
BTW an advantage to an external controller is better handling of https://github.com/kubernetes/enhancements/pull/6141/changes#r3358885558. In fact, something like KEDA could be close to being able to do this today. It may not have the knobs to trigger an API eviction, but that may be an easier extension in the grand scheme of changes, and it could better handle scaling up important pods, vs triggering an eviction for less important pods.
> Stage 1 needs to call CRI `UpdateContainerResources` to relax
> `memory.high`. The eviction manager does not have direct CRI access. A

I kind of think this should be tiered similar to memory protection. a guaranteed pod or high priority pod should have its memory.high increased, but a best-effort probably shouldn't. further, how do we make a signal to VPA that the memory limit of the pod should increase as well?
Guaranteed pods are exempt (no memory.high set). QoS-tiered stage 1 policy and VPA signal are not in alpha today, but good candidates for beta refinement.
> - A warning event with reason `Evicted` and a message indicating which
>   container triggered the eviction and which signal was exceeded (e.g.,
>   "Container foo exceeded memory.high throttle threshold")
> - Annotations: `OffendingContainersKey` (throttled container name) and

annotations on what? the pod? that seems odd imo. probably something in status would be better
These are event annotations (AnnotatedEventf), not pod annotations. They're the same schema the node-pressure eviction path uses (OffendingContainersKey, StarvedResourceKey). The primary status signal is the DisruptionTarget pod condition.
> evictions. Coupling two alpha features would create fragility since
> EvictionRequest's API may change before beta.
>
> At beta, the stage 2 eviction step routes through EvictionRequest, giving

who routes? the kubelet? seems at risk of timing issues where the kernel may act too fast
also this is another reason to me to have this managed by an out of tree controller. tbh
At alpha, no EvictionRequest routing. Stage 2 uses the same direct kubelet eviction path as all existing evictions. On timing, memory.high throttles but doesn't kill. The kernel only kills at memory.max/OOM, which exists with or without this feature.
> also this is another reason to me to have this managed by an out of tree controller. tbh

Even with external detection, stage 1 still needs a privileged node-local actor with CRI access for UpdateContainerResources. Stage 2 via API adds a control-plane hop. So the controller becomes another privileged node-local resource manager with its own stats pipeline and cgroup writes.
> #### Stage 1 Dependencies
>
> Stage 1 needs to call CRI `UpdateContainerResources` to relax

btw a similar thing could be achieved by just increasing memory limit. then the memory.high would be increased (right?) and the overall goal of making the container less close to OOM would be achieved
I think stage 1 only removes memory.high, which is a throttle boundary the kubelet set via MemoryQoS. memory.max stays unchanged, pod spec stays unchanged. Increasing the memory limit is an API-level operation that changes memory.max, scheduling, and quota. The eviction manager has no API client to do that, and shouldn't. Stage 1 is scoped to undo what the kubelet did. If the workload stabilizes below memory.max, no eviction needed.
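
The two-stage flow discussed in this thread (relax `memory.high` first, evict only if pressure persists) can be sketched as a small per-container state machine. This is a hypothetical illustration, not the KEP's implementation: the `under_pressure` flag stands in for whatever the detection logic decides from `memory.events` and PSI, and the `grace_s` window and all names are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    NONE = "none"
    RELAX_MEMORY_HIGH = "relax_memory_high"  # stage 1
    EVICT = "evict"                          # stage 2


@dataclass
class ContainerState:
    pressured_since: Optional[float] = None  # start of current pressure window
    relaxed: bool = False                    # whether stage 1 already ran


def next_action(state: ContainerState, under_pressure: bool,
                now: float, grace_s: float = 30.0) -> Action:
    """Decide what to do for one container on one sync cycle."""
    if not under_pressure:
        # Pressure cleared: forget the window, take no action.
        state.pressured_since = None
        return Action.NONE
    if state.pressured_since is None:
        state.pressured_since = now
    if now - state.pressured_since < grace_s:
        return Action.NONE  # pressure not yet sustained
    if not state.relaxed:
        state.relaxed = True
        state.pressured_since = now  # restart the window before stage 2
        return Action.RELAX_MEMORY_HIGH
    return Action.EVICT
```

The state carried between calls is the kind of cross-cycle tracking that distinguishes this flow from the existing stateless, signal-based eviction checks.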
> - Annotations: `OffendingContainersKey` (throttled container name) and
>   `StarvedResourceKey` ("memory")
>
> Stage 1 emits a warning event indicating that `memory.high` relaxation was

Should we consider adding a condition or a status field to indicate that a Pod is under high memory pressure after the first stage is triggered but before eviction? This could allow automated operators to trigger a Pod resize, especially since events are not always reliable.
Good point! I added a ContainerMemoryPressure pod condition at stage 1. It's queryable and watchable, unlike events which can be missed.
Signed-off-by: Sohan Kunkerkar <sohank2602@gmail.com>
Some pieces of this feel a bit funky to me (changing eviction mgr semantics) but an overarching worry I have is this is actually possible with an external controller after we get eviction request API (or even today with triggering a graceful termination of a pod). An external controller could use the existing metrics exposed (like KEDA does today) and:
I'm not currently seeing a reason why kubelet needs to be in charge of this. The best answer I have is it'd be faster, and thus potentially lower the likelihood the kernel steps in, but I'd want to see that in practice before feeling certain that's enough of a problem to warrant changing the existing eviction semantics to keep state like this is proposing. We have an item on the sig node agenda tomorrow; I'll voice my concerns there as well. Sorry for not catching this earlier, I initially +1'd you opening it because I didn't think thoroughly enough about it.
Thanks for the thoughtful feedback Peter, and no worries about the timing. This is exactly the kind of discussion that helps make KEPs better. I have a few thoughts I wanted to share below:

On the eviction manager semantics concern: I agree this is different from the current signal-based eviction flow because it introduces stateful tracking across sync cycles and grace periods. I’ve been prototyping an

On the external controller idea: I agree stage 2 (eviction/termination) could already be implemented externally using graceful termination. For the detection + stage 1 remediation side though, I could not find an existing controller that covers this. KEDA is horizontal only, Kedify PRA supports in-place resize but based on utilization thresholds rather than memory.events + PSI throttle rates, and VPA does not react to real-time throttling. So the controller would still need to be designed and built for this use case. The kubelet already has the stats pipeline, CRI access, and eviction path, so for alpha it seemed like the shorter path to validating the behavior.

On IPPR vs stage 1: IPPR being GA definitely helps, but it is solving a different problem. IPPR changes pod resources through the API, goes through admission/quota flows, and persists as part of pod state. Stage 1 is intentionally temporary and local to the node. It only writes

On latency: I agree I do not yet have production data showing the latency difference is important. That is a fair point. My thinking is that alpha is the right place to validate this. We can measure how quickly the in-tree path reacts to throttling compared to what an external scrape/reconcile loop could realistically achieve.

On graduation: the current KEP already plans to integrate

I think the main question is whether we want to ship the in-tree detection + remediation path behind a feature gate for alpha now, or wait for a separate external controller design and implementation first.
that's fine, it's not terribly hard to write a controller. doing so externally speeds up development time and reduces the restriction to tie with k8s releases.
good point, but the point still stands that burstable pods could be resized and get a different memory.high value. I think it's a bit weird to have some pods within the same qos class have memory.high and some not, especially because a restart will cause it to flap. better to persist in the api so a restart brings back to the old allocation. This is extra reason to me to go through api rather than sneak behind it
/cc
After discussing with @haircommander, we're pursuing an external operator for detection and remediation. The operator uses cadvisor pressure metrics (memory.events:high, PSI) for detection and InPlacePodResize for the first remediation step (raise the memory limit so kubelet recomputes memory.high). When IPPR doesn't resolve pressure (max attempts reached, admission rejection, or node capacity), the operator will escalate via the EvictionRequest API (KEP-4563) once it graduates, providing PDB-aware graceful eviction. This covers the core detection + remediation flow but outside the kubelet. The stats API extension and ContainerMemoryPressure pod condition from this KEP can be revisited as separate proposals if needed. We'll evaluate the latency tradeoff with real numbers from the prototype and share results. The KEP stays open for now. If the external approach proves insufficient, we'll revisit with evidence.
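
The escalation policy described for the external operator (resize first, evict when resizing cannot help) can be sketched as a single decision function. This is only an illustration of the stated conditions: the function name, parameters, and return values are hypothetical, and no operator code has been published in this thread.

```python
def choose_remediation(under_pressure: bool,
                       resize_attempts: int,
                       max_attempts: int,
                       last_resize_rejected: bool,
                       node_has_headroom: bool) -> str:
    """Pick the operator's next step for a container.

    Returns "none", "resize" (in-place pod resize to raise the memory
    limit), or "evict" (escalate to an API-level eviction).
    """
    if not under_pressure:
        return "none"
    # Escalate when in-place resize cannot resolve the pressure:
    # attempts exhausted, admission rejected the resize, or the node is full.
    if (resize_attempts >= max_attempts
            or last_resize_rejected
            or not node_has_headroom):
        return "evict"
    return "resize"
```

Because every branch goes through the API, the resulting limits persist across restarts, which is the property raised earlier in the thread as an argument for the external approach.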