KEP-5986: Per-container memory pressure eviction by sohankunkerkar · Pull Request #6141 · kubernetes/enhancements

sohankunkerkar · 2026-06-02T17:56:47Z

One-line PR description: Add KEP for per-container memory pressure detection using memory.events counters and PSI with two-stage remediation (relax memory.high, then evict)

Issue link: Per-container memory pressure eviction #5986

Other comments: This KEP depends on KEP-2570 (MemoryQoS) which sets memory.high on container cgroups

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds KEP-5986 documentation for a new kubelet feature that detects per-container memory pressure (via memory.events + PSI) and performs a two-stage remediation (relax memory.high, then evict) behind the MemoryHighEviction feature gate.

Changes:

- Introduces KEP metadata (kep.yaml) for KEP-5986 under SIG Node.
- Adds the full KEP design doc (README.md) detailing detection signals, remediation flow, config, metrics, and PRR answers.
- Adds the production readiness stub (keps/prod-readiness/sig-node/5986.yaml) for PRR tracking.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| keps/sig-node/5986-memory-high-eviction/kep.yaml | Registers KEP metadata and the `MemoryHighEviction` feature gate milestone/stage. |
| keps/sig-node/5986-memory-high-eviction/README.md | Provides the complete enhancement proposal and PRR questionnaire responses. |
| keps/prod-readiness/sig-node/5986.yaml | Adds the PRR tracking YAML entry for the KEP. |

sohankunkerkar · 2026-06-02T18:14:07Z

cc @haircommander @QiWang19

kannon92 · 2026-06-04T16:10:53Z

Disclaimer: This is an AI-assisted review of the PRR (Production Readiness Review) for alpha requirements. It should not be treated as a substitute for a human PRR approver's review.

PRR Alpha Review Summary — KEP-5986: Per-container memory pressure eviction

Feature Gate: MemoryHighEviction (kubelet, default OFF)
Stage: Alpha, targeting v1.37

Alpha Requirements: All Pass

| Area | Status |
| --- | --- |
| `kep.yaml` fields (`feature-gates`, `components`, `disable-supported`) | ✅ |
| Feature Enablement and Rollback (all 5 questions answered) | ✅ |
| Graduation Criteria (alpha, beta, GA defined) | ✅ |
| Test Plan (unit + e2e_node) | ✅ |
| Scalability (encouraged at alpha, all questions answered) | ✅ |

Strengths

- Double-gating design: Even with the feature gate ON, all thresholds default to 0 (no-op). Operators must explicitly configure thresholds to activate detection/eviction. This is an excellent safety pattern.
- Stage 1 remediation is non-destructive: Relaxing memory.high to memory.max returns to pre-MemoryQoS behavior. The hard limit is unchanged. This avoids unnecessary evictions.
- MemoryQoS dependency is well-documented: memory.events signals require MemoryQoS; PSI works independently. A validation warning is emitted when thresholds are set without MemoryQoS enabled.
- Beta/GA sections answered early: Rollout/Rollback, Monitoring, Dependencies, and Troubleshooting sections are filled in with substantive answers even though they are only required at beta.
- Comprehensive feature interaction table: Covers MemoryQoS, InPlacePodVerticalScaling, VPA, QoS classes, pod priority, sidecars, container restart, EvictionRequest API, PDB, DRA, swap, and more.
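
The double-gating point above can be illustrated with a minimal sketch. The field names below are hypothetical (the thread only states that the feature gate defaults to OFF and that all thresholds default to 0), so this shows the pattern rather than the KEP's actual configuration surface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PressureConfig:
    # Hypothetical field names, chosen for illustration only.
    feature_gate_enabled: bool = False      # MemoryHighEviction gate, default OFF
    throttle_events_threshold: int = 0      # memory.events "high" increments per window
    psi_some_avg10_threshold: float = 0.0   # PSI "some avg10" percentage


def detection_active(cfg: PressureConfig) -> bool:
    """Detection runs only if the gate is on AND at least one threshold is non-zero."""
    if not cfg.feature_gate_enabled:
        return False
    return cfg.throttle_events_threshold > 0 or cfg.psi_some_avg10_threshold > 0.0
```

With this shape, turning the gate on without configuring a threshold remains a no-op, which is the safety property the review highlights.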

Minor Items to Address

- Release Signoff Checklist: The "KEP approvers have approved the KEP status as implementable" checkbox is unchecked (README.md line 80). Should be checked once approvers confirm.
- PRR checkboxes: "Production readiness review completed" and "Production readiness review approved" are unchecked (README.md lines 88-89). Should be checked once PRR is approved.
- metrics field in kep.yaml: Not required until beta, but the KEP defines two metrics (kubelet_evictions{eviction_signal="memory.high.pressure"} and kubelet_memory_high_relaxed_total) in the README. Consider adding them to kep.yaml proactively.

Known Limitations (Acceptable for Alpha)

- Eviction-recreation loop: Acknowledged with reasonable mitigation (grace period bounds loop frequency, stage 1 reduces eviction rate, VPA integration planned for beta).
- No PDB support: Kubelet eviction does not check PDBs. EvictionRequest API integration at beta will address this.
- No production-validated default thresholds: All thresholds default to 0 (disabled). Data-driven defaults planned for beta based on alpha feedback.

Verdict

The KEP meets all alpha PRR requirements. The design is well-reasoned with strong safety guarantees for rollback and incremental adoption.

kannon92 · 2026-06-04T16:14:05Z

Really nice PRR work for alpha.

#6141 (comment)

Very minor nit.

/approve

k8s-ci-robot · 2026-06-04T16:14:20Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kannon92, sohankunkerkar

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

~~keps/prod-readiness/OWNERS~~ [kannon92]
keps/sig-node/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

haircommander · 2026-06-04T19:08:33Z

+- Stage 1 (relaxing `memory.high`) modifies container cgroup state, unlike
+  existing eviction which is read-only until the kill.
+
+## Alternatives


an alternative @kannon92 hinted at offline could be an external eviction controller that triggers an api level eviction. that may be too slow, so another alternative could be an eviction endpoint in the kubelet that the controller would hook into. I think structurally eviction code is a bit brittle and I do think it's worth considering whether we want to continue to extend in-tree. Can you put an agenda item in sig node next week to discuss @sohankunkerkar ?

Agree, worth discussing at sig-node. No kubelet eviction endpoint exists today, and building one would need its own design. An external controller still needs a node-local actor for stage 1 CRI writes, so it's a multi-KEP architecture split. This KEP follows the existing localStorageEviction parallel check pattern. It can proceed at alpha while the broader architecture discussion happens, since the two are orthogonal.

If we draw parallels to localStorageEviction, should there be a pod limit (like ephemeral-storage)?

Actually, memory.high already is the per-container limit. MemoryQoS computes it from the pod spec's memory request and limit and sets it on the cgroup. localStorageEviction needs ephemeral-storage limits because there's no kernel enforcement. Here the kernel enforcement (memory.high throttling) already exists. This KEP adds detection and remediation for when that throttling becomes sustained.
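
To make "computes it from the pod spec's memory request and limit" concrete, here is a minimal sketch. The formula and the 0.9 default throttling factor reflect my reading of KEP-2570 and should be treated as assumptions, not as a statement of the kubelet's actual code; the function name and the 4096-byte page size are hypothetical.

```python
from typing import Optional

PAGE_SIZE = 4096  # assumed page size in bytes


def memory_high(request_bytes: int, limit_bytes: int,
                throttling_factor: float = 0.9,
                page_size: int = PAGE_SIZE) -> Optional[int]:
    """Approximate memory.high for a container with a memory limit.

    Returns None when request == limit, since the thread notes that the
    kubelet skips memory.high for such (Guaranteed) containers.
    """
    if request_bytes == limit_bytes:
        return None
    # Assumed formula: request + factor * (limit - request), rounded down to a page.
    raw = request_bytes + throttling_factor * (limit_bytes - request_bytes)
    return int(raw // page_size) * page_size
```

For a container with a 1 GiB request and a 2 GiB limit, this places the throttle boundary at roughly 1.9 GiB, below the hard limit, so sustained usage near the limit is throttled before the kernel would OOM-kill.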

BTW an advantage to an external controller is better handling of https://github.com/kubernetes/enhancements/pull/6141/changes#r3358885558. In fact, something like KEDA could be close to being able to do this today. it may not have the knobs to trigger an api eviction, but that may be an easier extension in the grand scheme of changes, and it could better handle scaling up important pods, vs triggering an eviction for less important pods

haircommander · 2026-06-04T19:32:49Z

+Stage 1 needs to call CRI `UpdateContainerResources` to relax
+`memory.high`. The eviction manager does not have direct CRI access. A


I kind of think this should be tiered similar to memory protection. a guaranteed pod or high priority pod should have its memory.high increased, but a best-effort probably shouldn't. further, how do we make a signal to VPA that the memory limit of the pod should increase as well?

https://github.com/kubernetes/enhancements/pull/6141/changes#diff-afe656367ee3f1986c64f7208b9410337adee92ae1cfbedeac7372c046efab1eR459 hm jk

Guaranteed pods are exempt (no memory.high set). QoS-tiered stage 1 policy and VPA signal are not in alpha today, but good candidates for beta refinement.

haircommander · 2026-06-04T19:34:39Z

+- A warning event with reason `Evicted` and a message indicating which
+  container triggered the eviction and which signal was exceeded (e.g.,
+  "Container foo exceeded memory.high throttle threshold")
+- Annotations: `OffendingContainersKey` (throttled container name) and


annotations on what? the pod? that seems odd imo. probably something in status would be better

These are event annotations (AnnotatedEventf), not pod annotations. They're the same schema the node-pressure eviction path uses (OffendingContainersKey, StarvedResourceKey). The primary status signal is the DisruptionTarget pod condition.

haircommander · 2026-06-04T19:36:38Z

+evictions. Coupling two alpha features would create fragility since
+EvictionRequest's API may change before beta.
+
+At beta, the stage 2 eviction step routes through EvictionRequest, giving


who routes? the kubelet? seems at risk of timing issues where the kernel may act too fast

also this is another reason to me to have this managed by an out of tree controller. tbh

At alpha, no EvictionRequest routing. Stage 2 uses the same direct kubelet eviction path as all existing evictions. On timing, memory.high throttles but doesn't kill. The kernel only kills at memory.max/OOM, which exists with or without this feature.

> also this is another reason to me to have this managed by an out of tree controller. tbh

Even with external detection, stage 1 still needs a privileged node-local actor with CRI access for UpdateContainerResources. Stage 2 via API adds a control-plane hop. So the controller becomes another privileged node-local resource manager with its own stats pipeline and cgroup writes.

haircommander · 2026-06-04T19:38:38Z

+
+#### Stage 1 Dependencies
+
+Stage 1 needs to call CRI `UpdateContainerResources` to relax


btw a similar thing could be achieved by just increasing memory limit. then the memory.high would be increased (right?) and the overall goal of making the container less close to OOM would be achieved

I think stage 1 only removes memory.high, which is a throttle boundary the kubelet set via MemoryQoS. memory.max stays unchanged, pod spec stays unchanged. Increasing the memory limit is an API-level operation that changes memory.max, scheduling, and quota. The eviction manager has no API client to do that, and shouldn't. Stage 1 is scoped to undo what the kubelet did. If the workload stabilizes below memory.max, no eviction needed.
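
The two-stage flow discussed in this thread (relax memory.high first, evict only if pressure persists) can be sketched as a small per-container state machine. This is an illustration under stated assumptions: the names are hypothetical, the `pressured` input stands in for whatever memory.events/PSI check the KEP defines, and restarting the grace period after stage 1 is my assumption rather than something the thread confirms.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    NONE = "none"
    RELAX_MEMORY_HIGH = "relax memory.high to max"  # stage 1
    EVICT = "evict pod"                             # stage 2


@dataclass
class ContainerState:
    pressured_since: Optional[float] = None  # when sustained pressure began
    relaxed: bool = False                    # whether stage 1 already ran


def next_action(state: ContainerState, pressured: bool,
                now: float, grace_seconds: float) -> Action:
    """Decide what to do for one container on one sync cycle."""
    if not pressured:
        state.pressured_since = None  # pressure cleared, reset the timer
        return Action.NONE
    if state.pressured_since is None:
        state.pressured_since = now
    if now - state.pressured_since < grace_seconds:
        return Action.NONE            # still inside the grace period
    if not state.relaxed:
        state.relaxed = True
        state.pressured_since = now   # assumed: give stage 1 its own grace period
        return Action.RELAX_MEMORY_HIGH
    return Action.EVICT
```

The state kept across sync cycles (`pressured_since`, `relaxed`) is the kind of stateful tracking that distinguishes this proposal from the existing signal-based eviction path.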

HirazawaUi · 2026-06-07T15:08:10Z

+- Annotations: `OffendingContainersKey` (throttled container name) and
+  `StarvedResourceKey` ("memory")
+
+Stage 1 emits a warning event indicating that `memory.high` relaxation was


Should we consider adding a condition or a status field to indicate that a Pod is under high memory pressure after the first stage is triggered but before eviction? This could allow automated operators to trigger a Pod resize, especially since events are not always reliable.

Good point! I added a ContainerMemoryPressure pod condition at stage 1. It's queryable and watchable, unlike events which can be missed.
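
As a rough illustration of how an automated operator might consume such a condition, the sketch below checks a pod object (as a plain dict in the usual Kubernetes JSON shape) for it. Only the condition type name comes from this thread; the helper name and the assumption that the condition uses the standard "True"/"False" status strings are mine.

```python
def under_memory_pressure(pod: dict,
                          cond_type: str = "ContainerMemoryPressure") -> bool:
    """Return True if the pod status carries the condition with status "True"."""
    conditions = pod.get("status", {}).get("conditions") or []
    return any(
        c.get("type") == cond_type and c.get("status") == "True"
        for c in conditions
    )
```

An operator watching pods could call this on each update and trigger a resize when it flips to true, instead of relying on events.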

Signed-off-by: Sohan Kunkerkar <sohank2602@gmail.com>

haircommander · 2026-06-08T19:55:52Z

Some pieces of this feel a bit funky to me (changing eviction mgr semantics) but an overarching worry I have is this is actually possible with an external controller after we get eviction request API (or even today with triggering a graceful termination of a pod). An external controller could use the existing metrics exposed (like KEDA does today) and:

- for high priority/guaranteed pods, use IPPR to increase the memory limit, thus increasing the memory.high
- for low priority/burstable/best effort pods, either evict or terminate the pod if the containers are continually disobeying their memory.high.

I'm not currently seeing a reason why kubelet needs to be in charge of this. the best answer I have is it'd be faster, and thus potentially lower the likelihood the kernel steps in, but I'd want to see that in practice before feeling certain that's enough of a problem to warrant changing the existing eviction semantics to keep state like this is proposing

we have an item on sig node agenda tomorrow, i'll voice my concerns there as well. Sorry for not catching this earlier, I initially +1'd you opening it because I didn't think thoroughly enough about it

sohankunkerkar · 2026-06-09T04:33:36Z

> Some pieces of this feel a bit funky to me (changing eviction mgr semantics) but an overarching worry I have is this is actually possible with an external controller after we get eviction request API (or even today with triggering a graceful termination of a pod). An external controller could use the existing metrics exposed (like KEDA does today) and:
>
> - for high priority/guaranteed pods, use IPPR to increase the memory limit, thus increasing the memory.high
> - for low priority/burstable/best effort pods, either evict or terminate the pod if the containers are continually disobeying their memory.high.
>
> I'm not currently seeing a reason why kubelet needs to be in charge of this. the best answer I have is it'd be faster, and thus potentially lower the likelihood the kernel steps in, but I'd want to see that in practice before feeling certain that's enough of a problem to warrant changing the existing eviction semantics to keep state like this is proposing
>
> we have an item on sig node agenda tomorrow, i'll voice my concerns there as well. Sorry for not catching this earlier, I initially +1'd you opening it because I didn't think thoroughly enough about it

Thanks for the thoughtful feedback Peter, and no worries about the timing. This is exactly the kind of discussion that helps make KEPs better. I have a few thoughts I wanted to share below:

On the eviction manager semantics concern: I agree this is different from the current signal-based eviction flow because it introduces stateful tracking across sync cycles and grace periods. I’ve been prototyping an evictionCheck interface to make these checks a first-class extension point in synchronize(), similar to how localStorageEviction already works today. The idea is to keep this logic cleanly separated instead of mixing it into the existing signal-based path.

On the external controller idea: I agree stage 2 (eviction/termination) could already be implemented externally using graceful termination. For the detection + stage 1 remediation side though, I could not find an existing controller that covers this. KEDA is horizontal only, Kedify PRA supports in-place resize but based on utilization thresholds rather than memory.events + PSI throttle rates, and VPA does not react to real-time throttling. So the controller would still need to be designed and built for this use case. The kubelet already has the stats pipeline, CRI access, and eviction path, so for alpha it seemed like the shorter path to validating the behavior.

On IPPR vs stage 1: IPPR being GA definitely helps, but it is solving a different problem. IPPR changes pod resources through the API, goes through admission/quota flows, and persists as part of pod state. Stage 1 is intentionally temporary and local to the node. It only writes memory.high=max through CRI and reverts on restart without changing pod spec or quotas. Also, for Guaranteed containers, kubelet skips memory.high entirely (request == limit), so IPPR wouldn't change memory.high for those.

On latency: I agree I do not yet have production data showing the latency difference is important. That is a fair point. My thinking is that alpha is the right place to validate this. We can measure how quickly the in-tree path reacts to throttling compared to what an external scrape/reconcile loop could realistically achieve.

On graduation: the current KEP already plans to integrate EvictionRequest at beta for PDB-aware eviction, and potentially use InPlacePodResize at GA as a “resize before evict” path. So the longer-term direction is compatible with the controller-based ideas you mentioned as well.

I think the main question is whether we want to ship the in-tree detection + remediation path behind a feature gate for alpha now, or wait for a separate external controller design and implementation first.

haircommander · 2026-06-09T18:55:26Z

> So the controller would still need to be designed and built for this use case. The kubelet already has the stats pipeline, CRI access, and eviction path, so for alpha it seemed like the shorter path to validating the behavior.

that's fine, it's not terribly hard to write a controller. doing so externally speeds up development time and reduces the restriction to tie with k8s releases.

> It only writes memory.high=max through CRI and reverts on restart without changing pod spec or quotas. Also, for Guaranteed containers, kubelet skips memory.high entirely (request == limit), so IPPR wouldn't change memory.high for those.

good point, but the point still stands that burstable pods could be resized and get a different memory.high value. I think it's a bit weird to have some pods within the same qos class have memory.high and some not, especially because a restart will cause it to flap. better to persist in the api so a restart brings back to the old allocation. This is extra reason to me to go through api rather than sneak behind it

ajaysundark · 2026-06-13T16:22:16Z

/cc

sohankunkerkar · 2026-06-16T17:16:53Z

After discussing with @haircommander, we're pursuing an external operator for detection and remediation. The operator uses cadvisor pressure metrics (memory.events:high, PSI) for detection and InPlacePodResize for the first remediation step (raise the memory limit so kubelet recomputes memory.high). When IPPR doesn't resolve pressure (max attempts reached, admission rejection, or node capacity), the operator will escalate via the EvictionRequest API (KEP-4563) once it graduates, providing PDB-aware graceful eviction.
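
For readers unfamiliar with the two detection signals, here is a minimal sketch of what they look like at the kernel level. The operator is described as consuming cAdvisor metrics; this sketch instead parses the raw cgroup v2 `memory.events` and `memory.pressure` file formats purely to illustrate the data, and the function names are hypothetical.

```python
from typing import Dict


def parse_memory_events(text: str) -> Dict[str, int]:
    """Parse cgroup v2 memory.events content, e.g. 'high 42' -> {'high': 42}."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            counters[parts[0]] = int(parts[1])
    return counters


def parse_psi(text: str) -> Dict[str, Dict[str, float]]:
    """Parse PSI content such as 'some avg10=1.50 avg60=0.75 avg300=0.10 total=123'."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, pairs = fields[0], fields[1:]
        result[kind] = {k: float(v) for k, v in (p.split("=", 1) for p in pairs)}
    return result


def high_event_rate(prev_high: int, curr_high: int, interval_seconds: float) -> float:
    """Throttle events per second between two samples of the 'high' counter."""
    return max(curr_high - prev_high, 0) / interval_seconds
```

Because `high` is a cumulative counter, a detector would compare successive samples (as `high_event_rate` does) rather than alert on the absolute value, while PSI's `avg10` is already a windowed percentage.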

This covers the core detection + remediation flow but outside the kubelet. The stats API extension and ContainerMemoryPressure pod condition from this KEP can be revisited as separate proposals if needed. We'll evaluate the latency tradeoff with real numbers from the prototype and share results. The KEP stays open for now. If the external approach proves insufficient, we'll revisit with evidence.

Copilot AI review requested due to automatic review settings June 2, 2026 17:56

k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. labels Jun 2, 2026

k8s-ci-robot requested review from dchen1107 and palnabarun June 2, 2026 17:56

k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 2, 2026

Copilot AI reviewed Jun 2, 2026

Comment thread keps/prod-readiness/sig-node/5986.yaml Outdated

Comment thread keps/sig-node/5986-memory-high-eviction/README.md

Comment thread keps/sig-node/5986-memory-high-eviction/README.md Outdated

sohankunkerkar force-pushed the kep-5986-memory-high-eviction branch from d7f6c07 to 5078934 Compare June 2, 2026 18:11

sohankunkerkar force-pushed the kep-5986-memory-high-eviction branch from 5078934 to af72ccf Compare June 2, 2026 18:40

This was referenced Jun 2, 2026

KEP-4563: update EvictionRequest API #6074

Merged

Per-container memory pressure eviction #5986

Open

kannon92 reviewed Jun 4, 2026

Comment thread keps/prod-readiness/sig-node/5986.yaml Outdated

sohankunkerkar force-pushed the kep-5986-memory-high-eviction branch from af72ccf to 40c8893 Compare June 4, 2026 03:21

kannon92 reviewed Jun 4, 2026

Comment thread keps/sig-node/5986-memory-high-eviction/README.md Outdated

sohankunkerkar force-pushed the kep-5986-memory-high-eviction branch from 40c8893 to a41eb79 Compare June 4, 2026 15:46

kannon92 reviewed Jun 4, 2026

Comment thread keps/sig-node/5986-memory-high-eviction/kep.yaml

haircommander reviewed Jun 4, 2026

Comment thread keps/sig-node/5986-memory-high-eviction/README.md Outdated

haircommander reviewed Jun 4, 2026

sohankunkerkar force-pushed the kep-5986-memory-high-eviction branch from a41eb79 to 26ef134 Compare June 5, 2026 03:43

HirazawaUi reviewed Jun 7, 2026

sohankunkerkar force-pushed the kep-5986-memory-high-eviction branch from 26ef134 to f5caccd Compare June 8, 2026 04:01

KEP-5986: Per-container memory pressure eviction

586edfd

Signed-off-by: Sohan Kunkerkar <sohank2602@gmail.com>

sohankunkerkar force-pushed the kep-5986-memory-high-eviction branch from f5caccd to 586edfd Compare June 8, 2026 04:16

k8s-ci-robot requested a review from ajaysundark June 13, 2026 16:22
