
KEP-5986: Per-container memory pressure eviction #6141

Open
sohankunkerkar wants to merge 1 commit into kubernetes:master from sohankunkerkar:kep-5986-memory-high-eviction


@sohankunkerkar

Member
  • One-line PR description: Add KEP for per-container memory pressure detection using memory.events counters and PSI with two-stage remediation (relax memory.high, then evict)
  • Other comments: This KEP depends on KEP-2570 (MemoryQoS) which sets memory.high on container cgroups
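For readers unfamiliar with the two detection signals named in the description, here is a minimal sketch of reading them. It assumes the standard cgroup v2 file formats (`memory.events` as `<name> <count>` lines, `memory.pressure` as PSI `some`/`full` lines). The function names and sample values are illustrative and are not taken from the KEP.

```python
from typing import Dict


def parse_memory_events(text: str) -> Dict[str, int]:
    """Parse cgroup v2 memory.events: one "<name> <count>" pair per line."""
    events = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            events[parts[0]] = int(parts[1])
    return events


def parse_pressure(text: str) -> Dict[str, Dict[str, float]]:
    """Parse a PSI file such as memory.pressure into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, pairs = fields[0], fields[1:]
        result[kind] = {k: float(v) for k, v in (p.split("=", 1) for p in pairs)}
    return result


# Made-up sample contents, for illustration only.
SAMPLE_EVENTS = "low 0\nhigh 128\nmax 0\noom 0\noom_kill 0\n"
SAMPLE_PSI = (
    "some avg10=12.50 avg60=4.10 avg300=1.02 total=483921\n"
    "full avg10=3.25 avg60=1.00 avg300=0.20 total=120044\n"
)

events = parse_memory_events(SAMPLE_EVENTS)
psi = parse_pressure(SAMPLE_PSI)
```

The `high` counter increments each time the cgroup is throttled for exceeding `memory.high`, while the PSI `avg10` value is the share of the last ten seconds in which tasks stalled on memory.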

Copilot AI review requested due to automatic review settings June 2, 2026 17:56
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. labels Jun 2, 2026
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 2, 2026

Copilot AI left a comment


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds KEP-5986 documentation for a new kubelet feature that detects per-container memory pressure (via memory.events + PSI) and performs a two-stage remediation (relax memory.high, then evict) behind the MemoryHighEviction feature gate.

Changes:

  • Introduces KEP metadata (kep.yaml) for KEP-5986 under SIG Node.
  • Adds the full KEP design doc (README.md) detailing detection signals, remediation flow, config, metrics, and PRR answers.
  • Adds the production readiness stub (keps/prod-readiness/sig-node/5986.yaml) for PRR tracking.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| keps/sig-node/5986-memory-high-eviction/kep.yaml | Registers KEP metadata and the MemoryHighEviction feature gate milestone/stage. |
| keps/sig-node/5986-memory-high-eviction/README.md | Provides the complete enhancement proposal and PRR questionnaire responses. |
| keps/prod-readiness/sig-node/5986.yaml | Adds the PRR tracking YAML entry for the KEP. |


Comment thread keps/prod-readiness/sig-node/5986.yaml Outdated
Comment thread keps/sig-node/5986-memory-high-eviction/README.md
Comment thread keps/sig-node/5986-memory-high-eviction/README.md Outdated
@sohankunkerkar sohankunkerkar force-pushed the kep-5986-memory-high-eviction branch from d7f6c07 to 5078934 Compare June 2, 2026 18:11
@sohankunkerkar

Member Author

cc @haircommander @QiWang19

Comment thread keps/prod-readiness/sig-node/5986.yaml Outdated
@sohankunkerkar sohankunkerkar force-pushed the kep-5986-memory-high-eviction branch from af72ccf to 40c8893 Compare June 4, 2026 03:21
Comment thread keps/sig-node/5986-memory-high-eviction/README.md Outdated
@sohankunkerkar sohankunkerkar force-pushed the kep-5986-memory-high-eviction branch from 40c8893 to a41eb79 Compare June 4, 2026 15:46
Comment thread keps/sig-node/5986-memory-high-eviction/kep.yaml
@kannon92

kannon92 commented Jun 4, 2026

Contributor

Disclaimer: This is an AI-assisted review of the PRR (Production Readiness Review) for alpha requirements. It should not be treated as a substitute for a human PRR approver's review.

PRR Alpha Review Summary — KEP-5986: Per-container memory pressure eviction

Feature Gate: MemoryHighEviction (kubelet, default OFF)
Stage: Alpha, targeting v1.37

Alpha Requirements: All Pass

| Area | Status |
| --- | --- |
| kep.yaml fields (feature-gates, components, disable-supported) | Pass |
| Feature Enablement and Rollback (all 5 questions answered) | Pass |
| Graduation Criteria (alpha, beta, GA defined) | Pass |
| Test Plan (unit + e2e_node) | Pass |
| Scalability (encouraged at alpha, all questions answered) | Pass |

Strengths

  • Double-gating design: Even with the feature gate ON, all thresholds default to 0 (no-op). Operators must explicitly configure thresholds to activate detection/eviction. This is an excellent safety pattern.
  • Stage 1 remediation is non-destructive: Relaxing memory.high to memory.max returns to pre-MemoryQoS behavior. The hard limit is unchanged. This avoids unnecessary evictions.
  • MemoryQoS dependency is well-documented: memory.events signals require MemoryQoS; PSI works independently. A validation warning is emitted when thresholds are set without MemoryQoS enabled.
  • Beta/GA sections answered early: Rollout/Rollback, Monitoring, Dependencies, and Troubleshooting sections are filled in with substantive answers even though they are only required at beta.
  • Comprehensive feature interaction table: Covers MemoryQoS, InPlacePodVerticalScaling, VPA, QoS classes, pod priority, sidecars, container restart, EvictionRequest API, PDB, DRA, swap, and more.
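The double gating and the stage ordering described above can be sketched as a small decision function. This is an illustration under stated assumptions, not the KEP's implementation: the names `Action` and `decide`, the single scalar signal, and the idea that one grace period guards both stages are all hypothetical.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    RELAX_MEMORY_HIGH = "relax"  # stage 1: set memory.high to max
    EVICT = "evict"              # stage 2: evict the pod


def decide(gate_enabled: bool, threshold: float, signal: float,
           already_relaxed: bool, seconds_over: float,
           grace_seconds: float) -> Action:
    """Pick the next remediation step for one container (hypothetical policy)."""
    # Double gating: the feature gate alone does nothing while the
    # threshold keeps its default of 0.
    if not gate_enabled or threshold <= 0:
        return Action.NONE
    # Pressure must exceed the threshold and last for the grace period.
    if signal <= threshold or seconds_over < grace_seconds:
        return Action.NONE
    # Stage 1 is tried first because it is non-destructive.
    if not already_relaxed:
        return Action.RELAX_MEMORY_HIGH
    return Action.EVICT
```

For example, `decide(True, 0, 99.0, False, 600, 30)` returns `Action.NONE` because the threshold is still at its disabled default.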

Minor Items to Address

  1. Release Signoff Checklist: The "KEP approvers have approved the KEP status as implementable" checkbox is unchecked (README.md line 80). Should be checked once approvers confirm.
  2. PRR checkboxes: "Production readiness review completed" and "Production readiness review approved" are unchecked (README.md lines 88-89). Should be checked once PRR is approved.
  3. metrics field in kep.yaml: Not required until beta, but the KEP defines two metrics (kubelet_evictions{eviction_signal="memory.high.pressure"} and kubelet_memory_high_relaxed_total) in the README. Consider adding them to kep.yaml proactively.

Known Limitations (Acceptable for Alpha)

  • Eviction-recreation loop: Acknowledged with reasonable mitigation (grace period bounds loop frequency, stage 1 reduces eviction rate, VPA integration planned for beta).
  • No PDB support: Kubelet eviction does not check PDBs. EvictionRequest API integration at beta will address this.
  • No production-validated default thresholds: All thresholds default to 0 (disabled). Data-driven defaults planned for beta based on alpha feedback.

Verdict

The KEP meets all alpha PRR requirements. The design is well-reasoned with strong safety guarantees for rollback and incremental adoption.

@kannon92

kannon92 commented Jun 4, 2026

Contributor

Really nice PRR work for alpha.

#6141 (comment)

Very minor nit.

/approve

@k8s-ci-robot

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kannon92, sohankunkerkar

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

- Stage 1 (relaxing `memory.high`) modifies container cgroup state, unlike
existing eviction which is read-only until the kill.

## Alternatives

Contributor

an alternative @kannon92 hinted at offline could be an external eviction controller that triggers an API-level eviction. that may be too slow, so another alternative could be an eviction endpoint in the kubelet that the controller would hook into. I think structurally eviction code is a bit brittle and I do think it's worth considering whether we want to continue to extend it in-tree. Can you put an agenda item in sig node next week to discuss @sohankunkerkar ?

Member Author

Agree, worth discussing at sig-node. No kubelet eviction endpoint exists today, and building one would need its own design. An external controller would still need a node-local actor for the stage 1 CRI writes, so it becomes a multi-KEP architecture split. This KEP follows the existing localStorageEviction parallel-check pattern. It can proceed at alpha while the broader architecture discussion happens, since the two are orthogonal.

Contributor

If we draw parallels to localStorageEviction, should there be a pod limit (like ephemeral-storage)?

Member Author

Actually, memory.high already is the per-container limit. MemoryQoS computes it from the pod spec's memory request and limit and sets it on the cgroup. localStorageEviction needs ephemeral-storage limits because there's no kernel enforcement. Here the kernel enforcement (memory.high throttling) already exists. This KEP adds detection and remediation for when that throttling becomes sustained.
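As a rough illustration of how MemoryQoS derives `memory.high` from the request and limit, the sketch below follows the formula published with KEP-2570 as best I recall it. The default throttling factor of 0.9, the 4 KiB page size, and the function name are assumptions and are not stated in this thread. The Guaranteed-style shortcut (request equals limit means no `memory.high`) mirrors what is said later in this conversation.

```python
from typing import Optional


def memory_high_bytes(request: int, limit: Optional[int],
                      node_allocatable: int,
                      throttling_factor: float = 0.9,  # assumed default
                      page_size: int = 4096) -> Optional[int]:
    """Return the memory.high value in bytes, or None when none would be set."""
    # Assumption: request == limit stands in for the Guaranteed case,
    # where the kubelet skips memory.high.
    if limit is not None and request == limit:
        return None
    # Containers without a limit are measured against node allocatable.
    ceiling = limit if limit is not None else node_allocatable
    raw = request + throttling_factor * (ceiling - request)
    # Round down to a whole number of pages.
    return int(raw // page_size) * page_size


GIB = 1024 ** 3
high = memory_high_bytes(request=1 * GIB, limit=2 * GIB, node_allocatable=8 * GIB)
```

With a 1 GiB request and a 2 GiB limit, the throttle boundary lands about 90% of the way from request to limit, so the container is throttled before it can reach `memory.max`.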

Contributor

BTW an advantage to an external controller is better handling of https://github.com/kubernetes/enhancements/pull/6141/changes#r3358885558. In fact, something like KEDA could be close to being able to do this today. It may not have the knobs to trigger an API eviction, but that may be an easier extension in the grand scheme of changes, and it could better handle scaling up important pods vs. triggering an eviction for less important pods.

Comment on lines +432 to +433
Stage 1 needs to call CRI `UpdateContainerResources` to relax
`memory.high`. The eviction manager does not have direct CRI access. A

Contributor

I kind of think this should be tiered similar to memory protection. a guaranteed pod or high priority pod should have its memory.high increased, but a best-effort probably shouldn't. further, how do we make a signal to VPA that the memory limit of the pod should increase as well?

Member Author

Guaranteed pods are exempt (no memory.high set). QoS-tiered stage 1 policy and VPA signal are not in alpha today, but good candidates for beta refinement.

Comment thread keps/sig-node/5986-memory-high-eviction/README.md Outdated
- A warning event with reason `Evicted` and a message indicating which
container triggered the eviction and which signal was exceeded (e.g.,
"Container foo exceeded memory.high throttle threshold")
- Annotations: `OffendingContainersKey` (throttled container name) and

Contributor

annotations on what? the pod? that seems odd imo. probably something in status would be better

Member Author

These are event annotations (AnnotatedEventf), not pod annotations. They're the same schema the node-pressure eviction path uses (OffendingContainersKey, StarvedResourceKey). The primary status signal is the DisruptionTarget pod condition.

evictions. Coupling two alpha features would create fragility since
EvictionRequest's API may change before beta.

At beta, the stage 2 eviction step routes through EvictionRequest, giving

Contributor

who routes? the kubelet? seems at risk of timing issues where the kernel may act too fast

Contributor

also this is another reason to me to have this managed by an out of tree controller. tbh

Member Author

At alpha, no EvictionRequest routing. Stage 2 uses the same direct kubelet eviction path as all existing evictions. On timing, memory.high throttles but doesn't kill. The kernel only kills at memory.max/OOM, which exists with or without this feature.

Member Author

> also this is another reason to me to have this managed by an out of tree controller. tbh

Even with external detection, stage 1 still needs a privileged node-local actor with CRI access for UpdateContainerResources. Stage 2 via API adds a control-plane hop. So the controller becomes another privileged node-local resource manager with its own stats pipeline and cgroup writes.


#### Stage 1 Dependencies

Stage 1 needs to call CRI `UpdateContainerResources` to relax

Contributor

btw a similar thing could be achieved by just increasing the memory limit. then the memory.high would be increased (right?) and the overall goal of making the container less close to OOM would be achieved

Member Author

I think stage 1 only removes memory.high, which is a throttle boundary the kubelet set via MemoryQoS. memory.max stays unchanged, and the pod spec stays unchanged. Increasing the memory limit is an API-level operation that changes memory.max, scheduling, and quota. The eviction manager has no API client to do that, and shouldn't. Stage 1 is scoped to undo what the kubelet did. If the workload stabilizes below memory.max, no eviction is needed.

@sohankunkerkar sohankunkerkar force-pushed the kep-5986-memory-high-eviction branch from a41eb79 to 26ef134 Compare June 5, 2026 03:43
- Annotations: `OffendingContainersKey` (throttled container name) and
`StarvedResourceKey` ("memory")

Stage 1 emits a warning event indicating that `memory.high` relaxation was

Contributor

Should we consider adding a condition or a status field to indicate that a Pod is under high memory pressure after the first stage is triggered but before eviction? This could allow automated operators to trigger a Pod resize, especially since events are not always reliable.

@sohankunkerkar sohankunkerkar Jun 8, 2026

Member Author

Good point! I added a ContainerMemoryPressure pod condition at stage 1. It's queryable and watchable, unlike events which can be missed.

Comment thread keps/sig-node/5986-memory-high-eviction/README.md Outdated
@sohankunkerkar sohankunkerkar force-pushed the kep-5986-memory-high-eviction branch from 26ef134 to f5caccd Compare June 8, 2026 04:01
Signed-off-by: Sohan Kunkerkar <sohank2602@gmail.com>
@sohankunkerkar sohankunkerkar force-pushed the kep-5986-memory-high-eviction branch from f5caccd to 586edfd Compare June 8, 2026 04:16
@haircommander

Contributor

Some pieces of this feel a bit funky to me (changing eviction mgr semantics) but an overarching worry I have is this is actually possible with an external controller after we get eviction request API (or even today with triggering a graceful termination of a pod). An external controller could use the existing metrics exposed (like KEDA does today) and:

  • for high priority/guaranteed pods, use IPPR to increase the memory limit, thus increasing the memory.high
  • for low priority/burstable/best effort pods, either evict or terminate the pod if the containers are continually disobeying their memory.high.

I'm not currently seeing a reason why kubelet needs to be in charge of this. the best answer I have is it'd be faster, and thus potentially lower the likelihood the kernel steps in, but I'd want to see that in practice before feeling certain that's enough of a problem to warrant changing the existing eviction semantics to keep state like this is proposing

we have an item on sig node agenda tomorrow, i'll voice my concerns there as well. Sorry for not catching this earlier, I initially +1'd you opening it because I didn't think thoroughly enough about it

@sohankunkerkar

Member Author

> Some pieces of this feel a bit funky to me (changing eviction mgr semantics) but an overarching worry I have is this is actually possible with an external controller after we get eviction request API (or even today with triggering a graceful termination of a pod). An external controller could use the existing metrics exposed (like KEDA does today) and:
>
>   • for high priority/guaranteed pods, use IPPR to increase the memory limit, thus increasing the memory.high
>   • for low priority/burstable/best effort pods, either evict or terminate the pod if the containers are continually disobeying their memory.high.
>
> I'm not currently seeing a reason why kubelet needs to be in charge of this. the best answer I have is it'd be faster, and thus potentially lower the likelihood the kernel steps in, but I'd want to see that in practice before feeling certain that's enough of a problem to warrant changing the existing eviction semantics to keep state like this is proposing
>
> we have an item on sig node agenda tomorrow, i'll voice my concerns there as well. Sorry for not catching this earlier, I initially +1'd you opening it because I didn't think thoroughly enough about it

Thanks for the thoughtful feedback Peter, and no worries about the timing. This is exactly the kind of discussion that helps make KEPs better. I have a few thoughts I wanted to share below:

On the eviction manager semantics concern: I agree this is different from the current signal-based eviction flow because it introduces stateful tracking across sync cycles and grace periods. I’ve been prototyping an evictionCheck interface to make these checks a first-class extension point in synchronize(), similar to how localStorageEviction already works today. The idea is to keep this logic cleanly separated instead of mixing it into the existing signal-based path.
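To make "stateful tracking across sync cycles" concrete, here is a minimal sketch of the kind of per-container state such a check would carry: the previous `memory.events` `high` count and its timestamp, turned into an events-per-second rate. The class name, the method signature, and the reset handling are assumptions for illustration and do not come from the prototype.

```python
from typing import Dict, Tuple


class ThrottleTracker:
    """Keeps the last counter sample per container between sync cycles."""

    def __init__(self) -> None:
        self._last: Dict[str, Tuple[int, float]] = {}  # id -> (count, time)

    def observe(self, container: str, count: int, now: float) -> float:
        """Record a sample and return throttle events per second since the last one."""
        prev = self._last.get(container)
        self._last[container] = (count, now)
        if prev is None:
            return 0.0  # first sample: no rate yet
        prev_count, prev_time = prev
        elapsed = now - prev_time
        # A lower count means the cgroup was recreated (container restart),
        # so the delta is meaningless for this cycle.
        if elapsed <= 0 or count < prev_count:
            return 0.0
        return (count - prev_count) / elapsed


tracker = ThrottleTracker()
first = tracker.observe("app", 100, now=0.0)
second = tracker.observe("app", 160, now=10.0)
```

This state is what makes the check different from the existing signal-based path, which compares a current observation against a threshold without remembering earlier cycles.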

On the external controller idea: I agree stage 2 (eviction/termination) could already be implemented externally using graceful termination. For the detection + stage 1 remediation side though, I could not find an existing controller that covers this. KEDA is horizontal only; Kedify PRA supports in-place resize, but based on utilization thresholds rather than memory.events + PSI throttle rates; and VPA does not react to real-time throttling. So the controller would still need to be designed and built for this use case. The kubelet already has the stats pipeline, CRI access, and eviction path, so for alpha it seemed like the shorter path to validating the behavior.

On IPPR vs stage 1: IPPR being GA definitely helps, but it is solving a different problem. IPPR changes pod resources through the API, goes through admission/quota flows, and persists as part of pod state. Stage 1 is intentionally temporary and local to the node. It only writes memory.high=max through CRI and reverts on restart without changing pod spec or quotas. Also, for Guaranteed containers, kubelet skips memory.high entirely (request == limit), so IPPR wouldn't change memory.high for those.

On latency: I agree I do not yet have production data showing the latency difference is important. That is a fair point. My thinking is that alpha is the right place to validate this. We can measure how quickly the in-tree path reacts to throttling compared to what an external scrape/reconcile loop could realistically achieve.

On graduation: the current KEP already plans to integrate EvictionRequest at beta for PDB-aware eviction, and potentially use InPlacePodResize at GA as a “resize before evict” path. So the longer-term direction is compatible with the controller-based ideas you mentioned as well.

I think the main question is whether we want to ship the in-tree detection + remediation path behind a feature gate for alpha now, or wait for a separate external controller design and implementation first.

@haircommander

Contributor

> So the controller would still need to be designed and built for this use case. The kubelet already has the stats pipeline, CRI access, and eviction path, so for alpha it seemed like the shorter path to validating the behavior.

that's fine, it's not terribly hard to write a controller. doing so externally speeds up development time and reduces the restriction to tie with k8s releases.

> It only writes memory.high=max through CRI and reverts on restart without changing pod spec or quotas. Also, for Guaranteed containers, kubelet skips memory.high entirely (request == limit), so IPPR wouldn't change memory.high for those.

good point, but the point still stands that burstable pods could be resized and get a different memory.high value. I think it's a bit weird to have some pods within the same qos class have memory.high and some not, especially because a restart will cause it to flap. better to persist in the api so a restart brings back to the old allocation. This is extra reason to me to go through api rather than sneak behind it

@ajaysundark


/cc

@k8s-ci-robot k8s-ci-robot requested a review from ajaysundark June 13, 2026 16:22
@sohankunkerkar

Member Author

After discussing with @haircommander, we're pursuing an external operator for detection and remediation. The operator uses cAdvisor pressure metrics (memory.events:high, PSI) for detection and InPlacePodResize for the first remediation step (raise the memory limit so kubelet recomputes memory.high). When IPPR doesn't resolve pressure (max attempts reached, admission rejection, or node capacity), the operator will escalate via the EvictionRequest API (KEP-4563) once it graduates, providing PDB-aware graceful eviction.
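The escalation order described here (resize first, evict when resizing cannot help) might look roughly like the sketch below. Everything in it is hypothetical: the `ResizeState` type, the 25% growth step, the default of three attempts, and the capacity check are placeholders, and an admission rejection would be reported by the API server rather than predicted locally.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ResizeState:
    attempts: int      # in-place resizes already tried for this container
    limit_bytes: int   # current memory limit


def next_step(state: ResizeState, under_pressure: bool, node_free_bytes: int,
              max_attempts: int = 3, growth: float = 1.25) -> Tuple[str, int]:
    """Return (action, memory_limit) where action is none, resize, or evict."""
    if not under_pressure:
        return ("none", state.limit_bytes)
    new_limit = int(state.limit_bytes * growth)
    increase = new_limit - state.limit_bytes
    # Escalate when retries are exhausted or the node cannot fit the bump.
    if state.attempts >= max_attempts or increase > node_free_bytes:
        return ("evict", state.limit_bytes)
    return ("resize", new_limit)


step = next_step(ResizeState(attempts=0, limit_bytes=1000), True, node_free_bytes=10_000)
```

In a real operator, the "evict" branch would be where the EvictionRequest is created, and a rejected resize would feed back in as another failed attempt.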

This covers the core detection + remediation flow but outside the kubelet. The stats API extension and ContainerMemoryPressure pod condition from this KEP can be revisited as separate proposals if needed. We'll evaluate the latency tradeoff with real numbers from the prototype and share results. The KEP stays open for now. If the external approach proves insufficient, we'll revisit with evidence.



7 participants