OOM killer behaviour #13384

WinterNis · 2026-05-19T08:13:40Z


Hi,

Yesterday we noticed some "weird" behaviour with the Talos OOM handler and were wondering whether it is expected or not.

Context:

We had a node that was using 85-90% of its RAM, and on top of that some pods started to spike, triggering the OOM handler.
We are not here to discuss the actual usage of the node (which was bad due to poor sizing on our part).

Nodes were running Talos 1.12 with the default OOM settings and no specific configuration for this part.

Observations:

Here are the (filtered) logs of the OOM handler:

1779096425927	2026-05-18T09:27:05.927Z	user: warning: [2026-05-18T09:27:21.131910481Z]: [talos] Sending SIGKILL to cgroup
1779096425927	2026-05-18T09:27:05.927Z	[talos] Sending SIGKILL to cgroup {"component": "controller-runtime", "controller": "runtime.OOMController", "cgroup": "/sys/fs/cgroup/kubepods/besteffort/pod1988631d-7143-4b06-888e-54d5dc6ccca6"}
1779096424927	2026-05-18T09:27:04.927Z	user: warning: [2026-05-18T09:27:20.132378481Z]: [talos] Sending SIGKILL to cgroup
1779096424927	2026-05-18T09:27:04.927Z	[talos] Sending SIGKILL to cgroup {"component": "controller-runtime", "controller": "runtime.OOMController", "cgroup": "/sys/fs/cgroup/kubepods/besteffort/pod1988631d-7143-4b06-888e-54d5dc6ccca6"}
1779096424431	2026-05-18T09:27:04.431Z	user: warning: [2026-05-18T09:27:19.637696481Z]: [talos] Sending SIGKILL to cgroup
1779096424430	2026-05-18T09:27:04.430Z	[talos] Sending SIGKILL to cgroup {"component": "controller-runtime", "controller": "runtime.OOMController", "cgroup": "/sys/fs/cgroup/kubepods/besteffort/pod1988631d-7143-4b06-888e-54d5dc6ccca6"}
1779096423930	2026-05-18T09:27:03.930Z	[talos] Sending SIGKILL to cgroup {"component": "controller-runtime", "controller": "runtime.OOMController", "cgroup": "/sys/fs/cgroup/kubepods/besteffort/pod1988631d-7143-4b06-888e-54d5dc6ccca6"}
1779096423930	2026-05-18T09:27:03.930Z	user: warning: [2026-05-18T09:27:19.132937481Z]: [talos] Sending SIGKILL to cgroup
1779096415439	2026-05-18T09:26:55.439Z	user: warning: [2026-05-18T09:27:10.643467481Z]: [talos] Sending SIGKILL to cgroup
1779096415439	2026-05-18T09:26:55.439Z	[talos] Sending SIGKILL to cgroup {"component": "controller-runtime", "controller": "runtime.OOMController", "cgroup": "/sys/fs/cgroup/kubepods/besteffort/pod1988631d-7143-4b06-888e-54d5dc6ccca6"}
1779096414927	2026-05-18T09:26:54.927Z	user: warning: [2026-05-18T09:27:10.131902481Z]: [talos] Sending SIGKILL to cgroup
1779096414927	2026-05-18T09:26:54.927Z	[talos] Sending SIGKILL to cgroup {"component": "controller-runtime", "controller": "runtime.OOMController", "cgroup": "/sys/fs/cgroup/kubepods/besteffort/pod1988631d-7143-4b06-888e-54d5dc6ccca6"}
1779096414337	2026-05-18T09:26:54.337Z	[talos] Sending SIGKILL to cgroup {"component": "controller-runtime", "controller": "runtime.OOMController", "cgroup": "/sys/fs/cgroup/kubepods/besteffort/pod1988631d-7143-4b06-888e-54d5dc6ccca6"}
1779096414337	2026-05-18T09:26:54.337Z	[talos] Sending SIGKILL to cgroup {"component": "controller-runtime", "controller": "runtime.OOMController", "cgroup": "/sys/fs/cgroup/kubepods/besteffort/pod1988631d-7143-4b06-888e-54d5dc6ccca6"}
1779096414337	2026-05-18T09:26:54.337Z	[talos] Sending SIGKILL to cgroup {"component": "controller-runtime", "controller": "runtime.OOMController", "cgroup": "/sys/fs/cgroup/kubepods/besteffort/pod1988631d-7143-4b06-888e-54d5dc6ccca6"}
1779096414337	2026-05-18T09:26:54.337Z	[talos] Sending SIGKILL to cgroup {"component": "controller-runtime", "controller": "runtime.OOMController", "cgroup": "/sys/fs/cgroup/kubepods/besteffort/pod1988631d-7143-4b06-888e-54d5dc6ccca6"}
1779096414337	2026-05-18T09:26:54.337Z	[talos] Sending SIGKILL to cgroup {"component": "controller-runtime", "controller": "runtime.OOMController", "cgroup": "/sys/fs/cgroup/kubepods/besteffort/pod1988631d-7143-4b06-888e-54d5dc6ccca6"}
1779096414337	2026-05-18T09:26:54.337Z	[talos] Sending SIGKILL to cgroup {"component": "controller-runtime", "controller": "runtime.OOMController", "cgroup": "/sys/fs/cgroup/kubepods/besteffort/pod1988631d-7143-4b06-888e-54d5dc6ccca6"}
1779096414337	2026-05-18T09:26:54.337Z	[talos] Sending SIGKILL to cgroup {"component": "controller-runtime", "controller": "runtime.OOMController", "cgroup": "/sys/fs/cgroup/kubepods/besteffort/pod1988631d-7143-4b06-888e-54d5dc6ccca6"}
1779096414337	2026-05-18T09:26:54.337Z	[talos] Sending SIGKILL to cgroup {"component": "controller-runtime", "controller": "runtime.OOMController", "cgroup": "/sys/fs/cgroup/kubepods/besteffort/pod1988631d-7143-4b06-888e-54d5dc6ccca6"}
1779095397927	2026-05-18T09:09:57.927Z	[talos] Sending SIGKILL to cgroup {"component": "controller-runtime", "controller": "runtime.OOMController", "cgroup": "/sys/fs/cgroup/kubepods/besteffort/pod1988631d-7143-4b06-888e-54d5dc6ccca6"}
1779095397927	2026-05-18T09:09:57.927Z	user: warning: [2026-05-18T09:10:13.124574481Z]: [talos] Sending SIGKILL to cgroup
1779095321169	2026-05-18T09:08:41.169Z	user: warning: [2026-05-18T09:08:44.733752481Z]: [talos] Sending SIGKILL to cgroup
1779095313854	2026-05-18T09:08:33.854Z	[talos] Sending SIGKILL to cgroup {"component": "controller-runtime", "controller": "runtime.OOMController", "cgroup": "/sys/fs/cgroup/kubepods/besteffort/pod1988631d-7143-4b06-888e-54d5dc6ccca6"}
1779095295927	2026-05-18T09:08:15.927Z	user: warning: [2026-05-18T09:08:31.125256481Z]: [talos] Sending SIGKILL to cgroup
1779095295926	2026-05-18T09:08:15.926Z	[talos] Sending SIGKILL to cgroup {"component": "controller-runtime", "controller": "runtime.OOMController", "cgroup": "/sys/fs/cgroup/kubepods/besteffort/pod1988631d-7143-4b06-888e-54d5dc6ccca6"}
1779095295427	2026-05-18T09:08:15.427Z	user: warning: [2026-05-18T09:08:30.624384481Z]: [talos] Sending SIGKILL to cgroup
1779095295427	2026-05-18T09:08:15.427Z	[talos] Sending SIGKILL to cgroup {"component": "controller-runtime", "controller": "runtime.OOMController", "cgroup": "/sys/fs/cgroup/kubepods/besteffort/pod1988631d-7143-4b06-888e-54d5dc6ccca6"}
1779095294927	2026-05-18T09:08:14.927Z	[talos] Sending SIGKILL to cgroup {"component": "controller-runtime", "controller": "runtime.OOMController", "cgroup": "/sys/fs/cgroup/kubepods/besteffort/pod1988631d-7143-4b06-888e-54d5dc6ccca6"}
1779095294927	2026-05-18T09:08:14.927Z	user: warning: [2026-05-18T09:08:30.124879481Z]: [talos] Sending SIGKILL to cgroup
1779095294426	2026-05-18T09:08:14.426Z	user: warning: [2026-05-18T09:08:29.624888481Z]: [talos] Sending SIGKILL to cgroup
1779095294426	2026-05-18T09:08:14.426Z	[talos] Sending SIGKILL to cgroup {"component": "controller-runtime", "controller": "runtime.OOMController", "cgroup": "/sys/fs/cgroup/kubepods/besteffort/pod1988631d-7143-4b06-888e-54d5dc6ccca6"}
1779095292934	2026-05-18T09:08:12.934Z	user: warning: [2026-05-18T09:08:28.133953481Z]: [talos] Sending SIGKILL to cgroup
1779095292934	2026-05-18T09:08:12.934Z	[talos] Sending SIGKILL to cgroup {"component": "controller-runtime", "controller": "runtime.OOMController", "cgroup": "/sys/fs/cgroup/kubepods/besteffort/pod1988631d-7143-4b06-888e-54d5dc6ccca6"}
1779095207427	2026-05-18T09:06:47.427Z	[talos] Sending SIGKILL to cgroup {"component": "controller-runtime", "controller": "runtime.OOMController", "cgroup": "/sys/fs/cgroup/kubepods/burstable/pod2954adc2-d5cd-4919-a609-bd888dd24fc6"}
1779095207427	2026-05-18T09:06:47.427Z	user: warning: [2026-05-18T09:07:02.625477481Z]: [talos] Sending SIGKILL to cgroup
1779095206927	2026-05-18T09:06:46.927Z	user: warning: [2026-05-18T09:07:02.124226481Z]: [talos] Sending SIGKILL to cgroup
1779095206927	2026-05-18T09:06:46.927Z	[talos] Sending SIGKILL to cgroup {"component": "controller-runtime", "controller": "runtime.OOMController", "cgroup": "/sys/fs/cgroup/kubepods/burstable/podf534cd7b-afc0-40ca-baf3-7a952d1da2e0"}
1779095206831	2026-05-18T09:06:46.831Z	user: warning: [2026-05-18T09:07:02.027771481Z]: [talos] Sending SIGKILL to cgroup
1779095206830	2026-05-18T09:06:46.830Z	[talos] Sending SIGKILL to cgroup {"component": "controller-runtime", "controller": "runtime.OOMController", "cgroup": "/sys/fs/cgroup/kubepods/burstable/podc6eb0aa5-d30e-4daf-8ea8-16c3c2c76f36"}
1779095178931	2026-05-18T09:06:18.931Z	user: warning: [2026-05-18T09:06:34.125662481Z]: [talos] Sending SIGKILL to cgroup
1779095178931	2026-05-18T09:06:18.931Z	[talos] Sending SIGKILL to cgroup {"component": "controller-runtime", "controller": "runtime.OOMController", "cgroup": "/sys/fs/cgroup/kubepods/besteffort/pod1988631d-7143-4b06-888e-54d5dc6ccca6"}

We can see two things:

The handler is trying to kill burstable/podc6eb0aa5-d30e-4daf-8ea8-16c3c2c76f36 three times.
The handler is trying to kill besteffort/pod1988631d-7143-4b06-888e-54d5dc6ccca6 repeatedly
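For anyone who wants to tally this kind of log rather than eyeball it, here is a minimal sketch. It assumes the log format shown above (a JSON object with a `cgroup` field at the end of each structured line); the `count_kills` helper and the shortened `SAMPLE` are made up for illustration and only contain four lines copied from the excerpt.

```python
import json
from collections import Counter

# Four lines copied from the excerpt above (hypothetical shortened sample).
SAMPLE = """\
1779095292934 2026-05-18T09:08:12.934Z [talos] Sending SIGKILL to cgroup {"component": "controller-runtime", "controller": "runtime.OOMController", "cgroup": "/sys/fs/cgroup/kubepods/besteffort/pod1988631d-7143-4b06-888e-54d5dc6ccca6"}
1779095206830 2026-05-18T09:06:46.830Z [talos] Sending SIGKILL to cgroup {"component": "controller-runtime", "controller": "runtime.OOMController", "cgroup": "/sys/fs/cgroup/kubepods/burstable/podc6eb0aa5-d30e-4daf-8ea8-16c3c2c76f36"}
1779095178931 2026-05-18T09:06:18.931Z user: warning: [2026-05-18T09:06:34.125662481Z]: [talos] Sending SIGKILL to cgroup
1779095178931 2026-05-18T09:06:18.931Z [talos] Sending SIGKILL to cgroup {"component": "controller-runtime", "controller": "runtime.OOMController", "cgroup": "/sys/fs/cgroup/kubepods/besteffort/pod1988631d-7143-4b06-888e-54d5dc6ccca6"}
"""


def count_kills(log_text: str) -> Counter:
    """Count SIGKILL log entries per "<qos>/pod<uid>" cgroup."""
    kills = Counter()
    for line in log_text.splitlines():
        if "Sending SIGKILL to cgroup" not in line:
            continue
        start = line.find("{")
        if start == -1:
            # "user: warning" echo of the same event, without structured fields
            continue
        fields = json.loads(line[start:])
        # Keep only the last two path components: "<qos>/pod<uid>"
        cgroup = "/".join(fields["cgroup"].split("/")[-2:])
        kills[cgroup] += 1
    return kills


for cgroup, count in count_kills(SAMPLE).most_common():
    print(count, cgroup)
```

Skipping the lines without a JSON payload avoids counting each kill twice, since every event appears once as a structured line and once as a `user: warning` echo.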

The first pod (burstable) was actually ... cilium.
So killing the CNI pod actually made things much worse on the node: some other pods started losing their network connection and piled up even more memory, as they were unable to send the data they were holding. At the time there were also plenty of other burstable pods (more relevant in terms of memory usage) that could have been killed instead.

The second pod (that gets killed repeatedly) was actually the only besteffort pod on the node.
So, as per the default configuration, it gets the highest "priority" for being killed by the handler (btw, it was a DaemonSet pod).
The memory usage of the pod was quite low, so killing it again and again did not really help.

In the end, the kernel OOM killer did its job and the node came back on its own.

I am wondering though: is the Talos OOM handler supposed to do this?

I would have expected it to have some kind of protection for important pods such as cilium (but I am not sure how the handler could actually know about this).
Also, is it expected that it tries to kill the same pod again and again, even considering priority?

We feel like we have not properly understood the intent of this OOM handler. Or maybe we have not configured it properly? TBH I am not sure how to tell the OOM handler not to even try to kill cilium, for example.

While we see the value it brings, we are also considering the "chaos" that goes with it, and are wondering whether we should disable the handler in the end, as others have already done.

Thank you for your guidance 🙏

smira (Maintainer) · 2026-05-19T09:07:07Z

Thanks for the report, we are always looking to improve the expression.

The OOM handler had a fix for the expression in one of the first 1.12.x releases, so I hope you're not running an outdated version.

The OOM handler doesn't see if the pod is Cilium or not, it relies on scheduling (QoS) classes, so the only group (in the default setup) which doesn't get killed is Guaranteed. So the way to protect the pod is to add a proper resource specification to it.
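As a rough illustration of what a "proper resource specification" means here, the sketch below follows the documented Kubernetes rules for deriving a pod's QoS class from its containers' CPU and memory requests and limits. The `qos_class` helper and the dict layout are hypothetical, and quantities are compared as plain strings (so `"500m"` and `"0.5"` would not be treated as equal), which the real kubelet does not do.

```python
def qos_class(containers: list) -> str:
    """Approximate the Kubernetes QoS class of a pod.

    `containers` is a list of dicts shaped like
    {"requests": {"cpu": ..., "memory": ...}, "limits": {...}};
    both keys are optional (hypothetical layout for this sketch).
    """
    any_set = False      # does any container set a cpu/memory request or limit?
    guaranteed = True    # do all containers have requests == limits for both?
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # Kubernetes defaults a missing request to the limit, if one is set.
            request = requests.get(resource, limit)
            if request is not None or limit is not None:
                any_set = True
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


# No resources at all: BestEffort
print(qos_class([{}]))
# Memory request only, no limits: Burstable
print(qos_class([{"requests": {"memory": "256Mi"}}]))
# Requests equal to limits for both CPU and memory: Guaranteed
print(qos_class([{
    "requests": {"cpu": "100m", "memory": "256Mi"},
    "limits": {"cpu": "100m", "memory": "256Mi"},
}]))
```

Under these rules, a pod with requests but no limits always ends up Burstable, so getting into the Guaranteed group requires both a CPU limit and a memory limit on every container.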

You can see the default expressions for triggering OOM handler and ranking here: https://docs.siderolabs.com/talos/v1.13/reference/configuration/runtime/oomconfig

Every "kill" is recorded in an OOMAction resource, so you can see why something was killed at that moment.


WinterNis (Author) · 2026-05-19T14:14:18Z

Thanks for the insight, it’s really helpful, as always.

> The OOM handler had a fix for the expression in one of the first 1.12.x releases, so I hope you're not running an outdated version.

I suppose you are talking about 7ddb37b1f, which was shipped in 1.12.2. The nodes in our example were running 1.12.3, so we did have this fix.
But we did not have any particular issue with the trigger expression. The part bothering us is actually the cgroup ranking mechanism, which has not changed recently AFAIK.

> The OOM handler doesn't see if the pod is Cilium or not, it relies on scheduling (QoS) classes, so the only group (in the default setup) which doesn't get killed is Guaranteed. So the way to protect the pod is to add a proper resource specification to it.

AFAIK, while it’s really important to closely monitor its resource usage, it’s not really recommended to set a hard memory limit on core pods like cilium, and even less recommended to set a CPU limit (throttling issues). So giving it the Guaranteed QoS class is not really an option?
Btw, I thought that a score of 0.0 did not prevent the OOM handler from attempting to kill a pod. It just made the pod a lower priority for being killed compared with other pods with a score > 0.0.
Did I get that wrong?
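To make the question concrete, here is a toy model of the two readings of a 0.0 score. It is purely illustrative: `pick_victim`, `skip_zero` and the cgroup names are hypothetical, and it makes no claim about which behaviour the Talos OOM controller actually implements.

```python
from typing import Dict, Optional


def pick_victim(scores: Dict[str, float], skip_zero: bool) -> Optional[str]:
    """Return the cgroup with the highest score, or None if nothing is eligible.

    skip_zero=True  models "a score of 0.0 means never kill".
    skip_zero=False models "a score of 0.0 only means lowest priority".
    """
    candidates = {
        cgroup: score
        for cgroup, score in scores.items()
        if score > 0.0 or not skip_zero
    }
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


# Hypothetical node where only zero-scored cgroups are left.
only_zero = {"kubepods/podA": 0.0, "kubepods/podB": 0.0}
print(pick_victim(only_zero, skip_zero=True))
print(pick_victim(only_zero, skip_zero=False))
```

The two readings only differ once every remaining cgroup scores 0.0: in the first one nothing gets killed, in the second one a zero-scored pod is still picked.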

> Every "kill" is recorded into OOMAction resource, so you can see why something was killed at that moment.

Thanks, I did not know about that. It’s a bit hard to correlate these with actual OOMs though, since the OOMAction does not contain the cgroup it attempted to kill, nor the QoS class (edited: actually it does include the process name, my bad), nor details such as memory_current.
I do not know if that’s actually possible/scalable, but having a dump of all pods with their current score would help debug such issues (maybe not in the OOMAction, but via a talos command, for example).
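In the meantime, part of that picture (memory usage per pod cgroup, not the score) can be read straight from the cgroup v2 files. The sketch below assumes the `/sys/fs/cgroup/kubepods/<qos>/pod<uid>` layout seen in the logs above and the standard cgroup v2 `memory.current` file; the `pod_memory_usage` helper is hypothetical and would have to run somewhere the host cgroup tree is readable, for example a privileged debug pod.

```python
from pathlib import Path


def pod_memory_usage(kubepods_root: str) -> list:
    """Return (pod cgroup, memory.current in bytes) pairs, largest first.

    Only directories whose name starts with "pod" are considered, so the
    per-container cgroups nested below each pod are skipped.
    """
    rows = []
    for current in Path(kubepods_root).glob("**/pod*/memory.current"):
        pod_dir = current.parent
        usage = int(current.read_text().strip())
        rows.append((str(pod_dir.relative_to(kubepods_root)), usage))
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Calling `pod_memory_usage("/sys/fs/cgroup/kubepods")` and printing the rows gives a quick "who is actually using the memory" view to compare against what the handler chose to kill.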

EDIT: my bad, we can actually see the processes that get killed in the OOMAction; that was a misread on my part.


smira (Maintainer) · 2026-06-03T16:58:02Z

See also #13330 - this change should help.

