OOM Detection #117

## Problem/Opportunity Statement

We will eventually enable memory limits for CI jobs, but there is currently no way to detect OOM kills in k8s/Prometheus in our environment.

For example, I set `KUBERNETES_MEMORY_LIMIT=1500M` for [this job](https://gitlab.spack.io/spack/spack/-/jobs/12730664), which was killed shortly after starting. Neither the log nor the exit code gives a reason for the failure. See [this OpenSearch query](https://search-spack-skhqjs43rbvm5fif4pwflh64kq.us-east-1.es.amazonaws.com/_dashboards/app/discover#/doc/59594930-ae4a-11ed-a0b4-4f356475697b/kube-events-2024-10-01?id=zJ9KSZIBIDOFV7D_lqhL).

The `kube_pod_container_status_last_terminated_exitcode` metric is supposed to indicate an OOM kill for a job, but it does not do so in our environment.
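As a reminder of what that exit code would look like if it were reported: a process killed by a signal conventionally exits with `128 + signal number`, so an OOM kill (SIGKILL, signal 9 on Linux) shows up as 137. The sketch below is only an illustration of that convention; `describe_exit_code` is a hypothetical helper, and 137 on its own means "SIGKILL from anyone", not necessarily the OOM killer.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a human-readable description.

    Hypothetical helper: assumes the common shell convention that
    exit codes above 128 mean "terminated by signal (code - 128)".
    """
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"killed by {name}"
    if code == 0:
        return "exited normally"
    return f"exited with error {code}"


if __name__ == "__main__":
    # 137 = 128 + 9 (SIGKILL), the code an OOM-killed process would report.
    print(describe_exit_code(137))
```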

Relevant issues:
- https://github.com/google/cadvisor/issues/3015
- https://github.com/kubernetes/node-problem-detector/issues/766

I came across a [blog post](https://engineering.outschool.com/posts/gitlab-runner-on-kubernetes/#out-of-memory-detection) that describes the same issue, and I've been corresponding with the author (@jimmy-outschool).

According to him, k8s only reports an OOM kill when the container's primary process (PID 1) is the one killed. In our case the victim is a non-PID 1 process launched by the GitLab runner, so the kill goes unreported.

## What would success / a fix look like?

His [solution](https://gitlab.com/outschool-eng/gitlab-runner/-/commit/65d5c4d468ffdbde0ceeafd9168d1326bae8e708) involves a small patch to GitLab Runner, which looks for OOM events in the kernel message buffer and outputs the correct exit code to the log. He has tried to upstream it, to no avail.
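To make the idea concrete, here is a minimal Python sketch of scanning kernel log text for OOM-killer messages. It is not the code from the patch above; `find_oom_kills` and the sample log lines are made up for illustration, and the regex assumes the "out of memory: Killed process PID (name)" wording used by recent Linux kernels (older kernels word this differently).

```python
from __future__ import annotations

import re

# Assumed message shapes (recent kernels):
#   "Memory cgroup out of memory: Killed process 4321 (cc1plus) total-vm:..."
#   "Out of memory: Killed process 4321 (cc1plus) total-vm:..."
OOM_LINE = re.compile(
    r"[Oo]ut of memory: Killed process (?P<pid>\d+) \((?P<comm>[^)]*)\)"
)


def find_oom_kills(kernel_log: str) -> list[tuple[int, str]]:
    """Return (pid, command name) for every OOM kill found in the log text."""
    return [(int(m["pid"]), m["comm"]) for m in OOM_LINE.finditer(kernel_log)]


# Fabricated sample resembling `dmesg` output, for illustration only.
SAMPLE_LOG = """\
[12345.678901] cc1plus invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
[12345.679100] Memory cgroup out of memory: Killed process 4321 (cc1plus) total-vm:2097152kB, anon-rss:1464320kB, file-rss:0kB, shmem-rss:0kB
"""

if __name__ == "__main__":
    for pid, comm in find_oom_kills(SAMPLE_LOG):
        print(f"OOM kill: pid={pid} command={comm}")
```

In a real setup the text would come from the kernel message buffer (for example the output of `dmesg`), which typically requires elevated privileges inside a container.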

While we may face headwinds when pushing to deploy a custom version of gitlab runners, the alternative solutions are not great:

1. Using memory usage metrics, we could check whether the last reported value is within 90% of the limit to infer that the job was OOM killed. However, the spikes are so large that I've seen the last reported figure as low as 70% of the limit before an OOM kill.
2. Recent kernel versions support cgroups v2, which [detects if any non-main processes were OOM killed](https://itnext.io/kubernetes-silent-pod-killer-104e7c8054d9) and reports those statuses. However, many of our [runner containers](https://github.com/spack/gitlab-runners/tree/main/Dockerfiles) use OS versions outside of the [support matrix](https://kubernetes.io/docs/concepts/architecture/cgroups/#requirements) for this feature.
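Both alternatives can be sketched in a few lines of Python. The function names are hypothetical, and the numbers are illustrative rather than measured. `near_limit` is the threshold heuristic from option 1, and `oom_kill_count` reads the `oom_kill` counter from the text of a cgroup v2 `memory.events` file (usually found under `/sys/fs/cgroup/`, though the exact path depends on the container setup). This is not necessarily how k8s itself surfaces the status.

```python
def near_limit(last_usage_bytes: int, limit_bytes: int, threshold: float = 0.9) -> bool:
    """Option 1: guess an OOM kill if the last usage sample is close to the limit."""
    return last_usage_bytes >= threshold * limit_bytes


def oom_kill_count(memory_events_text: str) -> int:
    """Option 2: read the `oom_kill` counter from cgroup v2 `memory.events` text.

    The file holds one "key value" pair per line, e.g. "oom_kill 1".
    """
    for line in memory_events_text.splitlines():
        key, _, value = line.partition(" ")
        if key == "oom_kill":
            return int(value)
    return 0


if __name__ == "__main__":
    limit = 1500 * 10**6  # 1500M, as in KUBERNETES_MEMORY_LIMIT=1500M

    # A last sample at 95% of the limit is flagged ...
    print(near_limit(1_425_000_000, limit))  # True
    # ... but a job whose last sample was at 70% before the kill is missed.
    print(near_limit(1_050_000_000, limit))  # False

    # Illustrative memory.events content, not taken from a real job.
    sample_events = "low 0\nhigh 0\nmax 12\noom 1\noom_kill 1\n"
    print(oom_kill_count(sample_events))  # 1
```

The second call shows why the threshold approach produces false negatives: the usage spike that triggers the kill can happen between samples.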

