containerOOMEventsDelta not capturing OOMKill on container exit #858

Open
@danielgblanco

Description

We're trying to create dashboards and alerts that capture transient states of Kubernetes containers. In particular, we're interested in tracking the Error and OOMKilled termination states. AFAICT the New Relic integration is not always able to capture OOMKills correctly when the container restarts (compared to kube_pod_container_status_last_terminated_reason). By the time it scrapes the Kubelet, the container has already been restarted. Even though the status changed to Terminated with reason OOMKilled at some point between scrapes, that is no longer the current state, so it never gets reported.
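To illustrate the gap, here is a minimal simulation of a poller that only ever sees the current state. The timeline, the 15s scrape interval and all names (`StateChange`, `state_at`, `scrape`) are made up for illustration; this is not the integration's code.

```python
from dataclasses import dataclass


@dataclass
class StateChange:
    """A container state transition at a given time (seconds)."""
    at: float
    state: str        # "Running", "Terminated", ...
    reason: str = ""  # e.g. "OOMKilled" when Terminated


def state_at(timeline, t):
    """Return the most recent StateChange at or before time t.

    Assumes the timeline is sorted by time and starts at or before t.
    """
    current = timeline[0]
    for change in timeline:
        if change.at <= t:
            current = change
    return current


def scrape(timeline, scrape_times):
    """Sample only the current state at each scrape, as a poller would."""
    samples = []
    for t in scrape_times:
        change = state_at(timeline, t)
        samples.append((t, change.state, change.reason))
    return samples


# Hypothetical timeline: the container is OOM killed at t=17s and is
# running again at t=19s, so it spends 2s in the Terminated state.
timeline = [
    StateChange(0, "Running"),
    StateChange(17, "Terminated", "OOMKilled"),
    StateChange(19, "Running"),
]

# Hypothetical 15s scrape interval: no scrape lands between 17s and 19s.
samples = scrape(timeline, [0, 15, 30, 45])
observed_oom = any(reason == "OOMKilled" for _, _, reason in samples)
print(samples)
print("OOMKilled observed:", observed_oom)
```

In this made-up timeline every scrape sees Running, so the OOMKilled reason is never observed; a scrape at t=18s would have caught it.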

My hope with the new containerOOMEventsDelta attribute was that the NRI integration would be able to capture those states and return the number of times containers had been OOM killed between scrapes. What I'm seeing instead is the following:

  1. Main container process is OOM Killed
  2. If the NRI integration manages to scrape the Kubelet while the container is in Terminated state, it produces a K8sContainerSample with state = 'Terminated' and reason = 'OOMKilled'. If the NRI integration does not catch the container in Terminated state, that information is lost.
  3. containerOOMEventsDelta remains at 0

I should mention that containerOOMEventsDelta works as expected when the process that gets killed is a child process rather than the main container process. This is a great addition, and something we'd been waiting for (as mentioned in https://www.netice9.com/blog/guide-to-oomkill-alerting-in-kubernetes-clusters OOM kills in child processes can sometimes go unnoticed). I just hoped that containerOOMEventsDelta would also include kills of the main container process.

Expected Behavior

  1. Main container process is OOM Killed
  2. If the NRI integration manages to scrape the Kubelet while the container is in Terminated state, it produces a K8sContainerSample with state = 'Terminated' and reason = 'OOMKilled'. If the NRI integration does not catch the container in Terminated state, that information is lost.
  3. containerOOMEventsDelta is reported as 1
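For reference, a rough sketch of how a main-container OOM kill could be recovered after the restart, assuming two consecutive container status objects shaped like the Kubernetes API `containerStatuses` entries (`restartCount`, `lastState.terminated.reason`). The function name `oom_kills_between` is hypothetical, and I'm not claiming this is how the integration computes containerOOMEventsDelta.

```python
def oom_kills_between(prev_status, curr_status):
    """Return 1 if the container restarted since the previous scrape and its
    last recorded termination reason is OOMKilled, otherwise 0.

    Limitation: lastState only holds the most recent termination, so if the
    container restarted several times between scrapes, earlier reasons are
    not visible and the result is a lower bound.
    """
    restarts = curr_status.get("restartCount", 0) - prev_status.get("restartCount", 0)
    if restarts <= 0:
        return 0
    terminated = curr_status.get("lastState", {}).get("terminated") or {}
    return 1 if terminated.get("reason") == "OOMKilled" else 0


# Hypothetical statuses from two consecutive scrapes. The container is
# Running in both, but restartCount went up and lastState records the kill.
previous = {"restartCount": 2, "state": {"running": {}}, "lastState": {}}
current = {
    "restartCount": 3,
    "state": {"running": {}},
    "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
}

print(oom_kills_between(previous, current))
```

This is the same information kube_pod_container_status_last_terminated_reason relies on, which is why that metric still shows the kill after the restart.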

Steps to Reproduce

  1. Saturate memory on main container
  2. Wait for OOM kill
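One way to saturate memory is a small script run as the container's main process, in a container with a memory limit set. This is a sketch, not the exact workload I used; `saturate_memory` and its parameters are hypothetical names.

```python
import time


def saturate_memory(chunk_mib=64, max_chunks=None, pause_s=0.1):
    """Allocate memory in chunks and keep every chunk referenced.

    With max_chunks=None this loops until the process is killed, which is
    what triggers the OOM kill inside a memory-limited container.
    """
    chunks = []
    while max_chunks is None or len(chunks) < max_chunks:
        # Non-zero bytes are written so the pages are actually committed.
        chunks.append(b"\x01" * (chunk_mib * 1024 * 1024))
        time.sleep(pause_s)
    return chunks


# Bounded demo so the example terminates; inside the container, call
# saturate_memory() with no arguments instead.
held = saturate_memory(chunk_mib=1, max_chunks=4, pause_s=0)
print(f"holding {sum(len(c) for c in held) // (1024 * 1024)} MiB")
```

The script has to be the container's entrypoint (not a child of a shell or supervisor), otherwise the kill hits a child process and containerOOMEventsDelta already reports it.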

Your Environment

Kubernetes 1.24
nri-kubernetes v3.15.1

Labels: bug, triage/accepted