Description
We're trying to create dashboards and alerts that capture transient states of Kubernetes containers. In particular, we're interested in tracking the Error and OOMKilled termination states. AFAICT the New Relic integration is not always able to capture OOM kills correctly when the container restarts (compared to kube_pod_container_status_last_terminated_reason): by the time it scrapes the Kubelet, the container has already been restarted. Even though the status changed to Terminated and the reason to OOMKilled at some point between scrapes, that is no longer the current state, so it never gets reported.
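To illustrate the timing problem, here is a minimal simulation. The timeline, the 15-second scrape interval, and the helper names state_at and scrape are all made up for the example; it models only the idea of sampling the current state, not the integration's actual code.

```python
from bisect import bisect_right

# Hypothetical timeline of (time_in_seconds, state, reason) transitions
# for a single container. The values are invented for illustration.
TRANSITIONS = [
    (0, "Running", None),
    (17, "Terminated", "OOMKilled"),  # main process is OOM killed
    (19, "Running", None),            # container has been restarted
]


def state_at(t, transitions=TRANSITIONS):
    """Return the (state, reason) pair in effect at time t."""
    times = [ts for ts, _, _ in transitions]
    idx = bisect_right(times, t) - 1
    _, state, reason = transitions[idx]
    return state, reason


def scrape(times, transitions=TRANSITIONS):
    """Sample only the *current* state at each scrape time."""
    return [state_at(t, transitions) for t in times]


# Assumed scrape interval of 15 s: scrapes happen at t = 0, 15, 30, 45, 60.
samples = scrape(range(0, 61, 15))
print(samples)
```

Every sample in this run reports Running, because the two-second Terminated window falls between the scrapes at t = 15 and t = 30. A scrape at t = 18 would have seen Terminated with reason OOMKilled.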
My hope with the new containerOOMEventsDelta attribute was that the NRI integration would be able to capture those states and return the number of times containers had been OOM killed between scrapes. What I'm seeing instead is the following:
- Main container process is OOM killed
- If the NRI integration manages to scrape the Kubelet while the container is in Terminated state, it produces a K8sContainerSample with state = 'Terminated' and reason = 'OOMKilled'. If the NRI integration does not catch the container in Terminated state, that information is lost.
- containerOOMEventsDelta remains at 0
I should mention that containerOOMEventsDelta works as expected when the process that gets killed is a child process rather than the main container process. This is a great addition, and something we'd been waiting for (as mentioned in https://www.netice9.com/blog/guide-to-oomkill-alerting-in-kubernetes-clusters, OOM kills in child processes can sometimes go unnoticed). I just hoped that containerOOMEventsDelta would also include kills of the main container process.
Expected Behavior
- Main container process is OOM killed
- If the NRI integration manages to scrape the Kubelet while the container is in Terminated state, it produces a K8sContainerSample with state = 'Terminated' and reason = 'OOMKilled'. If the NRI integration does not catch the container in Terminated state, that information is lost.
- containerOOMEventsDelta is reported as 1
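As a rough sketch of how such a count could be derived even when the Terminated state itself is missed: compare two consecutive snapshots of a container's status and look at restartCount together with lastState.terminated.reason, the fields Kubernetes reports under a pod's status.containerStatuses. This is not how nri-kubernetes computes containerOOMEventsDelta; the function name oom_kills_between and the snapshot dicts are hypothetical.

```python
def oom_kills_between(prev, curr):
    """Estimate main-process OOM kills between two status snapshots.

    Both arguments are dicts shaped like one entry of a pod's
    status.containerStatuses (only restartCount and lastState are read).
    """
    restarts = curr["restartCount"] - prev["restartCount"]
    if restarts <= 0:
        return 0  # no restart since the previous scrape

    last_terminated = curr.get("lastState", {}).get("terminated") or {}
    if last_terminated.get("reason") != "OOMKilled":
        return 0  # the most recent termination had another cause

    # Only the most recent termination reason is retained, so even with
    # several restarts we can be sure of one OOM kill only (a lower bound).
    return 1


# Hypothetical snapshots taken by two consecutive scrapes.
previous = {"restartCount": 3, "lastState": {}}
current = {
    "restartCount": 4,
    "state": {"running": {}},
    "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
}
print(oom_kills_between(previous, current))
```

The current state in the second snapshot is already Running, yet the restart counter and the last terminated reason still show that an OOM kill happened in between.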
Steps to Reproduce
- Saturate memory on main container
- Wait for OOM kill
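One possible way to set this up is sketched below: a small script that builds a Pod manifest whose main process (PID 1) allocates memory until it exceeds its limit. The pod name, the python:3.12-slim image, and the 64Mi limit are assumptions chosen for the example, not values from our environment.

```python
import json


def oom_repro_pod(name="oom-repro", memory_limit="64Mi"):
    """Build a Pod manifest whose main process grows until it is OOM killed."""
    # The interpreter started by `python -c` is the container's main process,
    # so the kill hits the main process rather than a child.
    hog = "chunks = []\nwhile True:\n    chunks.append(b'x' * 10_000_000)\n"
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Always",  # restart after each kill
            "containers": [
                {
                    "name": "hog",
                    "image": "python:3.12-slim",  # assumed image
                    "command": ["python", "-c", hog],
                    "resources": {"limits": {"memory": memory_limit}},
                }
            ],
        },
    }


manifest_json = json.dumps(oom_repro_pod(), indent=2)
print(manifest_json)
```

The printed JSON can be saved to a file and applied with kubectl apply -f, since kubectl accepts JSON manifests as well as YAML.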
Your Environment
Kubernetes 1.24
nri-kubernetes v3.15.1