Add monitor to track pods terminated due to OOM #4128

Open: wants to merge 1 commit into master (base branch)

@yadneshk (Collaborator) commented Mar 3, 2025

Which issue this PR addresses:

Fixes https://issues.redhat.com/browse/ARO-7793

What this PR does / why we need it:

Pods killed by OOM often enter a CrashLoopBackOff state and keep restarting. While the restart counter may reflect these restarts, it doesn’t explicitly report when pods are being terminated due to OOM.

This change introduces a new monitor that emits the metric pod.oomkilled, specifically tracking pods terminated with the reason OOMKilled. The monitor identifies pods in a terminated state and reports OOM terminations, providing clearer visibility into OOM-related failures.
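The detection logic described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the struct types are simplified, hypothetical stand-ins for the `k8s.io/api/core/v1` pod types so the example runs on its own, and the helper name `oomKilledDimensions` is an assumption. It also assumes the monitor inspects the current terminated state of each container; whether the PR additionally checks the last termination state is not stated here. The dimension keys are taken from the diff in this PR.

```go
package main

import "fmt"

// Simplified stand-ins (assumption) for the k8s.io/api/core/v1 types,
// kept local so the sketch compiles without external modules.
type ContainerStateTerminated struct {
	Reason string
}

type ContainerState struct {
	Terminated *ContainerStateTerminated // nil when the container is not terminated
}

type ContainerStatus struct {
	Name  string
	State ContainerState
}

type PodSpec struct{ NodeName string }

type PodStatus struct{ ContainerStatuses []ContainerStatus }

type Pod struct {
	Name      string
	Namespace string
	Spec      PodSpec
	Status    PodStatus
}

// oomKilledDimensions (hypothetical helper) returns one set of metric
// dimensions per container whose current state is terminated with the
// reason "OOMKilled".
func oomKilledDimensions(pods []Pod) []map[string]string {
	var out []map[string]string
	for _, p := range pods {
		for _, cntrStatus := range p.Status.ContainerStatuses {
			t := cntrStatus.State.Terminated
			if t == nil || t.Reason != "OOMKilled" {
				continue
			}
			out = append(out, map[string]string{
				"name":          p.Name,
				"namespace":     p.Namespace,
				"nodeName":      p.Spec.NodeName,
				"containername": cntrStatus.Name,
			})
		}
	}
	return out
}

func main() {
	pods := []Pod{{
		Name:      "web-1",
		Namespace: "default",
		Spec:      PodSpec{NodeName: "node-a"},
		Status: PodStatus{ContainerStatuses: []ContainerStatus{
			{Name: "app", State: ContainerState{Terminated: &ContainerStateTerminated{Reason: "OOMKilled"}}},
			{Name: "sidecar"}, // still running, so it is skipped
		}},
	}}

	// A real monitor would hand these dimensions to its metrics emitter;
	// printing stands in for emitting pod.oomkilled with a value of 1.
	for _, dims := range oomKilledDimensions(pods) {
		fmt.Println("pod.oomkilled", 1, dims)
	}
}
```

Emitting one data point per container, rather than per pod, keeps the container name available as a dimension, which matters for multi-container pods where only one container exceeds its memory limit.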

Test plan for issue:

  • Unit tests
  • CI/CD

Is there any documentation that needs to be updated for this PR?

How do you know this will function as expected in production?

@LiniSusan (Collaborator)

LGTM.
@yadneshk, one question: could you please confirm whether the hourlyRun check is required here?

@yadneshk yadneshk force-pushed the yadneshk/ARO-7793-oomkilled-pods branch from 2f4b4ce to c912e13 Compare March 5, 2025 12:30
@yadneshk yadneshk requested a review from wanghaoran1988 as a code owner March 5, 2025 12:30
Introduce a new monitor that emits the metric "pod.oomkilled" to
report pods that were killed due to OOM. This monitor looks
for pods that are in the terminated state with the reason "OOMKilled".
@yadneshk yadneshk force-pushed the yadneshk/ARO-7793-oomkilled-pods branch from c912e13 to 94755ce Compare March 5, 2025 13:03
"name": p.Name,
"namespace": p.Namespace,
"nodeName": p.Spec.NodeName,
"containername": cntrStatus.Name,
Collaborator

Do you think it makes sense to add the pod name as a dimension?

Labels: size-small (Size small), skippy (pull requests raised by member of Team Skippy)
4 participants