Skip to content

Enhance pod.containerstatuses metric to include lastTerminationState of failed pods #4128

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 29, 2025

Conversation

yadneshk
Copy link
Collaborator

@yadneshk yadneshk commented Mar 3, 2025

Which issue this PR addresses:

Fixes ARO-7793

What this PR does / why we need it:

Add the reason for last termination state of pods stuck in waiting state as a dimension in pod.containerstatuses metric. The value of this field could further be used to segregate pod failures such as OOMKilled.

Test plan for issue:

  • Unit tests
  • CI/CD
  • Canary

Is there any documentation that needs to be updated for this PR?

How do you know this will function as expected in production?

@LiniSusan
Copy link
Collaborator

LGTM,
@yadneshk : one question: Could you please confirm whether the hourlyRun check is required here or not?

@yadneshk yadneshk force-pushed the yadneshk/ARO-7793-oomkilled-pods branch from 2f4b4ce to c912e13 Compare March 5, 2025 12:30
@yadneshk yadneshk requested a review from wanghaoran1988 as a code owner March 5, 2025 12:30
@yadneshk yadneshk force-pushed the yadneshk/ARO-7793-oomkilled-pods branch from c912e13 to 94755ce Compare March 5, 2025 13:03
@yadneshk yadneshk force-pushed the yadneshk/ARO-7793-oomkilled-pods branch 2 times, most recently from e849fd3 to 4955cb8 Compare March 18, 2025 11:07
kimorris27
kimorris27 previously approved these changes Mar 18, 2025
Copy link
Contributor

@kimorris27 kimorris27 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but I think it would benefit for the PR to have an additional approval from someone more knowledgeable about the monitor before merging.

@yadneshk yadneshk changed the title Add monitor to track pods terminated due to OOM Enhance pod.containerstatuses metric to include lastTerminationState of failed pods Mar 18, 2025
@kimorris27 kimorris27 dismissed their stale review March 18, 2025 15:40

Withdrawing my approval to wait for canary test results.

pepedocs
pepedocs previously approved these changes Mar 24, 2025
Copy link
Collaborator

@pepedocs pepedocs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me @yadneshk .

kimorris27
kimorris27 previously approved these changes Mar 25, 2025
@yadneshk yadneshk dismissed stale reviews from kimorris27 and pepedocs via 7408707 May 21, 2025 12:00
@yadneshk yadneshk force-pushed the yadneshk/ARO-7793-oomkilled-pods branch 2 times, most recently from 7408707 to dc66c97 Compare May 21, 2025 12:51
kimorris27
kimorris27 previously approved these changes May 21, 2025
Capture and include LastTerminationState in pod container status metrics.
This includes pods in the Waiting state (not ready), indicating that their
last termination state must have contributed to their current condition.
@yadneshk yadneshk force-pushed the yadneshk/ARO-7793-oomkilled-pods branch from dc66c97 to 6b91046 Compare May 21, 2025 13:31
@yadneshk yadneshk requested review from bizz001 and pepedocs May 21, 2025 14:48
Copy link
Collaborator

@edisonLcardenas edisonLcardenas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good but I suggest to include a link of the TSG/SOP (e.g. Azure RedHat OpenShift Team Doc ) for this monitor.

@kimorris27 kimorris27 merged commit bca598e into master May 29, 2025
22 checks passed
@kimorris27 kimorris27 deleted the yadneshk/ARO-7793-oomkilled-pods branch May 29, 2025 21:01
mhsreddy pushed a commit that referenced this pull request Jun 2, 2025
Capture and include LastTerminationState in pod container status metrics.
This includes pods in the Waiting state (not ready), indicating that their
last termination state must have contributed to their current condition.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-for-review size-small Size small skippy pull requests raised by member of Team Skippy
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants