-
Notifications
You must be signed in to change notification settings - Fork 28
Add cleanup of 'Failed' jobs #115
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
k8s/agent-scaler/agent-scaler.sh
Outdated
|
|
||
| # cleanup explicitly Failed jobs | ||
| # Filter by AGENT_NAME and job status condition "Failed"="True" | ||
| for job in $(kubectl get jobs -n "${namespace}" -o jsonpath='{range .items[?(@.status.conditions[*].type=="Failed" && @.status.conditions[*].status=="True")]}{.metadata.name}{"\\n"}{end}' 2>/dev/null | { grep -E "^${AGENT_NAME}-[0-9]+(-[0-9]+)?$" || true; }); do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will get a bit tricky in bash, but I think we should delay the deletion of those failed jobs by 1 day to give some time for debugging (I'm not sure if we store the logs of the failed jobs somewhere).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dang. I forgot about this issue. Cleaning up failed jobs would have prevented the runaway use of pods that I have been trying to 'fix' these last few days #124. If this PR had been in place, there would have been less old jobs claiming they were in need of servicing because the FAILED tasks would have been cleaned up. I pushed a change that will keep FAILED around for a day (good idea) but I need to deploy it to test it (I tested pieces of the change but...). Will do it after #124 goes in. Thanks @johscheuer
johscheuer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍 Thought the agent scaler gets a bit messy with all this bash :)
Yeah. Should probably replace it. |
johscheuer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
Add --ignore-not-found=true to delete. Cleans up some complaint when
two scripts running beside each other and one deletes first (happens
when testing changes to this script). Minor item.
Also, clean up 'Failed' jobs else they just hang out.
Here are example Failed jobs.