Treat kubelet NodeAffinity status.reason as retryable system error#6461
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## master #6461 +/- ##
==========================================
- Coverage 58.48% 58.48% -0.01%
==========================================
Files 940 940
Lines 71584 71584
==========================================
- Hits 41867 41865 -2
- Misses 26534 26536 +2
Partials 3183 3183
"Terminated",
"NodeShutdown",
// kubelet admission rejects the pod before the node gets assigned appropriate labels.
"NodeAffinity",
Are we sure that the NodeAffinity reason is enough to determine whether this should be retried? The worst-case scenario is that we retry something that should not be retried (not all that bad), but I want to gauge confidence here.
Yeah, this was my concern as well. I have more confidence given that the pod is marked as Failed with the following:
status: {
message: "Pod was rejected: Predicate NodeAffinity failed: node(s) didn't match Pod's node affinity/selector"
phase: "Failed"
reason: "NodeAffinity"
startTime: "2025-05-16T04:53:42Z"
}
}
Given how specific this status is, I expect false positives that would cause us to retry needlessly to be rare.
Signed-off-by: Mike Hotan <mike@union.ai>
67c834c to 840006a
Why are the changes needed?
Identified a fairly rare edge case on certain cloud providers (e.g., GKE) where the kubelet rejects the pod for failing its NodeAffinity requirements before the underlying controller manager has applied the appropriate labels to the node.
We would see workflows failing with
What changes were proposed in this pull request?
Treat Pod failures with a status.reason of NodeAffinity as a retryable system error.
How was this patch tested?
Unit tests.
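For readers who want to see the shape of the behaviour being tested, here is a minimal sketch. It is not the code from this PR: `podStatus`, `retryableReasons`, and `isRetryableSystemError` are hypothetical names, the struct is a stand-in for the Kubernetes pod status type, and the `Failed` phase check is an assumption taken from the review discussion rather than from the diff. Only the three reason strings come from the diff excerpt.

```go
package main

import "fmt"

// podStatus is a hypothetical stand-in for the pod status fields quoted in
// the review thread (phase, reason, message).
type podStatus struct {
	Phase   string
	Reason  string
	Message string
}

// retryableReasons holds the status.reason values visible in the diff excerpt.
// Assumption: the real list may contain further entries not shown there.
var retryableReasons = map[string]bool{
	"Terminated":   true,
	"NodeShutdown": true,
	// kubelet admission rejects the pod before the node gets its labels.
	"NodeAffinity": true,
}

// isRetryableSystemError (hypothetical helper) reports whether a pod failure
// should be retried as a system error. Requiring phase "Failed" is an
// assumption of this sketch, not something the diff confirms.
func isRetryableSystemError(s podStatus) bool {
	if s.Phase != "Failed" {
		return false
	}
	return retryableReasons[s.Reason]
}

func main() {
	rejected := podStatus{
		Phase:   "Failed",
		Reason:  "NodeAffinity",
		Message: "Pod was rejected: Predicate NodeAffinity failed: node(s) didn't match Pod's node affinity/selector",
	}
	fmt.Println(isRetryableSystemError(rejected)) // true

	// "SomeOtherReason" is a made-up value used only to show the negative case.
	other := podStatus{Phase: "Failed", Reason: "SomeOtherReason"}
	fmt.Println(isRetryableSystemError(other)) // false
}
```

Matching on the exact reason string keeps the check narrow, which is the basis for the expectation in the review thread that false positives will be rare.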
Summary by Bito
This pull request enhances the handling of pod failures in cloud environments by treating a 'NodeAffinity' status reason as a retryable error, improving workflow reliability. It also includes new unit tests to verify this behavior.
Unit tests added: True
Estimated effort to review (1-5, lower is better): 2