Treat kubelet NodeAffinity status.reason as retryable system error#6461

Merged
mhotan merged 1 commit into master from mike/nodeaffinity-handling on May 22, 2025

@mhotan mhotan commented May 22, 2025

Contributor

Why are the changes needed?

We identified a fairly rare edge case on certain cloud providers (e.g., GKE) where the kubelet rejects a pod for not meeting its NodeAffinity requirements before the underlying controller manager has applied the appropriate labels to the node.

We would see workflows failing with:

[...]: [...] currentAttempt done. Last Error: USER::Pod was rejected: Predicate NodeAffinity failed: node(s) didn't match Pod's node affinity/selector

What changes were proposed in this pull request?

Treat Pod failures with status.reason of NodeAffinity as a retryable system error.
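
As a rough illustration of the idea (not the actual flyteplugins code), the classification can be sketched as a lookup on status.reason. The `podStatus` type, the `errorKind` helper, and the "SYSTEM"/"USER" labels below are hypothetical names chosen for this sketch; the three reason strings are the ones visible in this PR's diff excerpt, and the real list may contain more entries.

```go
package main

import "fmt"

// podStatus is a hypothetical, trimmed-down stand-in for the Kubernetes
// pod status; only the fields discussed in this PR are modeled.
type podStatus struct {
	Phase   string
	Reason  string
	Message string
}

// retryableSystemReasons holds status.reason values treated as retryable
// system errors. The entries mirror the diff excerpt in this PR; the
// container itself is an assumption of this sketch.
var retryableSystemReasons = map[string]struct{}{
	"Terminated":   {},
	"NodeShutdown": {},
	// kubelet admission rejects the pod before the node gets its labels.
	"NodeAffinity": {},
}

// isRetryableSystemReason reports whether a status.reason is in the list.
func isRetryableSystemReason(reason string) bool {
	_, ok := retryableSystemReasons[reason]
	return ok
}

// errorKind labels a failed pod as "SYSTEM" (retryable) or "USER".
// Falling back to "USER" is an assumption based on the error prefix
// quoted in the PR description.
func errorKind(s podStatus) string {
	if isRetryableSystemReason(s.Reason) {
		return "SYSTEM"
	}
	return "USER"
}

func main() {
	rejected := podStatus{
		Phase:   "Failed",
		Reason:  "NodeAffinity",
		Message: "Pod was rejected: Predicate NodeAffinity failed: node(s) didn't match Pod's node affinity/selector",
	}
	fmt.Println(errorKind(rejected))
}
```

With a lookup like this, a pod rejected with reason NodeAffinity would no longer surface under the USER:: prefix shown in the error above.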

How was this patch tested?

Unit tests.

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Summary by Bito

This pull request enhances the handling of pod failures in cloud environments by treating 'NodeAffinity' status as a retryable error, improving workflow reliability. It also includes new unit tests to verify this behavior.

Unit tests added: True

Estimated effort to review (1-5, lower is better): 2

@mhotan mhotan added the fixed For any bug fixes label May 22, 2025

codecov Bot commented May 22, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 58.48%. Comparing base (a75cea0) to head (840006a).
Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6461      +/-   ##
==========================================
- Coverage   58.48%   58.48%   -0.01%     
==========================================
  Files         940      940              
  Lines       71584    71584              
==========================================
- Hits        41867    41865       -2     
- Misses      26534    26536       +2     
  Partials     3183     3183              
Flag                       Coverage Δ
unittests-datacatalog      59.03% <ø> (ø)
unittests-flyteadmin       56.23% <ø> (-0.03%) ⬇️
unittests-flytecopilot     30.99% <ø> (ø)
unittests-flytectl         64.78% <ø> (+0.05%) ⬆️
unittests-flyteidl         76.12% <ø> (ø)
unittests-flyteplugins     60.95% <100.00%> (ø)
unittests-flytepropeller   54.78% <ø> (ø)
unittests-flytestdlib      64.04% <ø> (ø)

"Terminated",
"NodeShutdown",
// kubelet admission rejects the pod before the node gets assigned appropriate labels.
"NodeAffinity",

Member

Are we sure that the NodeAffinity reason is enough to determine whether this should be retried? The worst-case scenario is that we retry something that should not be retried (not all that bad), but I want to gauge confidence here.

Contributor Author

Yeah, this was my concern as well. I have more confidence because the pod is marked as Failed with the following status:

status: {
  message: "Pod was rejected: Predicate NodeAffinity failed: node(s) didn't match Pod's node affinity/selector"
  phase: "Failed"
  reason: "NodeAffinity"
  startTime: "2025-05-16T04:53:42Z"
}

Given how specific this status is, I am assuming a low likelihood of false positives that would cause us to retry needlessly.
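
If false positives ever did become a concern, one option would be to match on more of the status than the reason alone. The sketch below is a hypothetical stricter check, not what this PR implements: `podStatus` and `isKubeletNodeAffinityRejection` are names invented for illustration, and the message prefix is taken from the status quoted above. Matching on message text is brittle, since the kubelet wording could change between versions.

```go
package main

import (
	"fmt"
	"strings"
)

// podStatus is a hypothetical stand-in for the Kubernetes pod status,
// limited to the fields quoted in this thread.
type podStatus struct {
	Phase   string
	Reason  string
	Message string
}

// rejectionPrefix is the start of the kubelet message quoted in this thread.
const rejectionPrefix = "Pod was rejected: Predicate NodeAffinity failed"

// isKubeletNodeAffinityRejection is a hypothetical, narrower matcher that
// requires the phase, the reason, and the message prefix to all line up.
func isKubeletNodeAffinityRejection(s podStatus) bool {
	return s.Phase == "Failed" &&
		s.Reason == "NodeAffinity" &&
		strings.HasPrefix(s.Message, rejectionPrefix)
}

func main() {
	observed := podStatus{
		Phase:   "Failed",
		Reason:  "NodeAffinity",
		Message: "Pod was rejected: Predicate NodeAffinity failed: node(s) didn't match Pod's node affinity/selector",
	}
	fmt.Println(isKubeletNodeAffinityRejection(observed))
}
```

The trade-off is that a narrower match lowers the chance of retrying needlessly but could miss the rejection if the message format differs, which is one argument for keying on the reason alone.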

@mhotan mhotan force-pushed the mike/nodeaffinity-handling branch from 67c834c to 840006a Compare May 22, 2025 11:49
@mhotan mhotan merged commit 5f98022 into master May 22, 2025
49 checks passed
@mhotan mhotan deleted the mike/nodeaffinity-handling branch May 22, 2025 14:32