-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Open
Labels
bugSomething isn't workingSomething isn't workingneeds-triageIssues that need to be triagedIssues that need to be triaged
Description
Description
EC2 Simplified Automatic Recovery is enabled by default for many instance types. When triggered, it can hold a node for much longer than it would take for Karpenter to simply replace it.
Karpenter's interruption controller handles spot interruptions and scheduled maintenance via EventBridge, but doesn't watch the system status checks that trigger EC2 Auto Recovery.
As a potential design option, we could:
- Always disable auto recovery in maintenanceOptions on Karpenter-managed launch templates
- Watch for status check failures via DescribeInstanceStatus and trigger node replacement
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
youwalther65, thebearmayor and cnmcavoy
Metadata
Metadata
Assignees
Labels
bugSomething isn't workingSomething isn't workingneeds-triageIssues that need to be triagedIssues that need to be triaged