EC2 Simplified Automatic Recovery conflicts with Karpenter's termination behavior #8821

@ellistarn

Description

EC2 Simplified Automatic Recovery is enabled by default for many instance types. When a system status check fails and recovery is triggered, EC2 can hold the instance (and therefore the node) for much longer than it would take Karpenter to simply replace it.

Karpenter's interruption controller handles spot interruptions and scheduled maintenance via EventBridge, but doesn't watch the system status checks that trigger EC2 Auto Recovery.

As a potential design option, we could:

  1. Always disable auto recovery in maintenanceOptions on Karpenter-managed launch templates
  2. Watch for status check failures via DescribeInstanceStatus and trigger node replacement
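To make the two options concrete, here is a minimal Python sketch (not Karpenter's actual Go implementation). It assumes the dictionary shapes used by the EC2 API: a `MaintenanceOptions.AutoRecovery` field in launch template data, and `InstanceStatuses[].SystemStatus.Status` in a `DescribeInstanceStatus` response. The helper names `disable_auto_recovery` and `impaired_instance_ids`, along with the sample IDs, are hypothetical and exist only for illustration.

```python
from typing import Any


def disable_auto_recovery(launch_template_data: dict[str, Any]) -> dict[str, Any]:
    """Option 1: return a copy of the launch template data with auto recovery disabled.

    Existing MaintenanceOptions keys are preserved; only AutoRecovery is overridden.
    """
    updated = dict(launch_template_data)
    updated["MaintenanceOptions"] = {
        **launch_template_data.get("MaintenanceOptions", {}),
        "AutoRecovery": "disabled",
    }
    return updated


def impaired_instance_ids(response: dict[str, Any]) -> list[str]:
    """Option 2: pick out instances whose system status check reports 'impaired'.

    `response` is assumed to have the shape of a DescribeInstanceStatus result.
    The caller would then trigger node replacement for the returned IDs.
    """
    return [
        status["InstanceId"]
        for status in response.get("InstanceStatuses", [])
        if status.get("SystemStatus", {}).get("Status") == "impaired"
    ]


# Hypothetical sample data, shaped like a DescribeInstanceStatus response.
sample_response = {
    "InstanceStatuses": [
        {"InstanceId": "i-0aaa", "SystemStatus": {"Status": "impaired"}},
        {"InstanceId": "i-0bbb", "SystemStatus": {"Status": "ok"}},
        {"InstanceId": "i-0ccc", "SystemStatus": {"Status": "initializing"}},
    ]
}

if __name__ == "__main__":
    print(disable_auto_recovery({"ImageId": "ami-123"}))
    print(impaired_instance_ids(sample_response))
```

In a real controller the response would come from the EC2 API, for example boto3's `describe_instance_status(InstanceIds=[...], IncludeAllInstances=True)`, and polling would need rate limiting. How long an instance should stay impaired before it is replaced is a design decision this sketch leaves open.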

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Metadata

Assignees

No one assigned

Labels

bug (Something isn't working), needs-triage (Issues that need to be triaged)
