Skip to content

fix: Add ClusterCIDRReady condition #7965

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

AlexeyPetroff
Copy link

@AlexeyPetroff AlexeyPetroff commented Apr 2, 2025

Fixes #7875

Summary

This PR addresses an issue where EC2NodeClass resources using AL2023 AMI family would become stuck in a NotReady state after temporary EKS API unavailability. The problem occurs when attempting to resolve Cluster CIDR during reconciliation - if the API call fails, the resource remains in a NotReady state indefinitely, even after the API becomes available again.

Changes

  • Added a new condition type ConditionTypeClusterCIDRReady to track the status of Cluster CIDR resolution separately
  • Modified the Readiness controller to manage this specific condition instead of directly setting the general Ready condition
  • Added the new condition to the list of required conditions in the validation controller

Tested by:

  1. Creating an EC2NodeClass with AL2023 AMI family
  2. Temporarily revoking EKS DescribeCluster permissions to simulate API unavailability
  3. Verifying the EC2NodeClass transitions to NotReady state
  4. Restoring permissions
  5. Verifying the EC2NodeClass correctly transitions back to Ready state during the next reconciliation

Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: #
  • No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@AlexeyPetroff AlexeyPetroff requested a review from a team as a code owner April 2, 2025 03:30
@AlexeyPetroff AlexeyPetroff requested a review from jmdeal April 2, 2025 03:30
Copy link

netlify bot commented Apr 2, 2025

Deploy Preview for karpenter-docs-prod ready!

Name Link
🔨 Latest commit f91373c
🔍 Latest deploy log https://app.netlify.com/sites/karpenter-docs-prod/deploys/67ecc399b669720008285f59
😎 Deploy Preview https://deploy-preview-7965--karpenter-docs-prod.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@@ -43,9 +41,10 @@ func (n Readiness) Reconcile(ctx context.Context, nodeClass *v1.EC2NodeClass) (r
// will not be done at startup but instead in a reconcile loop.
if nodeClass.AMIFamily() == v1.AMIFamilyAL2023 {
if err := n.launchTemplateProvider.ResolveClusterCIDR(ctx); err != nil {
nodeClass.StatusConditions().SetFalse(status.ConditionReady, "NodeClassNotReady", "Failed to detect the cluster CIDR")
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm suggesting we create a specific condition for Cluster CIDR readiness rather than directly marking the whole NodeClass as NotReady. This way, when the EKS API becomes available again, the validation controller can see that this specific dependency is healthy and allow the NodeClass to recover properly

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm more inclined to just scope this into the validation controller and drop this sub-reconciler entirely @AlexeyPetroff What do you think about just doing an additional validation?

Copy link
Author

@AlexeyPetroff AlexeyPetroff Apr 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I understand, the decision to do the Cluster CIDR resolution in the reconcile loop was made to support non-EKS clusters, where the CIDR may not be available at the time of creation

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, that's correct -- but validation can still be special cased based on the AMIFamily -- I think we should perform the same check, but I'm just suggesting that we remove this reconciler entirely and capture this in validation -- I'd prefer not to create a separate status condition for this

@jonathan-innis jonathan-innis self-assigned this Apr 4, 2025
@rschalo rschalo self-assigned this May 5, 2025
@jonathan-innis jonathan-innis added lifecycle/stale triage/needs-information Marks that the issue still needs more information to properly triage labels May 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lifecycle/stale triage/needs-information Marks that the issue still needs more information to properly triage
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Karpenter doesn't update EC2NodeClass status after Failed to detect the cluster CIDR error
3 participants