aws-eks: Cluster rollback fails all future deployments #31626

Open
@esun74

Description

Describe the bug

Cluster rollbacks can persistently break the CloudFormation stack.

The issue occurs when (1) a cluster re-creation is triggered and (2) the deployment rolls back after the new cluster has been created. When rolling back to the original cluster, parameters from the new (but now deleted) cluster are retained, and they fail all future deployments even if the commit that triggered the re-creation is reverted.

Essentially, while rolling back the stack, the custom resource also needs to roll back its cached cluster information to the original cluster's details.
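For context on where that cached information lives: the EKS Cluster construct is backed by a custom resource whose IsComplete handler describes the live cluster and returns attributes (endpoint, cluster security group, certificate authority data) that downstream resources such as the kubectl provider reference. The following is only a minimal sketch of that shape, not the actual aws-cdk handler source; the event field name and the AWS SDK v2 client usage are assumptions for illustration.

// Hedged sketch of the attribute-caching shape (not the real aws-cdk handler).
import { EKS } from 'aws-sdk'; // AWS SDK v2, assumed for illustration

const eks = new EKS();

// The custom resource framework polls isComplete until the cluster is ACTIVE,
// then records the returned Data as the resource's attributes. After a failed
// replacement rolls back, those attributes can still describe the deleted
// replacement cluster, which is the failure mode reported here.
export async function isComplete(event: { clusterName: string }) {
  const { cluster } = await eks
    .describeCluster({ name: event.clusterName })
    .promise();

  if (cluster?.status !== 'ACTIVE') {
    return { IsComplete: false };
  }

  return {
    IsComplete: true,
    Data: {
      Endpoint: cluster.endpoint,
      ClusterSecurityGroupId: cluster.resourcesVpcConfig?.clusterSecurityGroupId,
      CertificateAuthorityData: cluster.certificateAuthority?.data,
    },
  };
}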

Regression Issue

  • Select this option if this issue appears to be a regression.

Last Known Working CDK Version

No response

Expected Behavior

Rollbacks should leave the stack in a functional state; reverting or fixing the CDK code should allow new deployments to succeed.

Current Behavior

Rollbacks leave the stack in a non-functional state; even after the CDK code is reverted, deployments still fail.

Reproduction Steps

import { Stack, StackProps } from 'aws-cdk-lib';
import { KubectlV24Layer } from '@aws-cdk/lambda-layer-kubectl-v24';
import { Cluster, KubernetesVersion } from 'aws-cdk-lib/aws-eks';
import { Construct } from 'constructs';

// Toggle this to trigger the failure: renaming the cluster forces a
// replacement, and the invalid manifest makes the deployment fail and
// roll back after the new cluster has been created.
const addBreakingChange = false;

export class ReproStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const cluster = new Cluster(this, 'MyCluster', {
      kubectlLayer: new KubectlV24Layer(this, 'KubectlLayer'),
      version: KubernetesVersion.V1_24,
      defaultCapacity: 0,
      // Changing the cluster name triggers a cluster replacement.
      clusterName: addBreakingChange ? 'newcluster' : undefined,
    });

    if (addBreakingChange) {
      // This manifest references a namespace and app that do not exist,
      // so applying it fails and the deployment rolls back.
      cluster.addManifest('eks-sample-linux-service', {
        apiVersion: 'v1',
        kind: 'Service',
        metadata: {
          name: 'eks-sample-linux-service',
          namespace: 'eks-sample-app',
          labels: {
            app: 'non-existent-app',
          },
        },
      });
    }
  }
}

Deployment 1: Create the cluster (addBreakingChange = false).
Deployment 2: Change addBreakingChange to true; the deployment fails and rolls back.
Deployment 3: Revert addBreakingChange to false; deployments still fail.

Without any additional complications, the failure message is the following (the security group it references presumably belonged to the deleted replacement cluster):

Resource handler returned message: "Error occurred while DescribeSecurityGroups. EC2 Error Code: InvalidGroup.NotFound. EC2 Error Message: The security group 'sg-[...]' does not exist (Service: Lambda, Status Code: 400, Request ID: [...])" (RequestToken: [...], HandlerErrorCode: InvalidRequest)

Possible Solution

Workaround: I have found that adding a tag to the cluster triggers IsComplete to update the cached cluster parameters, e.g. the security group.
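As a rough illustration of the workaround, a tag could be added through the construct's tags property. This is a sketch under stated assumptions: the tag key and value are arbitrary placeholders, and it assumes the tag change propagates to the cluster custom resource and forces an update.

// Minimal sketch of the workaround (placeholder tag key/value): a tag change
// on the cluster should cause an update of the cluster custom resource,
// re-running IsComplete and refreshing the cached attributes.
const cluster = new Cluster(this, 'MyCluster', {
  kubectlLayer: new KubectlV24Layer(this, 'KubectlLayer'),
  version: KubernetesVersion.V1_24,
  defaultCapacity: 0,
  tags: {
    'force-attribute-refresh': '1', // bump the value to force another update
  },
});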

Additional Information/Context

Support Case ID: 172677804000994

CDK CLI Version

2.113.0 (build ccd534a)

Framework Version

No response

Node.js Version

18

OS

Amazon Linux 2 x86_64

Language

TypeScript

Language Version

TypeScript (5.0.4)

Other information

No response

Metadata

Labels

@aws-cdk/aws-eks (Related to Amazon Elastic Kubernetes Service) · bug (This issue is a bug.) · effort/medium (Medium work item – several days of effort) · p1
