EKS: Cluster Deletion Fails #32395

Open
@hakenmt

Describe the bug

A CloudFormation stack containing an EKS cluster failed and attempted to roll back. The OnEventHandler custom resource that is responsible for handling cluster deletion failed to delete the cluster because of a permissions error. From the CloudWatch logs:

2024-12-05T07:12:25.344Z	df9e8909-0564-4c9e-9529-5546180edcec	ERROR	{
  clientName: 'EKSClient',
  commandName: 'DeleteClusterCommand',
  input: {
    name: 'multi-az-workshop-EKSNestedStackEKSNestedStackResourceAE427C53-2PG5GFQGAIDA-ClusterEKSClusterEAC9DE5C-N7TGM3G9041D'
  },
  error: AccessDeniedException: User: arn:aws:sts::123456789012:assumed-role/multi-az-workshop-EKSNest-ClusterEKSClusterCreation-2nnnu7xJhiuj/AWSCDK.EKSCluster.Delete.9e794145-daf9-44f3-88f2-d0cd7c694239 is not authorized to perform: eks:DeleteCluster on resource: arn:aws:eks:us-west-2:123456789012:cluster/multi-az-workshop-EKSNestedStackEKSNestedStackResourceAE427C53-2PG5GFQGAIDA-ClusterEKSClusterEAC9DE5C-N7TGM3G9041D
      at de_AccessDeniedExceptionRes (/var/runtime/node_modules/@aws-sdk/client-eks/dist-cjs/index.js:2546:21)
      at de_CommandError (/var/runtime/node_modules/@aws-sdk/client-eks/dist-cjs/index.js:2519:19)
      at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
      at async /var/runtime/node_modules/@aws-sdk/node_modules/@smithy/middleware-serde/dist-cjs/index.js:35:20
      at async /var/runtime/node_modules/@aws-sdk/node_modules/@smithy/core/dist-cjs/index.js:165:18
      at async /var/runtime/node_modules/@aws-sdk/node_modules/@smithy/middleware-retry/dist-cjs/index.js:320:38
      at async /var/runtime/node_modules/@aws-sdk/middleware-logger/dist-cjs/index.js:34:22
      at async Xi.onDelete (/var/task/index.js:57:649490) {
    '$fault': 'client',
    '$metadata': {
      httpStatusCode: 403,
      requestId: '68a18a1d-2119-42f0-aeae-9253766fdf5a',
      extendedRequestId: undefined,
      cfId: undefined,
      attempts: 1,
      totalRetryDelay: 0
    }
  },
  metadata: {
    httpStatusCode: 403,
    requestId: '68a18a1d-2119-42f0-aeae-9253766fdf5a',
    extendedRequestId: undefined,
    cfId: undefined,
    attempts: 1,
    totalRetryDelay: 0
  }
}

However, this doesn't happen every time, and I am wondering whether there is a hidden race condition during stack rollback in which the permissions policy gets deleted before the function and its assigned role are deleted.

Regression Issue

  • Select this option if this issue appears to be a regression.

Last Known Working CDK Version

No response

Expected Behavior

I would expect the automatically created custom resource and IAM role to have the appropriate permissions.

Current Behavior

Cluster deletion intermittently fails with a 403 error (AccessDeniedException on eks:DeleteCluster).

Reproduction Steps

I don't have specific reproduction steps because the failure is intermittent. This is essentially the cluster resource definition:

Cluster cluster = new Cluster(this, "EKSCluster", new ClusterProps()
{
    Vpc = props.Vpc,
    VpcSubnets = new SubnetSelection[] { new SubnetSelection() { SubnetType = SubnetType.PRIVATE_ISOLATED } },
    DefaultCapacity = 0,
    Version = KubernetesVersion.V1_31,
    PlaceClusterHandlerInVpc = false,
    EndpointAccess = EndpointAccess.PUBLIC_AND_PRIVATE,
    KubectlLayer = kubetctlLayer,
    SecurityGroup = controlPlaneSG,
    MastersRole = props.AdminRole,
    ClusterName = props.ClusterName,
    ClusterLogging = new ClusterLoggingTypes[]
    {
        ClusterLoggingTypes.CONTROLLER_MANAGER,
        ClusterLoggingTypes.AUTHENTICATOR,
        ClusterLoggingTypes.API,
        ClusterLoggingTypes.AUDIT,
        ClusterLoggingTypes.SCHEDULER
    }
});

Possible Solution

No response

Additional Information/Context

No response

CDK CLI Version

2.164.1

Framework Version

No response

Node.js Version

20

OS

darwin

Language

.NET

Language Version

No response

Other information

No response

Labels

@aws-cdk/aws-eks: Related to Amazon Elastic Kubernetes Service
bug: This issue is a bug.
effort/large: Large work item – several weeks of effort
p2
