Description
Describe the bug
A CloudFormation (CFN) stack containing an EKS cluster failed and attempted to roll back. The OnEventHandler custom resource that is responsible for handling cluster deletion failed to delete the cluster with a permissions error. From the CloudWatch logs:
2024-12-05T07:12:25.344Z df9e8909-0564-4c9e-9529-5546180edcec ERROR {
  clientName: 'EKSClient',
  commandName: 'DeleteClusterCommand',
  input: {
    name: 'multi-az-workshop-EKSNestedStackEKSNestedStackResourceAE427C53-2PG5GFQGAIDA-ClusterEKSClusterEAC9DE5C-N7TGM3G9041D'
  },
  error: AccessDeniedException: User: arn:aws:sts::123456789012:assumed-role/multi-az-workshop-EKSNest-ClusterEKSClusterCreation-2nnnu7xJhiuj/AWSCDK.EKSCluster.Delete.9e794145-daf9-44f3-88f2-d0cd7c694239 is not authorized to perform: eks:DeleteCluster on resource: arn:aws:eks:us-west-2:123456789012:cluster/multi-az-workshop-EKSNestedStackEKSNestedStackResourceAE427C53-2PG5GFQGAIDA-ClusterEKSClusterEAC9DE5C-N7TGM3G9041D
      at de_AccessDeniedExceptionRes (/var/runtime/node_modules/@aws-sdk/client-eks/dist-cjs/index.js:2546:21)
      at de_CommandError (/var/runtime/node_modules/@aws-sdk/client-eks/dist-cjs/index.js:2519:19)
      at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
      at async /var/runtime/node_modules/@aws-sdk/node_modules/@smithy/middleware-serde/dist-cjs/index.js:35:20
      at async /var/runtime/node_modules/@aws-sdk/node_modules/@smithy/core/dist-cjs/index.js:165:18
      at async /var/runtime/node_modules/@aws-sdk/node_modules/@smithy/middleware-retry/dist-cjs/index.js:320:38
      at async /var/runtime/node_modules/@aws-sdk/middleware-logger/dist-cjs/index.js:34:22
      at async Xi.onDelete (/var/task/index.js:57:649490) {
    '$fault': 'client',
    '$metadata': {
      httpStatusCode: 403,
      requestId: '68a18a1d-2119-42f0-aeae-9253766fdf5a',
      extendedRequestId: undefined,
      cfId: undefined,
      attempts: 1,
      totalRetryDelay: 0
    }
  },
  metadata: {
    httpStatusCode: 403,
    requestId: '68a18a1d-2119-42f0-aeae-9253766fdf5a',
    extendedRequestId: undefined,
    cfId: undefined,
    attempts: 1,
    totalRetryDelay: 0
  }
}
However, this does not happen every time. Could there be a hidden race condition during stack rollback, where the permissions policy is deleted before the function and its assigned role are deleted?
Regression Issue
- Select this option if this issue appears to be a regression.
Last Known Working CDK Version
No response
Expected Behavior
I would expect the automatically created custom resource and IAM role to have the appropriate permissions.
Current Behavior
Intermittently, cluster deletion during stack rollback fails with a 403 AccessDeniedException on eks:DeleteCluster.
Reproduction Steps
I don't have specific reproduction steps since the behavior is transient. This is basically the cluster resource definition:
Cluster cluster = new Cluster(this, "EKSCluster", new ClusterProps() {
    Vpc = props.Vpc,
    VpcSubnets = new SubnetSelection[] {
        new SubnetSelection() { SubnetType = SubnetType.PRIVATE_ISOLATED }
    },
    DefaultCapacity = 0,
    Version = KubernetesVersion.V1_31,
    PlaceClusterHandlerInVpc = false,
    EndpointAccess = EndpointAccess.PUBLIC_AND_PRIVATE,
    KubectlLayer = kubetctlLayer,
    SecurityGroup = controlPlaneSG,
    MastersRole = props.AdminRole,
    ClusterName = props.ClusterName,
    ClusterLogging = new ClusterLoggingTypes[] {
        ClusterLoggingTypes.CONTROLLER_MANAGER,
        ClusterLoggingTypes.AUTHENTICATOR,
        ClusterLoggingTypes.API,
        ClusterLoggingTypes.AUDIT,
        ClusterLoggingTypes.SCHEDULER
    }
});
Possible Solution
No response
Additional Information/Context
No response
CDK CLI Version
2.164.1
Framework Version
No response
Node.js Version
20
OS
darwin
Language
.NET
Language Version
No response
Other information
No response