/kind bug
We currently run AL2 nodes and have never had a problem with this.
After switching to AL2023 nodes, the ebs-csi-node pod occasionally fails to retrieve metadata from IMDS. This only appears to happen at node startup; if we restart the ebs-csi-node DaemonSet, it retrieves metadata from IMDS reliably.
The driver does fall back to Kubernetes metadata successfully, but we think IMDS should not be failing like this.
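For reference, the failure can be probed outside the driver with the same aws-sdk-go-v2 client the error message names (`ec2imds: GetInstanceIdentityDocument`). Below is a minimal Go sketch, not the driver's actual code: the five-second timeout mirrors the gap between the two log lines under "What happened?", and the retry loop is our own addition for observing when IMDS becomes reachable after boot.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/aws/aws-sdk-go-v2/feature/ec2/imds"
)

func main() {
	client := imds.New(imds.Options{})

	// Retry so we can see how long after startup IMDS stays unreachable.
	for attempt := 1; attempt <= 5; attempt++ {
		// Five seconds mirrors the deadline implied by the driver's logs.
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		doc, err := client.GetInstanceIdentityDocument(ctx, &imds.GetInstanceIdentityDocumentInput{})
		cancel()
		if err == nil {
			log.Printf("IMDS reachable: instance %s in %s", doc.InstanceID, doc.Region)
			return
		}
		log.Printf("attempt %d failed: %v", attempt, err)
		time.Sleep(2 * time.Second)
	}
	log.Fatal("IMDS never became reachable")
}
```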
What happened?
```
I1211 20:07:09.634316 1 main.go:157] "Initializing metadata"
E1211 20:07:14.635517 1 metadata.go:51] "Retrieving IMDS metadata failed, falling back to Kubernetes metadata" err="could not get EC2 instance identity metadata: operation error ec2imds: GetInstanceIdentityDocument, request canceled, context deadline exceeded"
I1211 20:07:14.645753 1 metadata.go:55] "Retrieved metadata from Kubernetes"
I1211 20:07:14.646110 1 driver.go:69] "Driver Information" Driver="ebs.csi.aws.com" Version="v1.34.0"
I1211 20:07:16.167040 1 node.go:941] "CSINode Allocatable value is set" nodeName="ip-100-64-153-121.ec2.internal" count=31
```
What you expected to happen?
```
I1211 20:24:41.226237 1 main.go:157] "Initializing metadata"
I1211 20:24:42.479940 1 metadata.go:48] "Retrieved metadata from IMDS"
I1211 20:24:42.480783 1 driver.go:69] "Driver Information" Driver="ebs.csi.aws.com" Version="v1.34.0"
I1211 20:24:43.497952 1 node.go:941] "CSINode Allocatable value is set" nodeName="ip-100-64-251-153.ec2.internal" count=31
```
How to reproduce it (as minimally and precisely as possible)?
Anything else we need to know?:
Our launch template looks like:
```yaml
NodeLaunchTemplate2023:
  Type: AWS::EC2::LaunchTemplate
  Condition: CreateManagedNodegroup2023
  DependsOn:
    - Cluster
  Properties:
    LaunchTemplateData:
      BlockDeviceMappings:
        - DeviceName: /dev/xvda
          Ebs:
            DeleteOnTermination: true
            Encrypted: true
            VolumeSize: !Ref WorkerVolumeSize
            VolumeType: gp3
      MetadataOptions:
        HttpEndpoint: enabled
        HttpPutResponseHopLimit: 2
        HttpTokens: required
        InstanceMetadataTags: disabled
      NetworkInterfaces:
        - DeviceIndex: 0
          Groups:
            - !GetAtt Cluster.ClusterSecurityGroupId
```
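A hop limit of 2 is the usual requirement for pods that are not on the host network to reach IMDSv2, which is why we set it here. As a sanity check that these options actually landed on the running instances, here is a hedged sketch using `DescribeInstances` from aws-sdk-go-v2; the instance ID is a placeholder:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO())
	if err != nil {
		log.Fatal(err)
	}
	client := ec2.NewFromConfig(cfg)

	out, err := client.DescribeInstances(context.TODO(), &ec2.DescribeInstancesInput{
		InstanceIds: []string{"i-0123456789abcdef0"}, // placeholder instance ID
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, res := range out.Reservations {
		for _, inst := range res.Instances {
			// MetadataOptions reflects what EC2 applied from the launch template.
			if mo := inst.MetadataOptions; mo != nil {
				fmt.Printf("tokens=%s endpoint=%s hopLimit=%d\n",
					mo.HttpTokens, mo.HttpEndpoint, *mo.HttpPutResponseHopLimit)
			}
		}
	}
}
```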
And our managed nodegroup looks like:
```yaml
ManagedNodegroup2023a:
  Type: AWS::EKS::Nodegroup
  Condition: CreateManagedNodegroup2023
  DependsOn:
    - Cluster
    - NodeInstanceRole
    - NodeLaunchTemplate2023
  Properties:
    AmiType: AL2023_x86_64_STANDARD
    CapacityType: ON_DEMAND
    ClusterName: !Ref Cluster
    InstanceTypes:
      - !Ref WorkerInstanceType
    LaunchTemplate:
      Id: !Ref NodeLaunchTemplate2023
      Version: !GetAtt NodeLaunchTemplate2023.LatestVersionNumber
    NodeRole: !GetAtt NodeInstanceRole.Arn
    ScalingConfig:
      DesiredSize: !Ref NodegroupSizeDesired
      MaxSize: !Ref NodegroupSizeMaximum
      MinSize: !Ref NodegroupSizeMinimum
    Subnets:
      - Fn::ImportValue:
          !Sub "${VpcName}-private-a"
    UpdateConfig:
      MaxUnavailable: 1
```
Environment
- Kubernetes version (use `kubectl version`): v1.30.6-eks-7f9249a
- Driver version: v1.34.0