Occasional retrieving IMDS metadata failed on AL2023 #2262

@brianrowlett

Description

/kind bug

We currently have AL2 nodes and have never had a problem with this.

After switching to AL2023 nodes, the ebs-csi-node pod occasionally fails to retrieve metadata from IMDS. This only appears to happen at node startup; if we restart the ebs-csi-node daemonset, it retrieves metadata from IMDS reliably.

It does appear to fall back successfully to getting metadata from Kubernetes, but we think IMDS should not be failing like this.

What happened?

I1211 20:07:09.634316       1 main.go:157] "Initializing metadata"
E1211 20:07:14.635517       1 metadata.go:51] "Retrieving IMDS metadata failed, falling back to Kubernetes metadata" err="could not get EC2 instance identity metadata: operation error ec2imds: GetInstanceIdentityDocument, request canceled, context deadline exceeded"
I1211 20:07:14.645753       1 metadata.go:55] "Retrieved metadata from Kubernetes"
I1211 20:07:14.646110       1 driver.go:69] "Driver Information" Driver="ebs.csi.aws.com" Version="v1.34.0"
I1211 20:07:16.167040       1 node.go:941] "CSINode Allocatable value is set" nodeName="ip-100-64-153-121.ec2.internal" count=31

What you expected to happen?

I1211 20:24:41.226237       1 main.go:157] "Initializing metadata"
I1211 20:24:42.479940       1 metadata.go:48] "Retrieved metadata from IMDS"
I1211 20:24:42.480783       1 driver.go:69] "Driver Information" Driver="ebs.csi.aws.com" Version="v1.34.0"
I1211 20:24:43.497952       1 node.go:941] "CSINode Allocatable value is set" nodeName="ip-100-64-251-153.ec2.internal" count=31

How to reproduce it (as minimally and precisely as possible)?

Anything else we need to know?:

Our launch template looks like:

  NodeLaunchTemplate2023:
    Type: AWS::EC2::LaunchTemplate
    Condition: CreateManagedNodegroup2023
    DependsOn:
    - Cluster
    Properties:
      LaunchTemplateData:
        BlockDeviceMappings:
        - DeviceName: /dev/xvda
          Ebs:
            DeleteOnTermination: true
            Encrypted: true
            VolumeSize: !Ref WorkerVolumeSize
            VolumeType: gp3
        MetadataOptions:
          HttpEndpoint: enabled
          HttpPutResponseHopLimit: 2
          HttpTokens: required
          InstanceMetadataTags: disabled
        NetworkInterfaces:
        - DeviceIndex: 0
          Groups:
          - !GetAtt Cluster.ClusterSecurityGroupId

And our managed nodegroup looks like:

  ManagedNodegroup2023a:
    Type: AWS::EKS::Nodegroup
    Condition: CreateManagedNodegroup2023
    DependsOn:
    - Cluster
    - NodeInstanceRole
    - NodeLaunchTemplate2023
    Properties:
      AmiType: AL2023_x86_64_STANDARD
      CapacityType: ON_DEMAND
      ClusterName: !Ref Cluster
      InstanceTypes:
      - !Ref WorkerInstanceType
      LaunchTemplate:
        Id: !Ref NodeLaunchTemplate2023
        Version: !GetAtt NodeLaunchTemplate2023.LatestVersionNumber
      NodeRole: !GetAtt NodeInstanceRole.Arn
      ScalingConfig:
        DesiredSize: !Ref NodegroupSizeDesired
        MaxSize: !Ref NodegroupSizeMaximum
        MinSize: !Ref NodegroupSizeMinimum
      Subnets:
      - Fn::ImportValue:
          !Sub "${VpcName}-private-a"
      UpdateConfig:
        MaxUnavailable: 1

Environment

  • Kubernetes version (use kubectl version): v1.30.6-eks-7f9249a
  • Driver version: v1.34.0

Metadata

Labels

kind/bug: Categorizes issue or PR as related to a bug.
priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
