metrics-server fails to get metrics from recently started nodes #1571

Open
@IuryAlves

Description

What happened:

The metrics-server fails with error: E0916 14:23:37.254021 1 scraper.go:149] "Failed to scrape node" err="Get \"https://10.34.50.99:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-50-99.eu-west-1.compute.internal"

What you expected to happen:

The metrics-server should be able to scrape metrics from all nodes, including recently started ones.

Anything else we need to know?:

This problem happens when the autoscaler (we use Karpenter) adds or removes nodes. For a brief period after a node starts, it fails to respond on metrics/resource, causing the HPA to record many FailedToGetResourceMetric events.
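This symptom is consistent with the kubelet's serving certificate not being ready yet: the kubelet config below has serverTLSBootstrap: true, so a new node requests its serving certificate via a CertificateSigningRequest and answers with "tls: internal error" on port 10250 until that CSR is approved and issued. A minimal sketch of how to check for this (the CSR listing below is canned sample data for illustration, not output from this cluster):

```shell
# Canned sample of `kubectl get csr` output; on the affected cluster run:
#   kubectl get csr --sort-by=.metadata.creationTimestamp
cat <<'EOF' > /tmp/csr.txt
NAME        AGE   SIGNERNAME                                    REQUESTOR                                               CONDITION
csr-9wvgt   2m    kubernetes.io/kubelet-serving                 system:node:ip-10-34-50-99.eu-west-1.compute.internal   Pending
csr-x7k2p   40m   kubernetes.io/kube-apiserver-client-kubelet   system:node:ip-10-34-36-55.eu-west-1.compute.internal   Approved,Issued
EOF

# A kubelet-serving CSR stuck in Pending for a node that metrics-server
# cannot scrape would confirm the certificate-bootstrap explanation:
awk '$3 == "kubernetes.io/kubelet-serving" && $NF == "Pending"' /tmp/csr.txt
```

If serving CSRs sit in Pending, the fix would be on the approval side (e.g. an automated CSR approver), not in metrics-server itself.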

Environment:

  • Kubernetes distribution (GKE, EKS, Kubeadm, the hard way, etc.):
    EKS 17

  • Container Network Setup (flannel, calico, etc.):

  • Amazon VPC CNI plugin for Kubernetes
  • CoreDNS
  • KubeProxy

Note: This issue is not network related

  • Kubernetes version (use kubectl version):
Client Version: v1.30.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.12-eks-2f46c53
  • Metrics Server manifest
spoiler for Metrics Server manifest:
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "6"
    meta.helm.sh/release-name: metrics-server
    meta.helm.sh/release-namespace: kube-system
  creationTimestamp: "2023-01-31T14:48:01Z"
  generation: 6
  labels:
    app.kubernetes.io/instance: metrics-server
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: metrics-server
    app.kubernetes.io/version: 0.7.2
    application: none
    customer-level: none
    environment: dev
    helm.sh/chart: metrics-server-7.2.14
    owner: squad-platform
  name: metrics-server
  namespace: kube-system
  resourceVersion: "386279220"
  uid: ed6fd84e-fdb6-46e5-b25c-80a09135476f
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: metrics-server
      app.kubernetes.io/name: metrics-server
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/restartedAt: "2023-07-17T14:25:57+02:00"
        prometheus.io/path: /metrics
        prometheus.io/port: "8443"
        prometheus.io/scrape: "true"
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: metrics-server
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: metrics-server
        app.kubernetes.io/version: 0.7.2
        application: none
        customer-level: none
        environment: dev
        helm.sh/chart: metrics-server-7.2.14
        owner: squad-platform
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchLabels:
                  app.kubernetes.io/instance: metrics-server
                  app.kubernetes.io/name: metrics-server
              topologyKey: kubernetes.io/hostname
            weight: 1
      automountServiceAccountToken: true
      containers:
      - args:
        - --secure-port=8443
        - --kubelet-preferred-address-types=InternalIP,Hostname,InternalDNS,ExternalDNS,ExternalIP
        - --metric-resolution=20s
        command:
        - metrics-server
        image: docker.io/bitnami/metrics-server:0.7.2-debian-12-r3
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /livez
            port: https
            scheme: HTTPS
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: metrics-server
        ports:
        - containerPort: 8443
          name: https
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /readyz
            port: https
            scheme: HTTPS
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: 150m
            ephemeral-storage: 2Gi
            memory: 192Mi
          requests:
            cpu: 100m
            ephemeral-storage: 50Mi
            memory: 128Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          privileged: false
          readOnlyRootFilesystem: true
          runAsGroup: 1001
          runAsNonRoot: true
          runAsUser: 1001
          seLinuxOptions: {}
          seccompProfile:
            type: RuntimeDefault
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /tmp
          name: empty-dir
          subPath: tmp-dir
        - mountPath: /opt/bitnami/metrics-server/apiserver.local.config
          name: empty-dir
          subPath: app-tmp-dir
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 1001
        fsGroupChangePolicy: Always
      serviceAccount: metrics-server
      serviceAccountName: metrics-server
      terminationGracePeriodSeconds: 30
      volumes:
      - emptyDir: {}
        name: empty-dir
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2023-01-31T14:48:01Z"
    lastUpdateTime: "2024-09-16T18:56:36Z"
    message: ReplicaSet "metrics-server-75fb4689b7" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2024-09-17T06:56:46Z"
    lastUpdateTime: "2024-09-17T06:56:46Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 6
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1
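A commonly mentioned (but insecure) workaround is to let metrics-server skip verification of kubelet serving certificates. A sketch of the extra container arg, shown here only to illustrate the option, not as a recommendation for production:

```yaml
containers:
- args:
  - --secure-port=8443
  - --kubelet-preferred-address-types=InternalIP,Hostname,InternalDNS,ExternalDNS,ExternalIP
  - --metric-resolution=20s
  # Debug-only: skips TLS verification of kubelet serving certificates.
  - --kubelet-insecure-tls
```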
  • Kubelet config:
spoiler for Kubelet config:
{
  "kubeletconfig": {
    "enableServer": true,
    "syncFrequency": "1m0s",
    "fileCheckFrequency": "20s",
    "httpCheckFrequency": "20s",
    "address": "0.0.0.0",
    "port": 10250,
    "tlsCipherSuites": [
      "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256",
      "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
      "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305",
      "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
      "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305",
      "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384",
      "TLS_RSA_WITH_AES_256_GCM_SHA384",
      "TLS_RSA_WITH_AES_128_GCM_SHA256"
    ],
    "serverTLSBootstrap": true,
    "authentication": {
      "x509": {
        "clientCAFile": "/etc/kubernetes/pki/ca.crt"
      },
      "webhook": {
        "enabled": true,
        "cacheTTL": "2m0s"
      },
      "anonymous": {
        "enabled": false
      }
    },
    "authorization": {
      "mode": "Webhook",
      "webhook": {
        "cacheAuthorizedTTL": "5m0s",
        "cacheUnauthorizedTTL": "30s"
      }
    },
    "registryPullQPS": 5,
    "registryBurst": 10,
    "eventRecordQPS": 50,
    "eventBurst": 100,
    "enableDebuggingHandlers": true,
    "healthzPort": 10248,
    "healthzBindAddress": "127.0.0.1",
    "oomScoreAdj": -999,
    "clusterDomain": "cluster.local",
    "clusterDNS": [
      "172.31.0.10"
    ],
    "streamingConnectionIdleTimeout": "4h0m0s",
    "nodeStatusUpdateFrequency": "10s",
    "nodeStatusReportFrequency": "5m0s",
    "nodeLeaseDurationSeconds": 40,
    "imageMinimumGCAge": "2m0s",
    "imageGCHighThresholdPercent": 85,
    "imageGCLowThresholdPercent": 80,
    "volumeStatsAggPeriod": "1m0s",
    "cgroupRoot": "/",
    "cgroupsPerQOS": true,
    "cgroupDriver": "systemd",
    "cpuManagerPolicy": "none",
    "cpuManagerReconcilePeriod": "10s",
    "memoryManagerPolicy": "None",
    "topologyManagerPolicy": "none",
    "topologyManagerScope": "container",
    "runtimeRequestTimeout": "2m0s",
    "hairpinMode": "hairpin-veth",
    "maxPods": 58,
    "podPidsLimit": -1,
    "resolvConf": "/etc/resolv.conf",
    "cpuCFSQuota": true,
    "cpuCFSQuotaPeriod": "100ms",
    "nodeStatusMaxImages": 50,
    "maxOpenFiles": 1000000,
    "contentType": "application/vnd.kubernetes.protobuf",
    "kubeAPIQPS": 50,
    "kubeAPIBurst": 100,
    "serializeImagePulls": false,
    "evictionHard": {
      "memory.available": "100Mi",
      "nodefs.available": "10%",
      "nodefs.inodesFree": "5%"
    },
    "evictionPressureTransitionPeriod": "5m0s",
    "enableControllerAttachDetach": true,
    "protectKernelDefaults": true,
    "makeIPTablesUtilChains": true,
    "iptablesMasqueradeBit": 14,
    "iptablesDropBit": 15,
    "featureGates": {
      "RotateKubeletServerCertificate": true
    },
    "failSwapOn": true,
    "memorySwap": {},
    "containerLogMaxSize": "10Mi",
    "containerLogMaxFiles": 5,
    "configMapAndSecretChangeDetectionStrategy": "Watch",
    "kubeReserved": {
      "cpu": "90m",
      "ephemeral-storage": "1Gi",
      "memory": "893Mi"
    },
    "systemReservedCgroup": "/system",
    "kubeReservedCgroup": "/runtime",
    "enforceNodeAllocatable": [
      "pods"
    ],
    "volumePluginDir": "/usr/libexec/kubernetes/kubelet-plugins/volume/exec/",
    "providerID": "aws:///eu-west-1a/i-02487199a514a5c47",
    "logging": {
      "format": "text",
      "flushFrequency": "5s",
      "verbosity": 2,
      "options": {
        "json": {
          "infoBufferSize": "0"
        }
      }
    },
    "enableSystemLogHandler": true,
    "enableSystemLogQuery": false,
    "shutdownGracePeriod": "0s",
    "shutdownGracePeriodCriticalPods": "0s",
    "enableProfilingHandler": true,
    "enableDebugFlagsHandler": true,
    "seccompDefault": false,
    "memoryThrottlingFactor": 0.9,
    "registerNode": true,
    "localStorageCapacityIsolation": true,
    "containerRuntimeEndpoint": "unix:///run/containerd/containerd.sock"
  }
}
  • Metrics server logs:
spoiler for Metrics Server logs:
I0917 06:56:25.141023       1 serving.go:374] Generated self-signed cert (apiserver.local.config/certificates/apiserver.crt, apiserver.local.config/certificates/apiserver.key)
I0917 06:56:29.056085       1 handler.go:275] Adding GroupVersion metrics.k8s.io v1beta1 to ResourceManager
I0917 06:56:29.260891       1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0917 06:56:29.260914       1 shared_informer.go:311] Waiting for caches to sync for RequestHeaderAuthRequestController
I0917 06:56:29.260949       1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0917 06:56:29.260975       1 shared_informer.go:311] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0917 06:56:29.260993       1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0917 06:56:29.260998       1 shared_informer.go:311] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0917 06:56:29.261280       1 dynamic_serving_content.go:132] "Starting controller" name="serving-cert::apiserver.local.config/certificates/apiserver.crt::apiserver.local.config/certificates/apiserver.key"
I0917 06:56:29.261308       1 secure_serving.go:213] Serving securely on [::]:8443
I0917 06:56:29.261348       1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0917 06:56:29.541673       1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0917 06:56:29.543057       1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0917 06:56:29.543068       1 shared_informer.go:318] Caches are synced for RequestHeaderAuthRequestController
te error: tls: internal error" node="ip-10-34-40-218.eu-west-1.compute.internal"
E0913 07:11:41.371925       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.36.55:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-36-55.eu-west-1.compute.internal"
E0913 07:15:41.362052       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.37.167:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-37-167.eu-west-1.compute.internal"
E0913 07:19:41.367676       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.34.76:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-34-76.eu-west-1.compute.internal"
E0913 07:23:41.376918       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.43.160:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-43-160.eu-west-1.compute.internal"
E0913 07:26:41.376301       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.41.241:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-41-241.eu-west-1.compute.internal"
E0913 07:30:41.339715       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.44.219:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-44-219.eu-west-1.compute.internal"
E0913 07:33:41.354489       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.47.203:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-47-203.eu-west-1.compute.internal"
E0913 07:38:41.359856       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.45.209:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-45-209.eu-west-1.compute.internal"
E0913 07:40:41.350880       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.44.2:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-44-2.eu-west-1.compute.internal"
E0913 07:42:41.372912       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.41.99:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-41-99.eu-west-1.compute.internal"
E0913 07:45:41.374758       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.41.27:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-41-27.eu-west-1.compute.internal"
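For triage, the failing nodes can be extracted from the log above and correlated with node creation times. A quick sketch, with one sample line inlined; on the real deployment you would pipe in `kubectl -n kube-system logs deploy/metrics-server` instead:

```shell
# One sample scraper error line written to a file for illustration.
cat <<'EOF' > /tmp/ms.log
E0913 07:11:41.371925       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.36.55:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-36-55.eu-west-1.compute.internal"
EOF

# List each node that failed a scrape, deduplicated:
grep -o 'node="[^"]*"' /tmp/ms.log | sort -u | sed 's/^node="//; s/"$//'
```

If every listed node turns out to be only a few minutes old at the time of the error, that supports the new-node certificate-bootstrap theory.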
  • Status of Metrics API:
spoiler for Status of Metrics API:
Name:         v1beta1.metrics.k8s.io
Namespace:
Labels:       app.kubernetes.io/instance=metrics-server
            app.kubernetes.io/managed-by=Helm
            app.kubernetes.io/name=metrics-server
            app.kubernetes.io/version=0.7.2
            application=none
            customer-level=none
            environment=dev
            helm.sh/chart=metrics-server-7.2.14
            owner=squad-platform
Annotations:  meta.helm.sh/release-name: metrics-server
            meta.helm.sh/release-namespace: kube-system
API Version:  apiregistration.k8s.io/v1
Kind:         APIService
Metadata:
Creation Timestamp:  2023-01-31T14:48:01Z
Resource Version:    386279218
UID:                 7503894f-8f1f-4f61-9df8-a663cdd0298d
Spec:
Group:                     metrics.k8s.io
Group Priority Minimum:    100
Insecure Skip TLS Verify:  true
Service:
  Name:            metrics-server
  Namespace:       kube-system
  Port:            443
Version:           v1beta1
Version Priority:  100
Status:
Conditions:
  Last Transition Time:  2024-09-17T06:56:46Z
  Message:               all checks passed
  Reason:                Passed
  Status:                True
  Type:                  Available
Events:                    <none>

/kind bug

    Labels

    kind/support: Categorizes issue or PR as a support question.
    triage/accepted: Indicates an issue or PR is ready to be actively worked on.
