Description
What happened:
The metrics-server fails with the following error:
E0916 14:23:37.254021 1 scraper.go:149] "Failed to scrape node" err="Get \"https://10.34.50.99:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-50-99.eu-west-1.compute.internal"
What you expected to happen:
The metrics-server retrieves metrics from all nodes successfully.
Anything else we need to know?:
This problem happens when the autoscaler (we use Karpenter) adds or removes nodes. For a brief period, a newly added node fails to serve metrics on its /metrics/resource endpoint, causing the HPA to record many FailedToGetResourceMetric events.
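One way to narrow this down while a node is inside the failing window (using the node IP from the error above) is to probe the kubelet's TLS endpoint directly from a pod or the host network; a handshake-level failure rules out authentication and networking as causes. This is only a diagnostic sketch:

```shell
# -k skips CA verification; we only care whether the TLS handshake completes.
# An authentication problem would return HTTP 401/403 after a successful
# handshake, whereas the symptom here is a failure during the handshake itself.
curl -kv https://10.34.50.99:10250/metrics/resource
```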
Environment:
- Kubernetes distribution (GKE, EKS, Kubeadm, the hard way, etc.): EKS 17
- Container Network Setup (flannel, calico, etc.):
  - Amazon VPC CNI plugin for Kubernetes
  - CoreDNS
  - KubeProxy
  Note: this issue is not network related.
- Kubernetes version (use kubectl version):
  Client Version: v1.30.3
  Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
  Server Version: v1.28.12-eks-2f46c53
- Metrics Server manifest
spoiler for Metrics Server manifest:
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "6"
    meta.helm.sh/release-name: metrics-server
    meta.helm.sh/release-namespace: kube-system
  creationTimestamp: "2023-01-31T14:48:01Z"
  generation: 6
  labels:
    app.kubernetes.io/instance: metrics-server
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: metrics-server
    app.kubernetes.io/version: 0.7.2
    application: none
    customer-level: none
    environment: dev
    helm.sh/chart: metrics-server-7.2.14
    owner: squad-platform
  name: metrics-server
  namespace: kube-system
  resourceVersion: "386279220"
  uid: ed6fd84e-fdb6-46e5-b25c-80a09135476f
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: metrics-server
      app.kubernetes.io/name: metrics-server
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/restartedAt: "2023-07-17T14:25:57+02:00"
        prometheus.io/path: /metrics
        prometheus.io/port: "8443"
        prometheus.io/scrape: "true"
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: metrics-server
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: metrics-server
        app.kubernetes.io/version: 0.7.2
        application: none
        customer-level: none
        environment: dev
        helm.sh/chart: metrics-server-7.2.14
        owner: squad-platform
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchLabels:
                  app.kubernetes.io/instance: metrics-server
                  app.kubernetes.io/name: metrics-server
              topologyKey: kubernetes.io/hostname
            weight: 1
      automountServiceAccountToken: true
      containers:
      - args:
        - --secure-port=8443
        - --kubelet-preferred-address-types=InternalIP,Hostname,InternalDNS,ExternalDNS,ExternalIP
        - --metric-resolution=20s
        command:
        - metrics-server
        image: docker.io/bitnami/metrics-server:0.7.2-debian-12-r3
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /livez
            port: https
            scheme: HTTPS
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: metrics-server
        ports:
        - containerPort: 8443
          name: https
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /readyz
            port: https
            scheme: HTTPS
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: 150m
            ephemeral-storage: 2Gi
            memory: 192Mi
          requests:
            cpu: 100m
            ephemeral-storage: 50Mi
            memory: 128Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          privileged: false
          readOnlyRootFilesystem: true
          runAsGroup: 1001
          runAsNonRoot: true
          runAsUser: 1001
          seLinuxOptions: {}
          seccompProfile:
            type: RuntimeDefault
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /tmp
          name: empty-dir
          subPath: tmp-dir
        - mountPath: /opt/bitnami/metrics-server/apiserver.local.config
          name: empty-dir
          subPath: app-tmp-dir
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 1001
        fsGroupChangePolicy: Always
      serviceAccount: metrics-server
      serviceAccountName: metrics-server
      terminationGracePeriodSeconds: 30
      volumes:
      - emptyDir: {}
        name: empty-dir
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2023-01-31T14:48:01Z"
    lastUpdateTime: "2024-09-16T18:56:36Z"
    message: ReplicaSet "metrics-server-75fb4689b7" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2024-09-17T06:56:46Z"
    lastUpdateTime: "2024-09-17T06:56:46Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 6
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1
- Kubelet config:
spoiler for Kubelet config:
{
  "kubeletconfig": {
    "enableServer": true,
    "syncFrequency": "1m0s",
    "fileCheckFrequency": "20s",
    "httpCheckFrequency": "20s",
    "address": "0.0.0.0",
    "port": 10250,
    "tlsCipherSuites": [
      "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256",
      "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
      "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305",
      "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
      "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305",
      "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384",
      "TLS_RSA_WITH_AES_256_GCM_SHA384",
      "TLS_RSA_WITH_AES_128_GCM_SHA256"
    ],
    "serverTLSBootstrap": true,
    "authentication": {
      "x509": {
        "clientCAFile": "/etc/kubernetes/pki/ca.crt"
      },
      "webhook": {
        "enabled": true,
        "cacheTTL": "2m0s"
      },
      "anonymous": {
        "enabled": false
      }
    },
    "authorization": {
      "mode": "Webhook",
      "webhook": {
        "cacheAuthorizedTTL": "5m0s",
        "cacheUnauthorizedTTL": "30s"
      }
    },
    "registryPullQPS": 5,
    "registryBurst": 10,
    "eventRecordQPS": 50,
    "eventBurst": 100,
    "enableDebuggingHandlers": true,
    "healthzPort": 10248,
    "healthzBindAddress": "127.0.0.1",
    "oomScoreAdj": -999,
    "clusterDomain": "cluster.local",
    "clusterDNS": [
      "172.31.0.10"
    ],
    "streamingConnectionIdleTimeout": "4h0m0s",
    "nodeStatusUpdateFrequency": "10s",
    "nodeStatusReportFrequency": "5m0s",
    "nodeLeaseDurationSeconds": 40,
    "imageMinimumGCAge": "2m0s",
    "imageGCHighThresholdPercent": 85,
    "imageGCLowThresholdPercent": 80,
    "volumeStatsAggPeriod": "1m0s",
    "cgroupRoot": "/",
    "cgroupsPerQOS": true,
    "cgroupDriver": "systemd",
    "cpuManagerPolicy": "none",
    "cpuManagerReconcilePeriod": "10s",
    "memoryManagerPolicy": "None",
    "topologyManagerPolicy": "none",
    "topologyManagerScope": "container",
    "runtimeRequestTimeout": "2m0s",
    "hairpinMode": "hairpin-veth",
    "maxPods": 58,
    "podPidsLimit": -1,
    "resolvConf": "/etc/resolv.conf",
    "cpuCFSQuota": true,
    "cpuCFSQuotaPeriod": "100ms",
    "nodeStatusMaxImages": 50,
    "maxOpenFiles": 1000000,
    "contentType": "application/vnd.kubernetes.protobuf",
    "kubeAPIQPS": 50,
    "kubeAPIBurst": 100,
    "serializeImagePulls": false,
    "evictionHard": {
      "memory.available": "100Mi",
      "nodefs.available": "10%",
      "nodefs.inodesFree": "5%"
    },
    "evictionPressureTransitionPeriod": "5m0s",
    "enableControllerAttachDetach": true,
    "protectKernelDefaults": true,
    "makeIPTablesUtilChains": true,
    "iptablesMasqueradeBit": 14,
    "iptablesDropBit": 15,
    "featureGates": {
      "RotateKubeletServerCertificate": true
    },
    "failSwapOn": true,
    "memorySwap": {},
    "containerLogMaxSize": "10Mi",
    "containerLogMaxFiles": 5,
    "configMapAndSecretChangeDetectionStrategy": "Watch",
    "kubeReserved": {
      "cpu": "90m",
      "ephemeral-storage": "1Gi",
      "memory": "893Mi"
    },
    "systemReservedCgroup": "/system",
    "kubeReservedCgroup": "/runtime",
    "enforceNodeAllocatable": [
      "pods"
    ],
    "volumePluginDir": "/usr/libexec/kubernetes/kubelet-plugins/volume/exec/",
    "providerID": "aws:///eu-west-1a/i-02487199a514a5c47",
    "logging": {
      "format": "text",
      "flushFrequency": "5s",
      "verbosity": 2,
      "options": {
        "json": {
          "infoBufferSize": "0"
        }
      }
    },
    "enableSystemLogHandler": true,
    "enableSystemLogQuery": false,
    "shutdownGracePeriod": "0s",
    "shutdownGracePeriodCriticalPods": "0s",
    "enableProfilingHandler": true,
    "enableDebugFlagsHandler": true,
    "seccompDefault": false,
    "memoryThrottlingFactor": 0.9,
    "registerNode": true,
    "localStorageCapacityIsolation": true,
    "containerRuntimeEndpoint": "unix:///run/containerd/containerd.sock"
  }
}
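Worth noting from the config above: "serverTLSBootstrap" is true (and the RotateKubeletServerCertificate feature gate is on), so each new kubelet obtains its serving certificate for port 10250 via a CertificateSigningRequest rather than self-signing. Until that CSR is approved and the certificate is issued, the kubelet has no serving certificate and TLS handshakes against it fail with exactly "remote error: tls: internal error", which would line up with the errors appearing only while Karpenter is adding nodes. If that theory holds, freshly launched nodes should briefly show a Pending CSR for the kubernetes.io/kubelet-serving signer:

```shell
# List serving-certificate CSRs; on a node that just joined,
# one may briefly sit in the Pending state.
kubectl get csr --field-selector=spec.signerName=kubernetes.io/kubelet-serving

# If any stay Pending (i.e. no auto-approver is acting on them),
# they can be approved manually. <csr-name> is a placeholder.
kubectl certificate approve <csr-name>
```

If the CSRs are approved within seconds and the scrape errors clear on the next cycle, the transient window is just the certificate-issuance delay at node join rather than a metrics-server bug.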
- Metrics server logs:
spoiler for Metrics Server logs:
I0917 06:56:25.141023 1 serving.go:374] Generated self-signed cert (apiserver.local.config/certificates/apiserver.crt, apiserver.local.config/certificates/apiserver.key)
I0917 06:56:29.056085 1 handler.go:275] Adding GroupVersion metrics.k8s.io v1beta1 to ResourceManager
I0917 06:56:29.260891 1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0917 06:56:29.260914 1 shared_informer.go:311] Waiting for caches to sync for RequestHeaderAuthRequestController
I0917 06:56:29.260949 1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0917 06:56:29.260975 1 shared_informer.go:311] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0917 06:56:29.260993 1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0917 06:56:29.260998 1 shared_informer.go:311] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0917 06:56:29.261280 1 dynamic_serving_content.go:132] "Starting controller" name="serving-cert::apiserver.local.config/certificates/apiserver.crt::apiserver.local.config/certificates/apiserver.key"
I0917 06:56:29.261308 1 secure_serving.go:213] Serving securely on [::]:8443
I0917 06:56:29.261348 1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0917 06:56:29.541673 1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0917 06:56:29.543057 1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0917 06:56:29.543068 1 shared_informer.go:318] Caches are synced for RequestHeaderAuthRequestController
te error: tls: internal error" node="ip-10-34-40-218.eu-west-1.compute.internal"
E0913 07:11:41.371925 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.36.55:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-36-55.eu-west-1.compute.internal"
E0913 07:15:41.362052 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.37.167:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-37-167.eu-west-1.compute.internal"
E0913 07:19:41.367676 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.34.76:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-34-76.eu-west-1.compute.internal"
E0913 07:23:41.376918 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.43.160:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-43-160.eu-west-1.compute.internal"
E0913 07:26:41.376301 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.41.241:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-41-241.eu-west-1.compute.internal"
E0913 07:30:41.339715 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.44.219:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-44-219.eu-west-1.compute.internal"
E0913 07:33:41.354489 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.47.203:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-47-203.eu-west-1.compute.internal"
E0913 07:38:41.359856 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.45.209:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-45-209.eu-west-1.compute.internal"
E0913 07:40:41.350880 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.44.2:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-44-2.eu-west-1.compute.internal"
E0913 07:42:41.372912 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.41.99:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-41-99.eu-west-1.compute.internal"
E0913 07:45:41.374758 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.41.27:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-41-27.eu-west-1.compute.internal"
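For triage it helps to know how many distinct nodes are affected and whether that set matches the nodes Karpenter just launched. A small sketch that extracts the affected node names from a metrics-server log; the inline sample file is hypothetical and just mimics the format of the lines above:

```shell
# Build a small sample log in the same format as the scraper errors above.
cat > /tmp/metrics-server.log <<'EOF'
E0913 07:11:41.371925 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.36.55:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-36-55.eu-west-1.compute.internal"
E0913 07:15:41.362052 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.37.167:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-37-167.eu-west-1.compute.internal"
E0913 07:19:41.367676 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.36.55:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-36-55.eu-west-1.compute.internal"
EOF

# Keep only TLS-handshake failures, pull out the node="..." value,
# and deduplicate to get the set of affected nodes.
grep 'tls: internal error' /tmp/metrics-server.log \
  | sed -n 's/.*node="\([^"]*\)".*/\1/p' \
  | sort -u > /tmp/affected-nodes.txt

cat /tmp/affected-nodes.txt
```

Cross-referencing that list against `kubectl get nodes --sort-by=.metadata.creationTimestamp` would show whether only freshly created nodes are failing.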
- Status of Metrics API:
spoiler for Status of Metrics API:
Name:         v1beta1.metrics.k8s.io
Namespace:
Labels:       app.kubernetes.io/instance=metrics-server
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=metrics-server
              app.kubernetes.io/version=0.7.2
              application=none
              customer-level=none
              environment=dev
              helm.sh/chart=metrics-server-7.2.14
              owner=squad-platform
Annotations:  meta.helm.sh/release-name: metrics-server
              meta.helm.sh/release-namespace: kube-system
API Version:  apiregistration.k8s.io/v1
Kind:         APIService
Metadata:
  Creation Timestamp:  2023-01-31T14:48:01Z
  Resource Version:    386279218
  UID:                 7503894f-8f1f-4f61-9df8-a663cdd0298d
Spec:
  Group:                     metrics.k8s.io
  Group Priority Minimum:    100
  Insecure Skip TLS Verify:  true
  Service:
    Name:       metrics-server
    Namespace:  kube-system
    Port:       443
  Version:           v1beta1
  Version Priority:  100
Status:
  Conditions:
    Last Transition Time:  2024-09-17T06:56:46Z
    Message:               all checks passed
    Reason:                Passed
    Status:                True
    Type:                  Available
Events:  <none>
/kind bug