Failed to watch *v1.VolumeAttachment #7663

Open
@madchap

Description

Which component are you using?: cluster-autoscaler on AWS

/area cluster-autoscaler

What version of the component are you using?: 9.45

Component version: Helm chart 9.45

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: v1.31.2
Kustomize Version: v5.4.2
Server Version: v1.31.3-eks-56e63d8

What environment is this in?: AWS EKS

What did you expect to happen?: I am trying to figure out why the autoscaler does not honor my --ok-total-unready-count=0 setting. The node that enters the NotReady state appears to be stuck with many terminating pods, and at the same time I see an error in the autoscaler log.

The error is the following:

failed to list *v1.VolumeAttachment: volumeattachments.storage.k8s.io is forbidden: User "system:serviceaccount:kube-system:cluster-autoscaler" cannot list resource "volumeattachments" in API group "storage.k8s.io" at the cluster scope
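
The missing permission can probably be confirmed independently of the autoscaler log with an impersonated `kubectl auth can-i` check. This is only a sketch: the function name `check_va_access` and the `KUBECTL` override variable are hypothetical helpers introduced here, and the service account name is the one from the error message.

```shell
#!/bin/sh
# Hypothetical helper: asks the API server whether the autoscaler's service
# account may list VolumeAttachments at cluster scope.
# KUBECTL can be overridden (an assumption of this sketch, handy for dry runs).
check_va_access() {
    "${KUBECTL:-kubectl}" auth can-i list volumeattachments.storage.k8s.io \
        --as=system:serviceaccount:kube-system:cluster-autoscaler
}

# Usage (kubectl answers "yes" or "no"):
#   check_va_access
```

Running it requires that the calling user is allowed to impersonate service accounts.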

When looking at the clusterrole created by the helm chart, I am not seeing this particular resource:

$ k describe clusterrole cluster-autoscaler-aws-cluster-autoscaler
Name:         cluster-autoscaler-aws-cluster-autoscaler
Labels:       app.kubernetes.io/instance=cluster-autoscaler
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=aws-cluster-autoscaler
              helm.sh/chart=cluster-autoscaler-9.45.0
Annotations:  meta.helm.sh/release-name: cluster-autoscaler
              meta.helm.sh/release-namespace: kube-system
PolicyRule:
  Resources                            Non-Resource URLs  Resource Names        Verbs
  ---------                            -----------------  --------------        -----
  endpoints                            []                 []                    [create patch]
  events                               []                 []                    [create patch]
  pods/eviction                        []                 []                    [create]
  leases.coordination.k8s.io           []                 []                    [create]
  jobs.extensions                      []                 []                    [get list patch watch]
  endpoints                            []                 [cluster-autoscaler]  [get update]
  leases.coordination.k8s.io           []                 [cluster-autoscaler]  [get update]
  configmaps                           []                 []                    [list watch get]
  pods/status                          []                 []                    [update]
  nodes                                []                 []                    [watch list create delete get update]
  jobs.batch                           []                 []                    [watch list get patch]
  namespaces                           []                 []                    [watch list get]
  persistentvolumeclaims               []                 []                    [watch list get]
  persistentvolumes                    []                 []                    [watch list get]
  pods                                 []                 []                    [watch list get]
  replicationcontrollers               []                 []                    [watch list get]
  services                             []                 []                    [watch list get]
  daemonsets.apps                      []                 []                    [watch list get]
  replicasets.apps                     []                 []                    [watch list get]
  statefulsets.apps                    []                 []                    [watch list get]
  cronjobs.batch                       []                 []                    [watch list get]
  daemonsets.extensions                []                 []                    [watch list get]
  replicasets.extensions               []                 []                    [watch list get]
  csidrivers.storage.k8s.io            []                 []                    [watch list get]
  csinodes.storage.k8s.io              []                 []                    [watch list get]
  csistoragecapacities.storage.k8s.io  []                 []                    [watch list get]
  storageclasses.storage.k8s.io        []                 []                    [watch list get]
  poddisruptionbudgets.policy          []                 []                    [watch list]
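
As a possible stopgap until the chart's ClusterRole covers this resource, a supplementary ClusterRole and ClusterRoleBinding could grant the missing read access. The sketch below is untested against this cluster; the object name `cluster-autoscaler-volumeattachments` and the function `emit_va_rbac` are made up for illustration, and the verbs mirror what the other storage.k8s.io resources already have in the table above. It would only address the permission error; whether it changes the NotReady behaviour is a separate question.

```shell
#!/bin/sh
# Hypothetical workaround: print a supplementary ClusterRole and binding that
# grant get/list/watch on VolumeAttachments to the autoscaler's service account.
# The object names are assumptions; the service account and namespace come from
# the error message.
emit_va_rbac() {
    cat <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-autoscaler-volumeattachments
rules:
  - apiGroups: ["storage.k8s.io"]
    resources: ["volumeattachments"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-autoscaler-volumeattachments
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-autoscaler-volumeattachments
subjects:
  - kind: ServiceAccount
    name: cluster-autoscaler
    namespace: kube-system
EOF
}

# Usage: emit_va_rbac | kubectl apply -f -
```

Keeping the extra rule in its own ClusterRole means a later `helm upgrade` should not overwrite it, and it can be deleted once the chart ships the permission itself.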

I am not sure, but given the --ok-total-unready-count=0, I would expect the node which enters the NotReady state to be fairly quickly replaced by a node that can handle things.

What happened instead?:
The NotReady node sticks around for quite some time, with a bunch of pods in Terminating state. It eventually goes away after some time (maybe 30-45 minutes).

How to reproduce it (as minimally and precisely as possible):
I am afraid I can't :-/

Something is causing my nodes to go into the NotReady state. I suspect heavy over-commitment on them, especially of memory, after which the kubelet bails out.

Anything else we need to know?:

A log iteration where I see the volumeattachment error:

I0106 17:52:52.606768       1 static_autoscaler.go:274] Starting main loop
I0106 17:52:52.609136       1 aws_manager.go:188] Found multiple availability zones for ASG "eks-default_node_group-20241211130258966500000008-7cc9dae8-63f0-63d5-bce1-642871ebd84f"; using eu-central-2b for failure-domain.beta.kubernetes.io/zone label
I0106 17:52:52.758096       1 filter_out_schedulable.go:65] Filtering out schedulables
I0106 17:52:52.758116       1 filter_out_schedulable.go:122] 0 pods marked as unschedulable can be scheduled.
I0106 17:52:52.758125       1 filter_out_schedulable.go:85] No schedulable pods
I0106 17:52:52.758130       1 filter_out_daemon_sets.go:47] Filtered out 0 daemon set pods, 0 unschedulable pods left
I0106 17:52:52.758150       1 static_autoscaler.go:532] No unschedulable pods
I0106 17:52:52.758168       1 static_autoscaler.go:555] Calculating unneeded nodes
I0106 17:52:52.758182       1 pre_filtering_processor.go:67] Skipping ip-10-0-12-37.eu-central-2.compute.internal - node group min size reached (current: 3, min: 3)
I0106 17:52:52.758204       1 pre_filtering_processor.go:67] Skipping ip-10-0-28-107.eu-central-2.compute.internal - node group min size reached (current: 3, min: 3)
I0106 17:52:52.758209       1 pre_filtering_processor.go:67] Skipping ip-10-0-36-38.eu-central-2.compute.internal - node group min size reached (current: 3, min: 3)
I0106 17:52:52.758213       1 pre_filtering_processor.go:67] Skipping ip-10-0-36-82.eu-central-2.compute.internal - node group min size reached (current: 3, min: 3)
I0106 17:52:52.758473       1 static_autoscaler.go:598] Scale down status: lastScaleUpTime=2025-01-06 16:16:32.949347114 +0000 UTC m=-3582.400670434 lastScaleDownDeleteTime=2025-01-06 16:16:32.949347114 +0000 UTC m=-3582.400670434 lastScaleDownFailTime=2025-01-06 16:16:32.949347114 +0000 UTC m=-3582.400670434 scaleDownForbidden=false scaleDownInCooldown=true
I0106 17:52:52.759061       1 orchestrator.go:322] ScaleUpToNodeGroupMinSize: NodeGroup eks-default_node_group-20241211130258966500000008-7cc9dae8-63f0-63d5-bce1-642871ebd84f, TargetSize 3, MinSize 3, MaxSize 5
I0106 17:52:52.759135       1 orchestrator.go:366] ScaleUpToNodeGroupMinSize: scale up not needed
I0106 17:52:56.201819       1 reflector.go:349] Listing and watching *v1.VolumeAttachment from pkg/mod/k8s.io/client-go@…/tools/cache/reflector.go:251
W0106 17:52:56.206308       1 reflector.go:569] pkg/mod/k8s.io/client-go@…/tools/cache/reflector.go:251: failed to list *v1.VolumeAttachment: volumeattachments.storage.k8s.io is forbidden: User "system:serviceaccount:kube-system:cluster-autoscaler" cannot list resource "volumeattachments" in API group "storage.k8s.io" at the cluster scope
E0106 17:52:56.206341       1 reflector.go:166] "Unhandled Error" err="pkg/mod/k8s.io/client-go@…/tools/cache/reflector.go:251: Failed to watch *v1.VolumeAttachment: failed to list *v1.VolumeAttachment: volumeattachments.storage.k8s.io is forbidden: User \"system:serviceaccount:kube-system:cluster-autoscaler\" cannot list resource \"volumeattachments\" in API group \"storage.k8s.io\" at the cluster scope" logger="UnhandledError"
I0106 17:52:57.975501       1 reflector.go:879] pkg/mod/k8s.io/client-go@…/tools/cache/reflector.go:251: Watch close - *v1.Node total 29 items received
