Description
Which component are you using?: cluster-autoscaler on AWS
/area cluster-autoscaler
What version of the component are you using?:
Component version: Helm chart 9.45.0
What k8s version are you using (kubectl version)?:
$ kubectl version
Client Version: v1.31.2
Kustomize Version: v5.4.2
Server Version: v1.31.3-eks-56e63d8
What environment is this in?: AWS EKS
What did you expect to happen?: I am trying to figure out why the autoscaler does not honor my --ok-total-unready-count=0 flag. The node that enters the NotReady state gets stuck with many Terminating pods, and at the same time I observed the following error in the autoscaler log:
failed to list *v1.VolumeAttachment: volumeattachments.storage.k8s.io is forbidden: User "system:serviceaccount:kube-system:cluster-autoscaler" cannot list resource "volumeattachments" in API group "storage.k8s.io" at the cluster scope
When looking at the clusterrole created by the helm chart, I am not seeing this particular resource:
$ k describe clusterrole cluster-autoscaler-aws-cluster-autoscaler
Name: cluster-autoscaler-aws-cluster-autoscaler
Labels: app.kubernetes.io/instance=cluster-autoscaler
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=aws-cluster-autoscaler
helm.sh/chart=cluster-autoscaler-9.45.0
Annotations: meta.helm.sh/release-name: cluster-autoscaler
meta.helm.sh/release-namespace: kube-system
PolicyRule:
Resources Non-Resource URLs Resource Names Verbs
--------- ----------------- -------------- -----
endpoints [] [] [create patch]
events [] [] [create patch]
pods/eviction [] [] [create]
leases.coordination.k8s.io [] [] [create]
jobs.extensions [] [] [get list patch watch]
endpoints [] [cluster-autoscaler] [get update]
leases.coordination.k8s.io [] [cluster-autoscaler] [get update]
configmaps [] [] [list watch get]
pods/status [] [] [update]
nodes [] [] [watch list create delete get update]
jobs.batch [] [] [watch list get patch]
namespaces [] [] [watch list get]
persistentvolumeclaims [] [] [watch list get]
persistentvolumes [] [] [watch list get]
pods [] [] [watch list get]
replicationcontrollers [] [] [watch list get]
services [] [] [watch list get]
daemonsets.apps [] [] [watch list get]
replicasets.apps [] [] [watch list get]
statefulsets.apps [] [] [watch list get]
cronjobs.batch [] [] [watch list get]
daemonsets.extensions [] [] [watch list get]
replicasets.extensions [] [] [watch list get]
csidrivers.storage.k8s.io [] [] [watch list get]
csinodes.storage.k8s.io [] [] [watch list get]
csistoragecapacities.storage.k8s.io [] [] [watch list get]
storageclasses.storage.k8s.io [] [] [watch list get]
poddisruptionbudgets.policy [] [] [watch list]
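If I read the error correctly, the ClusterRole would also need a rule roughly like the one below. This is only a sketch of what I would add, derived from the error message, not something taken from the chart:

# sketch only: my guess at the missing rule, based on the error message
- apiGroups: ["storage.k8s.io"]
  resources: ["volumeattachments"]
  verbs: ["get", "list", "watch"]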
I am not sure, but given --ok-total-unready-count=0, I would expect a node that enters the NotReady state to be replaced fairly quickly by a node that can handle the load.
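For reference, the flag is passed through the Helm chart's extraArgs; my values look roughly like this (paraphrased from my values file, other settings omitted):

extraArgs:
  # paraphrased; other flags omitted
  ok-total-unready-count: 0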
What happened instead?:
The NotReady node sticks around for quite some time, with a bunch of pods stuck in the Terminating state. It eventually goes away after a while (maybe 30-45 minutes).
How to reproduce it (as minimally and precisely as possible):
Something is causing my nodes to go NotReady; I think it is too much over-commitment on them, especially on memory (the kubelet then bails out). I am afraid I can't reproduce it reliably :-/
Anything else we need to know?:
A log iteration where I see the VolumeAttachment error:
I0106 17:52:52.606768 1 static_autoscaler.go:274] Starting main loop
I0106 17:52:52.609136 1 aws_manager.go:188] Found multiple availability zones for ASG "eks-default_node_group-20241211130258966500000008-7cc9dae8-63f0-63d5-bce1-642871ebd84f"; using eu-central-2b for failure-domain.beta.kubernetes.io/zone label
I0106 17:52:52.758096 1 filter_out_schedulable.go:65] Filtering out schedulables
I0106 17:52:52.758116 1 filter_out_schedulable.go:122] 0 pods marked as unschedulable can be scheduled.
I0106 17:52:52.758125 1 filter_out_schedulable.go:85] No schedulable pods
I0106 17:52:52.758130 1 filter_out_daemon_sets.go:47] Filtered out 0 daemon set pods, 0 unschedulable pods left
I0106 17:52:52.758150 1 static_autoscaler.go:532] No unschedulable pods
I0106 17:52:52.758168 1 static_autoscaler.go:555] Calculating unneeded nodes
I0106 17:52:52.758182 1 pre_filtering_processor.go:67] Skipping ip-10-0-12-37.eu-central-2.compute.internal - node group min size reached (current: 3, min: 3)
I0106 17:52:52.758204 1 pre_filtering_processor.go:67] Skipping ip-10-0-28-107.eu-central-2.compute.internal - node group min size reached (current: 3, min: 3)
I0106 17:52:52.758209 1 pre_filtering_processor.go:67] Skipping ip-10-0-36-38.eu-central-2.compute.internal - node group min size reached (current: 3, min: 3)
I0106 17:52:52.758213 1 pre_filtering_processor.go:67] Skipping ip-10-0-36-82.eu-central-2.compute.internal - node group min size reached (current: 3, min: 3)
I0106 17:52:52.758473 1 static_autoscaler.go:598] Scale down status: lastScaleUpTime=2025-01-06 16:16:32.949347114 +0000 UTC m=-3582.400670434 lastScaleDownDeleteTime=2025-01-06 16:16:32.949347114 +0000 UTC m=-3582.400670434 lastScaleDownFailTime=2025-01-06 16:16:32.949347114 +0000 UTC m=-3582.400670434 scaleDownForbidden=false scaleDownInCooldown=true
I0106 17:52:52.759061 1 orchestrator.go:322] ScaleUpToNodeGroupMinSize: NodeGroup eks-default_node_group-20241211130258966500000008-7cc9dae8-63f0-63d5-bce1-642871ebd84f, TargetSize 3, MinSize 3, MaxSize 5
I0106 17:52:52.759135 1 orchestrator.go:366] ScaleUpToNodeGroupMinSize: scale up not needed
I0106 17:52:56.201819 1 reflector.go:349] Listing and watching *v1.VolumeAttachment from pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251
W0106 17:52:56.206308 1 reflector.go:569] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251: failed to list *v1.VolumeAttachment: volumeattachments.storage.k8s.io is forbidden: User "system:serviceaccount:kube-system:cluster-autoscaler" cannot list resource "volumeattachments" in API group "storage.k8s.io" at the cluster scope
E0106 17:52:56.206341 1 reflector.go:166] "Unhandled Error" err="pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251: Failed to watch *v1.VolumeAttachment: failed to list *v1.VolumeAttachment: volumeattachments.storage.k8s.io is forbidden: User \"system:serviceaccount:kube-system:cluster-autoscaler\" cannot list resource \"volumeattachments\" in API group \"storage.k8s.io\" at the cluster scope" logger="UnhandledError"
I0106 17:52:57.975501 1 reflector.go:879] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:251: Watch close - *v1.Node total 29 items received
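For what it's worth, the permission gap can also be checked outside the autoscaler with kubectl auth can-i while impersonating the service account; given the ClusterRole above, I would expect this to print "no":

$ kubectl auth can-i list volumeattachments.storage.k8s.io --as=system:serviceaccount:kube-system:cluster-autoscaler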