
Handle csi-controller container restarts when upgrading vCenter #3250


Description

@luanaBanana

Hi everybody👋

The csi-controller container of the deployment restarts when we upgrade vCenter. This makes sense, as the controller is connected to vCenter:

{"level":"error","time":"2025-04-17T12:56:01.270266778Z","caller":"volume/listview.go:258","msg":"WaitForUpdates returned err: destroy property filter failed with Post \"URL/sdk\": read tcp IP->IP: read: connection reset by peer after failing to wait for updates: Post \"https: │
│ {"level":"error","time":"2025-04-17T12:56:01.27559112Z","caller":"vsphere/virtualcenter.go:321","msg":"failed to obtain user session with err: Post \"URL/sdk\": dial tcp IP: connect: connection refused","TraceId":"1bc50271-a80e-46e3-bc2b-1a7a44ff4290","stacktrace":"sigs.k8s.io/vsphere-csi-dri │
│ {"level":"error","time":"2025-04-17T12:56:01.275647877Z","caller":"vsphere/virtualcenter.go:270","msg":"Cannot connect to vCenter with err: Post \"URL/sdk\": dial tcp IP: connect: connection refused","TraceId":"1bc50271-a80e-46e3-bc2b-1a7a44ff4290","stacktrace":"sigs.k8s.io/vsphere-csi-driver │
│ {"level":"error","time":"2025-04-17T12:56:01.280137357Z","caller":"vsphere/virtualcenter.go:277","msg":"Could not logout of VC session. Error: Post \"URL/sdk\": dial tcp IP: connect: connection refused","TraceId":"1bc50271-a80e-46e3-bc2b-1a7a44ff4290","stacktrace":"sigs.k8s.io/vsphere-csi-dri │
│ {"level":"error","time":"2025-04-17T12:56:01.280174099Z","caller":"volume/listview.go:218","msg":"failed to connect to vCenter. err: Post \"URL/sdk\": dial tcp IP: connect: connection refused","TraceId":"1bc50271-a80e-46e3-bc2b-1a7a44ff4290","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg │
...

Issue Summary

Upgrading vCenter can take anywhere from 30 to 60 minutes. During this time, we observe that the csi-controller pod restarts repeatedly, which leads to several operational challenges:

Observed Problems

- False Alerts During Planned Downtime
Our monitoring setup triggers alerts on repeated pod restarts, so we get paged during this planned outage. The only current workaround is to manually silence the alerts for the duration of the upgrade, which is neither ideal nor scalable for our operations (see the scheduled-mute sketch after this list).

- Pod Rescheduling and Volume Attachment Failure
A more critical issue occurs if a pod using a vSphere volume is evicted or killed (e.g., by the chaos monkey) and rescheduled to another node. Since vCenter is down, the volume cannot be reattached and the pod cannot start. We assume it will remain pending until the upgrade completes and vCenter becomes available again?

- Liveness Probe Limitations
Adjusting the liveness probe to prevent restarts doesn't seem practical: the vCenter downtime is too long, and loosening the probe that far would reduce its usefulness in detecting real failures (the second sketch after this list puts numbers on this).
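
On the alerting point: instead of silencing alerts by hand, one option is a scheduled mute in Alertmanager. Below is a minimal sketch of a routing-tree fragment, assuming a Prometheus/Alertmanager stack; the alert name, namespace, and interval name are placeholders, not something from this issue:

```yaml
# Alertmanager: mute restart alerts for the CSI controller during a
# planned vCenter upgrade window. All names below are illustrative,
# and this is only a fragment of the routing tree (no receivers shown).
route:
  routes:
    - matchers:
        - alertname = "KubePodCrashLooping"
        - namespace = "vmware-system-csi"
      mute_time_intervals:
        - vcenter-upgrade-window

time_intervals:
  - name: vcenter-upgrade-window
    time_intervals:
      - times:
          - start_time: "22:00"
            end_time: "23:59"
        # weekdays / days_of_month / months can pin this interval
        # to the actual maintenance date
```

For ad-hoc windows, a one-off silence created from the upgrade runbook also works, e.g. `amtool silence add namespace="vmware-system-csi" --duration=2h --comment="vCenter upgrade"`, but that still requires someone (or some automation) to remember to run it.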
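
And to put numbers on the liveness probe point: riding out a 60-minute outage means tolerating roughly an hour of consecutive probe failures. A sketch, assuming the upstream manifest's HTTP /healthz probe on the csi-controller container (values are illustrative, not a recommendation):

```yaml
# Loosened liveness probe on the csi-controller container:
# 20 failures x 180 s period = 60 min before the kubelet restarts it.
# The flip side: a genuine hang now also goes undetected for up to an hour.
livenessProbe:
  httpGet:
    path: /healthz
    port: healthz
  periodSeconds: 180
  timeoutSeconds: 10
  failureThreshold: 20
```

That detection delay is exactly why this doesn't feel like a real fix.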

Questions

  • Has anyone else encountered this behavior during vCenter upgrades?
  • Are there best practices or recommended workarounds to mitigate these issues?
  • Any suggestions on how to handle volume reattachment failures more gracefully?

Setup

  • K8s 1.31
  • vSphere Client version 8.0.3.00400
  • vSphere CSI driver: v3.3.1

