
Handle csi-controller container restarts when upgrading vCenter #3250


Description

@luanaBanana

Hi everybody👋

The csi-controller container of the deployment restarts when we upgrade vCenter. This makes sense, as the controller is connected to vCenter:

{"level":"error","time":"2025-04-17T12:56:01.270266778Z","caller":"volume/listview.go:258","msg":"WaitForUpdates returned err: destroy property filter failed with Post \"URL/sdk\": read tcp IP->IP: read: connection reset by peer after failing to wait for updates: Post \"https: │
│ {"level":"error","time":"2025-04-17T12:56:01.27559112Z","caller":"vsphere/virtualcenter.go:321","msg":"failed to obtain user session with err: Post \"URL/sdk\": dial tcp IP: connect: connection refused","TraceId":"1bc50271-a80e-46e3-bc2b-1a7a44ff4290","stacktrace":"sigs.k8s.io/vsphere-csi-dri │
│ {"level":"error","time":"2025-04-17T12:56:01.275647877Z","caller":"vsphere/virtualcenter.go:270","msg":"Cannot connect to vCenter with err: Post \"URL/sdk\": dial tcp IP: connect: connection refused","TraceId":"1bc50271-a80e-46e3-bc2b-1a7a44ff4290","stacktrace":"sigs.k8s.io/vsphere-csi-driver │
│ {"level":"error","time":"2025-04-17T12:56:01.280137357Z","caller":"vsphere/virtualcenter.go:277","msg":"Could not logout of VC session. Error: Post \"URL/sdk\": dial tcp IP: connect: connection refused","TraceId":"1bc50271-a80e-46e3-bc2b-1a7a44ff4290","stacktrace":"sigs.k8s.io/vsphere-csi-dri │
│ {"level":"error","time":"2025-04-17T12:56:01.280174099Z","caller":"volume/listview.go:218","msg":"failed to connect to vCenter. err: Post \"URL/sdk\": dial tcp IP: connect: connection refused","TraceId":"1bc50271-a80e-46e3-bc2b-1a7a44ff4290","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg │
...

Issue Summary

Upgrading vCenter can take anywhere from 30 to 60 minutes. During this time, we observe that the csi-controller pod restarts repeatedly, which leads to several operational challenges:

Observed Problems

- False Alerts During Planned Downtime
Our monitoring setup triggers alerts on repeated pod restarts, so we get paged during this planned outage. The only current workaround is to manually silence the alerts for the duration of the upgrade, which is neither ideal nor scalable for our operations (see the scheduled-mute sketch after this list).

- Pod Rescheduling and Volume Attachment Failure
A more critical issue occurs if a pod using a vSphere volume is evicted or killed (e.g., by the chaos monkey) and rescheduled to another node. Since vCenter is down, the volume cannot be reattached and the pod cannot start. We assume it will remain pending until the upgrade completes and vCenter becomes available again?

- Liveness Probe Limitations
Adjusting the liveness probe to prevent restarts doesn't seem practical: the vCenter downtime is too long, and loosening the probe that far would reduce its usefulness in detecting real failures (the second sketch after this list puts numbers on this).
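
On the alerting point: instead of silencing alerts by hand, one option is a scheduled mute in Alertmanager. Below is a minimal sketch of a routing-tree fragment, assuming a Prometheus/Alertmanager stack; the alert name, namespace, and interval name are placeholders, not something from this issue:

```yaml
# Alertmanager: mute restart alerts for the CSI controller during a
# planned vCenter upgrade window. All names below are illustrative,
# and this is only a fragment of the routing tree (no receivers shown).
route:
  routes:
    - matchers:
        - alertname = "KubePodCrashLooping"
        - namespace = "vmware-system-csi"
      mute_time_intervals:
        - vcenter-upgrade-window

time_intervals:
  - name: vcenter-upgrade-window
    time_intervals:
      - times:
          - start_time: "22:00"
            end_time: "23:59"
        # weekdays / days_of_month / months can pin this interval
        # to the actual maintenance date
```

For ad-hoc windows, a one-off silence created from the upgrade runbook also works, e.g. `amtool silence add namespace="vmware-system-csi" --duration=2h --comment="vCenter upgrade"`, but that still requires someone (or some automation) to remember to run it.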
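
And to put numbers on the liveness probe point: riding out a 60-minute outage means tolerating roughly an hour of consecutive probe failures. A sketch, assuming the upstream manifest's HTTP /healthz probe on the csi-controller container (values are illustrative, not a recommendation):

```yaml
# Loosened liveness probe on the csi-controller container:
# 20 failures x 180 s period = 60 min before the kubelet restarts it.
# The flip side: a genuine hang now also goes undetected for up to an hour.
livenessProbe:
  httpGet:
    path: /healthz
    port: healthz
  periodSeconds: 180
  timeoutSeconds: 10
  failureThreshold: 20
```

That detection delay is exactly why this doesn't feel like a real fix.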

Questions

  • Has anyone else encountered this behavior during vCenter upgrades?
  • Are there best practices or recommended workarounds to mitigate these issues?
  • Any suggestions on how to handle volume reattachment failures more gracefully?

Setup

  • K8s 1.31
  • vSphere Client version 8.0.3.00400
  • vSphere CSI driver: v3.3.1

