vSphere CSI does not recover after temporary proxy outage — new session created but all operations fail with “The session is not authenticated” #3936

@upics

Environment

  • vSphere CSI Driver version: v3.6.0
  • Kubernetes distribution: k3s
  • k3s version: 1.33
  • vSphere version: 8.0.3.00700
  • Cluster type: Vanilla Kubernetes (k3s)
  • Networking: vCenter is reachable only through an HTTP/HTTPS proxy
  • Proxy configuration: Injected via pod environment variables (HTTP_PROXY, HTTPS_PROXY, NO_PROXY) in vsphere-csi-controller

Summary
When the outbound proxy becomes temporarily unavailable, the vSphere CSI controller begins receiving 503 Service Unavailable errors when communicating with vCenter.
This is expected during a network disruption.
The problem is that the CSI driver does not recover once the proxy is available again.
Although the controller logs that it created a new vCenter session, all subsequent CSI operations fail. The observed symptoms are:

  • ServerFaultCode: The session is not authenticated
  • 401 Unauthorized
  • ListView destruction failure due to invalid session
  • CNS QueryAllVolume failing
  • Property collector WaitForUpdates failing

The system remains in this broken state until the vsphere-csi-controller pod is manually restarted.

Steps to Reproduce

  1. Deploy vSphere CSI on a k3s cluster where vsphere-csi-controller uses an outbound proxy to reach vCenter.
  2. Temporarily disrupt or block the proxy (simulate network drop or forced outage).
  3. Observe repeated CSI errors such as:
    • Post https:///sdk: Service Unavailable
    • Failures retrieving datacenters and datastore maps
  4. Restore proxy connectivity.
  5. Observe that CSI:
    • Attempts to reconnect
    • Logs creation of a new vCenter session
  6. Despite the successful reconnection, CSI continues to fail with authentication errors indefinitely.
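
For anyone trying to reproduce step 2 in a lab, one option is to point the controller's proxy variables at a stub that answers everything with 503 for the duration of the simulated outage. The sketch below is only an approximation under that assumption: `Always503` and `start_stub` are hypothetical names, and a real outage may instead show up as timeouts or connection resets. The stub also answers CONNECT, since HTTPS traffic through a proxy starts with a CONNECT request.

```python
import http.server
import threading


class Always503(http.server.BaseHTTPRequestHandler):
    """Hypothetical stand-in for an unavailable proxy: every request gets 503."""

    def _unavailable(self):
        # Drain any request body so the client can read the response cleanly.
        length = int(self.headers.get("Content-Length", 0))
        if length:
            self.rfile.read(length)
        self.send_response(503, "Service Unavailable")
        self.send_header("Content-Length", "0")
        self.send_header("Connection", "close")
        self.end_headers()

    do_GET = do_POST = do_CONNECT = _unavailable

    def log_message(self, fmt, *args):
        pass  # keep the stub quiet


def start_stub(port=0):
    """Start the stub on 127.0.0.1 in a background thread and return the server."""
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), Always503)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `server = start_stub(3128)` starts the outage window (the port is arbitrary), and `server.shutdown()` ends it, after which the proxy variables can point back at the real proxy.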

Expected Behavior
After the proxy comes back online, the CSI driver should:

  • Successfully authenticate with vCenter
  • Refresh internal state (sessions, listviews, property collectors, tagging clients, etc.)
  • Resume normal operation without requiring a pod restart
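
To make the expected behavior concrete, here is a minimal sketch of the recovery pattern I would expect: on a "not authenticated" error, log in again, rebuild every object that was bound to the old session, and retry once. This is not the driver's actual code. `FakeVCenter`, `Driver`, and `NotAuthenticated` are hypothetical in-memory stand-ins, and the sketch makes no claim about where the real driver goes wrong.

```python
class NotAuthenticated(Exception):
    """Stands in for 'ServerFaultCode: The session is not authenticated'."""


class FakeVCenter:
    """Hypothetical in-memory stand-in for vCenter session handling."""

    def __init__(self):
        self._next_id = 0
        self.valid_sessions = set()

    def login(self):
        self._next_id += 1
        session_id = f"session-{self._next_id}"
        self.valid_sessions.add(session_id)
        return session_id

    def call(self, session_id, op):
        if session_id not in self.valid_sessions:
            raise NotAuthenticated(op)
        return f"{op} ok"


class Driver:
    """Hypothetical driver state: one session plus objects derived from it."""

    def __init__(self, vc):
        self.vc = vc
        self.connect()

    def connect(self):
        # Re-authenticate, then rebuild everything bound to the old session
        # so that no derived object keeps pointing at a dead session.
        self.session_id = self.vc.login()
        self.list_view = ("listview", self.session_id)
        self.collector = ("collector", self.session_id)

    def run(self, op):
        try:
            return self.vc.call(self.list_view[1], op)
        except NotAuthenticated:
            # One reconnect-and-retry; a second failure propagates to the caller.
            self.connect()
            return self.vc.call(self.list_view[1], op)
```

The point of the sketch is that the session and the objects derived from it are refreshed together in `connect()`, so a retry after reconnecting never uses state from the previous session.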

Actual Behavior

  • Proxy outage triggers 503 Service Unavailable failures — expected.
  • CSI fails to properly clean up existing sessions (Logout also fails because the proxy is down).
  • CSI eventually creates a new session:
New session ID = <id>
VirtualCenter.connect() successfully created new client

  • Immediately after that, every operation from CSI fails with:
ServerFaultCode: The session is not authenticated
    or
401 Unauthorized

  • Even ListView teardown and recreation fails intermittently:

failed to destroy listview object. err: The session is not authenticated

  • The controller never recovers until manually restarted.
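
The stuck state described above can be confirmed from a controller log with a small scan like the following. It is a sketch that matches only the message fragments quoted in this issue; the function name is hypothetical and the real log lines carry extra fields (timestamps, trace IDs) that the patterns simply ignore.

```python
import re

NEW_SESSION = re.compile(r"New session ID = ")
AUTH_FAILURE = re.compile(r"The session is not authenticated|401 Unauthorized")


def failures_after_last_reconnect(lines):
    """Count auth failures logged after the most recent 'New session ID' line.

    Returns None if no reconnect was logged at all.
    """
    count = None
    for line in lines:
        if NEW_SESSION.search(line):
            count = 0  # a fresh session was logged; start counting again
        elif count is not None and AUTH_FAILURE.search(line):
            count += 1
    return count
```

A non-zero result that keeps growing across successive log snapshots would match the behavior reported here: a new session is logged, yet authentication errors continue.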

Logs are attached below:

vsphere-csi-controller-6d5486f67c-59x78_vsphere-csi-controller.log
