/kind bug
What steps did you take and what happened:
After upgrading to Nutanix Cloud Provider v0.6.0, the cloud-controller-manager no longer discovers newly added VMs. This results in nodes either being continuously recreated or remaining tainted and unschedulable.
Issue 1: VM repeatedly recreated / VM not found
After provisioning a VM and joining it to the Kubernetes cluster, the node keeps getting recreated. The following logs are observed in the nutanix-cloud-controller-manager pod:
I1218 15:26:11.890746 1 node_controller.go:429] Initializing node devk8s01-md-worker-default-jz9kk-khkw5 with cloud provider
2025-12-18 15:26:11.891Z INFO - GET https://10.10.201.20:9440/api/vmm/v4.1/ahv/config/vms/3fbf7e7f-509d-4612-b58a-c28eb2f5fc39
2025-12-18 15:26:11.906Z INFO - HTTP/1.1 404 NOT FOUND
I1218 15:26:11.906873 1 node_controller.go:233] error syncing 'devk8s01-md-worker-default-jz9kk-khkw5': failed to get instance metadata for node devk8s01-md-worker-default-jz9kk-khkw5: failed to get VM: API call failed: {"data":{"error":[{"message":"Failed to perform the operation on the VM with UUID '3fbf7e7f-509d-4612-b58a-c28eb2f5fc39', because it is not found.","severity":"ERROR","code":"VMM-30100","locale":"en-US","errorGroup":"VM_NOT_FOUND","argumentsMap":{"vm_uuid":"3fbf7e7f-509d-4612-b58a-c28eb2f5fc39"},"$objectType":"vmm.v4.error.AppMessage"}],"$errorItemDiscriminator":"List<vmm.v4.error.AppMessage>","$objectType":"vmm.v4.error.ErrorResponse"},"$dataItemDiscriminator":"vmm.v4.error.ErrorResponse"}, requeuing
E1218 15:26:11.906912 1 node_controller.go:244] "Unhandled Error" err="error syncing 'devk8s01-md-worker-default-jz9kk-khkw5': failed to get instance metadata for node devk8s01-md-worker-default-jz9kk-khkw5: failed to get VM: API call failed: {...}, requeuing" logger="UnhandledError"
The VM exists and is visible in Prism, but the CCM fails to retrieve it via the API and continuously retries.
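When triaging this, it can help to confirm the failure mode from the raw API response. A minimal sketch (illustrative only, not CCM code) that parses the v4 error payload shown in the log above and pulls out the error group and the VM UUID the CCM was querying:

```python
import json

# Error payload as returned by the Prism Central v4 VMM API,
# copied verbatim from the node_controller log line above.
payload = json.loads("""
{"data":{"error":[{"message":"Failed to perform the operation on the VM with UUID '3fbf7e7f-509d-4612-b58a-c28eb2f5fc39', because it is not found.","severity":"ERROR","code":"VMM-30100","locale":"en-US","errorGroup":"VM_NOT_FOUND","argumentsMap":{"vm_uuid":"3fbf7e7f-509d-4612-b58a-c28eb2f5fc39"},"$objectType":"vmm.v4.error.AppMessage"}],"$errorItemDiscriminator":"List<vmm.v4.error.AppMessage>","$objectType":"vmm.v4.error.ErrorResponse"},"$dataItemDiscriminator":"vmm.v4.error.ErrorResponse"}
""")

first = payload["data"]["error"][0]
print(first["errorGroup"])               # VM_NOT_FOUND
print(first["argumentsMap"]["vm_uuid"])  # 3fbf7e7f-509d-4612-b58a-c28eb2f5fc39
```

The `VM_NOT_FOUND` error group with code `VMM-30100` confirms Prism Central's v4 endpoint is rejecting the lookup by UUID even though the VM is visible in the Prism UI.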
Issue 2: VM discovered but node remains tainted
On another cluster, the VM is successfully found, but the node never becomes initialized and remains tainted:
2025-12-23 15:09:46.712Z INFO - GET https://10.10.201.20:9440/api/vmm/v4.1/ahv/config/vms/74d90883-9a67-461c-5a75-f8070d9a6470
2025-12-23 15:09:46.736Z INFO - HTTP/1.1 200 OK
I1223 15:09:46.738543 1 node_controller.go:233] error syncing 'pock8s01-md-worker-default-kxh9x-wl2wt': failed to get instance metadata for node pock8s01-md-worker-default-kxh9x-wl2wt: unable to determine network interfaces from VM with UUID 74d90883-9a67-461c-5a75-f8070d9a6470, requeuing
The node remains tainted indefinitely:
Taints:
node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule
node.cluster.x-k8s.io/uninitialized:NoSchedule
Unschedulable: false
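To check whether a node is still waiting on cloud-provider initialization, the taints can be inspected from `kubectl get node <name> -o json`. A small sketch (the field names follow the Kubernetes Node spec; the sample data mirrors the taints listed above):

```python
import json

# Sample node spec mirroring the taints shown above; in practice,
# load the output of `kubectl get node <name> -o json` instead.
node = json.loads("""
{"spec": {"taints": [
    {"key": "node.cloudprovider.kubernetes.io/uninitialized",
     "value": "true", "effect": "NoSchedule"},
    {"key": "node.cluster.x-k8s.io/uninitialized", "effect": "NoSchedule"}
]}}
""")

# Taints that should be removed once the CCM (and Cluster API)
# finish initializing the node.
UNINIT_TAINTS = {
    "node.cloudprovider.kubernetes.io/uninitialized",
    "node.cluster.x-k8s.io/uninitialized",
}

pending = [t["key"] for t in node["spec"].get("taints", [])
           if t["key"] in UNINIT_TAINTS]
print(pending)  # both uninitialized taints are still present
```

If `pending` stays non-empty indefinitely, the CCM never completed node initialization, which matches the "unable to determine network interfaces" error above.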
What did you expect to happen:
New nodes join the cluster, the CCM initializes them, and the uninitialized (NoSchedule) taints are removed.
Anything else you would like to add:
VMs are created from a machine template hosted on Prism Central.
Workaround: Downgrading to a previous Nutanix Cloud Provider version restores normal behavior.
Environment:
- Prism Central version: 7.3
- cloud-provider-nutanix version: v0.6.0
- Kubernetes version (use kubectl version): v1.33.5
- OS (e.g. from /etc/os-release): Ubuntu 22.04