This guide covers common issues encountered when operating CAPHV (Cluster API Provider Harvester), their symptoms, root causes, and fixes. It assumes familiarity with Cluster API concepts (Cluster, Machine, MachineDeployment, ClusterClass) and Harvester HCI (VMs, IPPools, cloud-init).
All commands assume you have kubectl configured to reach the management cluster unless stated otherwise. Commands targeting the Harvester cluster or a workload cluster are explicitly noted.
- IPPool Issues
- Cloud-Init Issues
- DHCP Issues
- Turtles / Rancher Import Issues
- VM Creation Issues
- Machine Not Becoming Ready
- etcd Issues
- Useful Commands for Debugging
Symptoms:
- New HarvesterMachine objects stay in Provisioning state indefinitely.
- The HarvesterMachine condition VMIPAllocated shows False with reason VMIPPoolExhausted.
- Controller logs show: failed to allocate new IP from pool: no IP addresses available in range set.
Cause: All IPs in the configured IPPool range have been allocated. This can happen when the pool is undersized for the cluster, or when previously deleted machines left leaked allocations (see IP Leak below).
Fix:
- Check the current pool state on the Harvester cluster:
# On Harvester
kubectl get ippool <pool-name> -n <namespace> -o jsonpath='{.status.available}'
- If the available count is 0, either expand the pool range or free leaked IPs (see IP Leak section).
- To expand the pool, edit the IPPool on Harvester:
# On Harvester
kubectl edit ippool <pool-name> -n <namespace>
Adjust spec.ranges[0].rangeEnd to include more addresses.
- If you created the IPPool via the HarvesterCluster spec (inline ipPool), update the HarvesterCluster object's vmNetworkConfig.ipPool.rangeEnd field and the controller will reconcile the pool.
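Before expanding a pool, it can help to check how many addresses a range actually holds. The following is a minimal Python sketch, assuming the range is defined by an inclusive start and end IPv4 address. The function names `range_capacity` and `needs_expansion` and the `range_start` parameter are illustrative and not part of CAPHV.

```python
import ipaddress


def range_capacity(range_start, range_end):
    """Return the number of addresses in an inclusive IPv4 range (assumed semantics)."""
    start = int(ipaddress.IPv4Address(range_start))
    end = int(ipaddress.IPv4Address(range_end))
    if end < start:
        raise ValueError("range end is below range start")
    return end - start + 1


def needs_expansion(range_start, range_end, allocated_count, machines_wanted):
    """True when the free addresses cannot cover the machines still to be created."""
    free = range_capacity(range_start, range_end) - allocated_count
    return free < machines_wanted
```

For example, `needs_expansion("172.16.3.40", "172.16.3.49", 10, 1)` reports that a fully allocated ten-address range cannot serve one more machine.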
Symptoms:
- HarvesterCluster condition VMIPPoolReady is False.
- Controller logs show: failed to get IPPool or IPPool not found.
- Machines never receive an IP allocation.
Cause:
The vmNetworkConfig.ipPoolRef in the HarvesterCluster spec references a pool name that does not exist in the target namespace on Harvester, or the namespace does not match spec.targetNamespace.
Fix:
- Verify the pool exists on Harvester in the correct namespace:
# On Harvester
kubectl get ippool -A
- Confirm the ipPoolRef value in your HarvesterCluster matches <namespace>/<name> or just <name> if the pool is in the same namespace as targetNamespace:
kubectl get harvestercluster <name> -n <ns> -o jsonpath='{.spec.vmNetworkConfig.ipPoolRef}'
- If using ClusterClass with a topology variable for ipPoolRef, verify the variable value in the Cluster's spec.topology.variables.
- The ipPoolRef format must be <namespace>/<name> when the pool is in a different namespace from targetNamespace. A bare name is resolved against targetNamespace.
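The resolution rule for ipPoolRef described above can be sketched in Python as follows. This is an illustration of the stated rule only, not the controller's code; the function name `resolve_pool_ref` is hypothetical.

```python
def resolve_pool_ref(ip_pool_ref, target_namespace):
    """Split an ipPoolRef into (namespace, name).

    A bare name is resolved against target_namespace; a value of the form
    <namespace>/<name> is used as given.
    """
    if not ip_pool_ref:
        raise ValueError("ipPoolRef is empty")
    parts = ip_pool_ref.split("/")
    if len(parts) == 1:
        return target_namespace, parts[0]
    if len(parts) == 2 and all(parts):
        return parts[0], parts[1]
    raise ValueError("ipPoolRef must be <name> or <namespace>/<name>: %r" % ip_pool_ref)
```

Comparing the resolved pair with the output of kubectl get ippool -A on Harvester shows quickly whether the reference points at an existing pool.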
Symptoms:
- A machine was deleted but the pool's status.available count did not increase.
- The IP still appears in status.allocated of the IPPool on Harvester.
- New machines fail to get that IP even though the original machine no longer exists.
Cause:
The CAPHV controller calls Store.Release() during machine deletion to free the IP. If the controller was not running during deletion (e.g., it was restarting or the finalizer was removed manually), the release may not have happened. The IPPool status.allocated map retains the stale entry.
Fix:
- Identify the leaked IP:
# On Harvester
kubectl get ippool <pool-name> -n <ns> -o jsonpath='{.status.allocated}' | python3 -m json.tool
- Cross-reference with existing HarvesterMachine objects:
kubectl get harvestermachine -A -o custom-columns=NAME:.metadata.name,IP:.status.allocatedIPAddress
- For any IP in status.allocated whose machine no longer exists, manually release it:
# On Harvester
kubectl edit ippool <pool-name> -n <ns>
Remove the leaked entry from status.allocated and increment status.available by the number of entries removed.
- To prevent future leaks, ensure the CAPHV controller is always running and that HarvesterMachine finalizers (harvestermachine.infrastructure.cluster.x-k8s.io) are never removed manually.
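The cross-reference between the pool and the machines can be done mechanically. Below is a minimal Python sketch, assuming that status.allocated serializes as a JSON object keyed by IP address (the values are not inspected) and that the machine IPs were collected from the custom-columns command, where kubectl prints `<none>` for unset fields. The function name `find_leaked_ips` is illustrative.

```python
import json


def find_leaked_ips(allocated_json, machine_ips):
    """Return allocated IPs that no existing HarvesterMachine reports.

    allocated_json: output of the jsonpath query on .status.allocated
                    (assumed to be a JSON object keyed by IP address).
    machine_ips:    values of .status.allocatedIPAddress for all machines.
    """
    allocated = json.loads(allocated_json) or {}
    in_use = {ip for ip in machine_ips if ip and ip != "<none>"}
    return sorted(ip for ip in allocated if ip not in in_use)
```

Any address returned is a candidate for manual release; confirm that no VM still uses it before editing the pool.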
Symptoms:
- Two VMs on Harvester have the same IP address.
- IP conflicts on the network (ARP flapping, intermittent connectivity).
- Both HarvesterMachine objects show the same value in status.allocatedIPAddress.
Cause:
This was a bug in versions prior to v0.2.0 where Store.Reserve() did not update status.allocated correctly. Every call to AllocateVMIPFromPool() saw an empty allocated map and returned the first IP in the range.
Fix:
- Upgrade CAPHV to v0.2.0 or later. The Store.Reserve() function now properly writes to status.allocated before returning.
- For an already-affected cluster, manually resolve the conflict:
# On Harvester - check allocated map
kubectl get ippool <pool-name> -n <ns> -o jsonpath='{.status.allocated}' | python3 -m json.tool
- Delete one of the conflicting machines and let CAPHV recreate it with a new unique IP:
kubectl delete machine <conflicting-machine-name> -n <ns>
- The MachineHealthCheck (if configured) will detect the missing machine and trigger replacement automatically.
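To find which machines share an address on an already-affected cluster, the machine-to-IP mapping can be grouped by IP. A small Python sketch, assuming the mapping was collected from .status.allocatedIPAddress of each HarvesterMachine; the function name `find_duplicate_ips` is illustrative.

```python
from collections import defaultdict


def find_duplicate_ips(machine_to_ip):
    """Return {ip: [machine names]} for every IP reported by more than one machine."""
    by_ip = defaultdict(list)
    for name, ip in machine_to_ip.items():
        if ip:  # skip machines that have no address yet
            by_ip[ip].append(name)
    return {ip: sorted(names) for ip, names in by_ip.items() if len(names) > 1}
```

Each entry in the result lists the machines in conflict, of which all but one need to be deleted and recreated.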
Symptoms:
- The VM boots but network interfaces are unconfigured.
- cloud-init status --long on the VM shows success, but no IP is set on eth0.
- /var/log/cloud-init.log shows warnings about unrecognized network-config format.
Cause: Harvester uses cloud-init network-config version 1. SLES/openSUSE with wicked requires v1 format. If the network config is accidentally provided in v2 format (e.g., from a custom template or manual edit), wicked ignores it entirely.
CAPHV generates v1 format automatically via buildNetworkDataStatic():
version: 1
config:
  - type: physical
    name: eth0
    subnets:
      - type: static
        address: 172.16.3.40
        netmask: 255.255.0.0
        gateway: 172.16.0.1
Fix:
- SSH into the VM and check the rendered network config:
cat /var/lib/cloud/seed/nocloud/network-config
- Confirm it starts with version: 1. If it shows version: 2, the cloud-init secret on Harvester was built incorrectly.
- Check the cloud-init secret on Harvester:
# On Harvester
kubectl get secret <machine-name>-cloud-init -n <target-ns> -o jsonpath='{.data.networkdata}' | base64 -d
- If using custom bootstrap data, ensure your bootstrap provider does not inject a network-config v2 payload that overrides the CAPHV-generated v1 config.
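The version check can be scripted against the decoded networkdata. This is a minimal Python sketch that looks for the first version: line using a regular expression rather than a YAML parser, so it is a quick heuristic and not a full validation. The function name `network_config_version` is illustrative.

```python
import re

VERSION_LINE = re.compile(r"^\s*version:\s*(\d+)\s*$")


def network_config_version(text):
    """Return the network-config version found in text, or None if absent."""
    for line in text.splitlines():
        match = VERSION_LINE.match(line)
        if match:
            return int(match.group(1))
    return None
```

A return value of 2 for a SLES or openSUSE machine indicates that something other than CAPHV produced the network config.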
Symptoms:
- The VM boots but cloud-init does not apply userdata or networkdata.
- cloud-init status --long shows "no datasource" or the config appears empty.
- The cloud-init secret exists on Harvester but the VM ignores it.
Cause:
KubeVirt's CloudInitNoCloud datasource reads secret keys as lowercase userdata and networkdata. If the keys are camelCase (userData, networkData), they are silently ignored. This is a KubeVirt requirement, not a cloud-init one.
Fix:
- Check the secret key names on Harvester:
# On Harvester
kubectl get secret <machine-name>-cloud-init -n <target-ns> -o jsonpath='{.data}' | python3 -m json.tool
- The keys must be exactly userdata and networkdata (all lowercase). If they are wrong, the secret was created outside of CAPHV or by a modified version.
- CAPHV always generates lowercase keys. If you are manually creating secrets, ensure the key names match exactly.
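The key-name check is easy to automate when many secrets need auditing. A minimal Python sketch, assuming the key names were taken from the secret's .data map; the function name `check_secret_keys` is illustrative.

```python
EXPECTED_KEYS = ("userdata", "networkdata")


def check_secret_keys(keys):
    """Return a list of problems with the key names of a cloud-init secret."""
    keys = list(keys)
    problems = []
    for key in keys:
        # Flag keys that differ from an expected key only by letter case.
        if key not in EXPECTED_KEYS and key.lower() in EXPECTED_KEYS:
            problems.append("%r should be %r" % (key, key.lower()))
    if "userdata" not in keys:
        problems.append("missing key 'userdata'")
    return problems
```

An empty list means the key names follow the lowercase convention described above.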
Symptoms:
- The VM is running but packages are not installed, SSH keys are not configured, or RKE2 is not bootstrapped.
- cloud-init status --long on the VM shows an error or status: not started.
Cause: Multiple possible causes:
- The cloud-init secret does not exist on Harvester.
- The secret exists but is in the wrong namespace.
- The qemu-guest-agent is not running (prevents IP reporting, not cloud-init itself).
- Cloud-init ran but encountered errors in the userdata script.
Fix:
- SSH into the VM (or use the Harvester VNC console) and check cloud-init status:
cloud-init status --long
- Check cloud-init logs:
cat /var/log/cloud-init.log
cat /var/log/cloud-init-output.log
- Verify the cloud-init secret exists on Harvester:
# On Harvester
kubectl get secret <machine-name>-cloud-init -n <target-ns>
- Verify the VM references the secret correctly:
# On Harvester
kubectl get vm <machine-name> -n <target-ns> -o jsonpath='{.spec.template.spec.volumes}' | python3 -m json.tool
Look for the cloudInitNoCloud volume source with the correct secret name.
- If cloud-init completed with errors, fix the userdata, then delete and recreate the machine to trigger a fresh cloud-init run. Cloud-init only runs once per instance ID.
Symptoms:
- The VM boots and cloud-init reports success, but the static IP configured in networkdata is replaced by a link-local address or no address at all.
- Running ip addr show eth0 shows a different IP than expected, or 169.254.x.x.
Cause: SLES uses wicked as its network manager. The wicked "nanny" daemon periodically reconciles interface configuration. If the networkdata secret was not properly generated or applied, wicked falls back to its default behavior (often link-local or no address).
Fix:
- SSH into the VM and check what wicked thinks the config should be:
wicked show-config
wicked ifstatus eth0
- Check if cloud-init wrote the network config to the correct location:
cat /etc/sysconfig/network/ifcfg-eth0
- Verify that the networkdata key exists in the cloud-init secret on Harvester (see Cloud-Init Secret Keys section).
- If the networkdata is correct in the secret but not applied, check that the cloud-init NoCloud datasource found the network config:
cat /var/lib/cloud/seed/nocloud/network-config
- As a last resort, manually configure the interface:
ip addr add <address>/<prefix> dev eth0
ip route add default via <gateway>
Symptoms:
- A VM configured for DHCP never gets an IP address.
- wicked ifup eth0 hangs or times out.
- journalctl -u wicked shows no DHCP offers received.
- Running tcpdump -i eth0 port 67 or port 68 shows DHCP offers arriving at the interface but wicked does not process them.
Cause:
This is a kernel/wicked incompatibility on KubeVirt's virtio-net interfaces. Wicked uses AF_PACKET with SOCK_DGRAM and attaches a BPF filter that uses link-layer (Ethernet header) offsets. However, SOCK_DGRAM strips the link-layer header before delivering data to BPF, so BPF sees network-layer data at link-layer offsets. The result: every DHCP response is silently dropped by the filter.
This affects SLES, openSUSE, and any distro using wicked as the DHCP client on KubeVirt/virtio-net.
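The offset mismatch can be illustrated with a small simulation. The Python sketch below is a simplified model and not the actual filter program used by wicked or dhclient: it builds a synthetic Ethernet/IPv4/UDP frame for a DHCP reply and applies a check that reads the EtherType, IP protocol, and UDP destination port at Ethernet-frame offsets. The names `build_dhcp_frame` and `link_layer_filter` are hypothetical.

```python
import struct

ETH_HEADER_LEN = 14
PAYLOAD = b"\x00" * 300  # stand-in for the DHCP message body


def build_dhcp_frame():
    """Build Ethernet + IPv4 + UDP headers for a reply from port 67 to port 68."""
    eth = b"\xff" * 6 + b"\x52\x54\x00\x00\x00\x01" + struct.pack("!H", 0x0800)
    ip = struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, 20 + 8 + len(PAYLOAD),  # version/IHL, TOS, total length
        0, 0, 64, 17, 0,                 # id, fragment, TTL, protocol=UDP, checksum
        bytes([172, 16, 0, 1]),          # source address
        bytes([255, 255, 255, 255]),     # destination address (broadcast)
    )
    udp = struct.pack("!HHHH", 67, 68, 8 + len(PAYLOAD), 0)
    return eth + ip + udp + PAYLOAD


def link_layer_filter(data):
    """Accept IPv4/UDP packets to port 68, reading fields at Ethernet offsets."""
    if len(data) < 38:
        return False
    ethertype = struct.unpack_from("!H", data, 12)[0]  # bytes 12-13 of the frame
    protocol = data[23]                                # 14 + 9
    dst_port = struct.unpack_from("!H", data, 36)[0]   # 14 + 20 + 2
    return ethertype == 0x0800 and protocol == 17 and dst_port == 68
```

With the full frame, as a SOCK_RAW socket would present it, `link_layer_filter(build_dhcp_frame())` accepts the packet. With the first 14 bytes removed, as in the SOCK_DGRAM case, the same offsets land inside the IP header and the packet is rejected.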
Fix:
CAPHV v0.2.3+ automatically works around this by injecting ISC dhclient via cloud-init bootcmd. ISC dhclient uses AF_PACKET with SOCK_RAW (Linux Packet Filter / LPF), which preserves the link-layer header for BPF. This makes DHCP work correctly.
If you are on an older CAPHV version:
- Upgrade to v0.2.3 or later.
- If you cannot upgrade, manually install dhcp-client (ISC dhclient) in the VM image and configure it instead of wicked for DHCP.
Symptoms:
- The VM is connected to a network with a DHCP server, but DHCP requests from the VM never reach the external DHCP server.
- The VM gets an IP from KubeVirt's internal DHCP (often a 10.x.x.x address) instead of the expected subnet.
Cause: KubeVirt's default bridge binding creates a bridge between the pod network and the VM interface. This bridge runs its own in-VM DHCP server that intercepts DHCP traffic. External DHCP servers on the physical network cannot be reached directly.
Fix:
- CAPHV handles this correctly: it uses an in-VM dhclient that gets its lease from the KubeVirt bridge DHCP server, which in turn mirrors the pod IP.
- If you need external DHCP (e.g., from a physical DHCP server on the VLAN), you must use masquerade or passthrough binding instead of bridge. However, CAPHV's default configuration uses bridge binding, which works with the dhclient workaround.
- For standard CAPHV usage (IPPool-based static IPs or DHCP mode), no action is needed. The in-VM DHCP client obtains the correct address.
Symptoms:
- The VM hangs during boot and never finishes cloud-init.
- cloud-init status shows status: running indefinitely.
- RKE2 never starts because cloud-init never completes.
Cause:
If dhclient is started with the -d flag (foreground/debug mode), it never forks to the background. Since it is launched from a cloud-init bootcmd, cloud-init waits for the command to exit. dhclient in foreground mode runs forever, blocking all subsequent cloud-init stages (including RKE2 bootstrap).
Fix:
- CAPHV uses -1 (try once, fork to background): dhclient sends one DHCP request, obtains a lease, and the parent process exits so cloud-init can continue. The child process remains in the background to handle lease renewals.
- Never use -d in cloud-init bootcmd. If you see a custom bootstrap template using -d, replace it with -1.
- The correct dhclient invocation generated by CAPHV is:
dhclient -1 -sf /usr/local/bin/dhclient-script-caphv.sh -lf /tmp/dhclient-eth0.lease -pf /tmp/dhclient-eth0.pid eth0
Symptoms:
- In DHCP mode, the VM gets an IP via dhclient, but shortly after boot the IP disappears or changes.
- wicked ifstatus eth0 shows wicked reconfiguring the interface.
Cause:
If a networkdata key is present in the cloud-init secret, cloud-init writes network configuration files that wicked's nanny daemon picks up. The nanny then overwrites whatever dhclient configured, replacing the DHCP-assigned IP with whatever wicked thinks the config should be (often nothing, since the networkdata may not have a valid DHCP stanza for wicked).
Fix:
- CAPHV correctly omits the networkdata key from the cloud-init secret when operating in DHCP mode. This prevents wicked from interfering.
- Verify there is no networkdata key in the secret:
# On Harvester
kubectl get secret <machine-name>-cloud-init -n <target-ns> -o jsonpath='{.data}' | python3 -m json.tool
In DHCP mode, only userdata should be present.
- If networkdata is present in DHCP mode, check if a custom bootstrap template is injecting it. Remove any network-config injection from your bootstrap provider configuration.
Symptoms:
- The CAPI Cluster has the label cluster-api.cattle.io/rancher-auto-import=true but the cluster never appears in Rancher.
- Turtles controller logs show: ca-certs setting value is empty.
- The clusters.provisioning.cattle.io object is created but the management cluster object is not.
Cause:
When Rancher is deployed with tls=external (TLS is terminated by an external load balancer or reverse proxy like Traefik), Rancher does not set the cacerts setting by default. Turtles in strict TLS mode (agent-tls-mode=true) requires cacerts to be non-empty to validate the CA chain for the cattle-cluster-agent.
Fix:
- Set the cacerts setting on Rancher to the CA certificate chain used by the TLS terminator:
# On Rancher management cluster
# Get the current setting
kubectl get settings.management.cattle.io cacerts -o yaml
# Replace with your CA chain (e.g., Let's Encrypt E7 intermediate + ISRG Root X1)
kubectl replace -f - <<'EOF'
apiVersion: management.cattle.io/v3
kind: Setting
metadata:
  name: cacerts
value: |
  -----BEGIN CERTIFICATE-----
  <your intermediate CA cert>
  -----END CERTIFICATE-----
  -----BEGIN CERTIFICATE-----
  <your root CA cert>
  -----END CERTIFICATE-----
EOF
- After changing cacerts, restart the Turtles controller:
kubectl rollout restart deploy/rancher-turtles-controller-manager -n cattle-turtles-system
- Verify the setting took effect:
kubectl get settings.management.cattle.io cacerts -o jsonpath='{.value}' | head -2
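A quick sanity check on the cacerts value is to count the PEM blocks it contains, since an empty value or a chain missing its root is a common mistake. The Python sketch below only counts delimiters and does not validate the certificates themselves; the function name `count_pem_certificates` is illustrative.

```python
import re

PEM_CERT = re.compile(
    r"-----BEGIN CERTIFICATE-----\s+.+?\s+-----END CERTIFICATE-----",
    re.DOTALL,
)


def count_pem_certificates(value):
    """Return the number of PEM certificate blocks found in value."""
    return len(PEM_CERT.findall(value or ""))
```

For an intermediate plus root chain as in the example above, the expected count is 2; a count of 0 corresponds to the empty setting that Turtles rejects.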
Symptoms:
- The cluster appears in Rancher but shows status "Waiting for agent to connect".
- The cattle-cluster-agent deployment in the workload cluster is in CrashLoopBackOff or not running.
Cause: The cattle-cluster-agent running on the workload cluster cannot connect back to the Rancher server. Common causes:
- DNS resolution failure: the workload cluster cannot resolve the Rancher hostname.
- Certificate mismatch: the agent does not trust the Rancher server certificate.
- Network connectivity: firewall rules blocking HTTPS from the workload cluster to Rancher.
Fix:
- Check the cattle-cluster-agent logs on the workload cluster:
# On workload cluster
kubectl logs deploy/cattle-cluster-agent -n cattle-system -f
- Verify DNS resolution from within the workload cluster:
# On workload cluster
kubectl run -it --rm dnstest --image=busybox --restart=Never -- nslookup rancher.example.com
- Check the serverca ConfigMap in cattle-system:
# On workload cluster
kubectl get configmap serverca -n cattle-system -o yaml
This should contain the CA certificate chain that the agent uses to verify Rancher's TLS certificate.
- If the CA is wrong, update the serverca ConfigMap with the correct chain and restart the agent:
# On workload cluster
kubectl rollout restart deploy/cattle-cluster-agent -n cattle-system
- Test HTTPS connectivity from a pod in the workload cluster:
kubectl run -it --rm curltest --image=curlimages/curl --restart=Never -- \
  curl -vk https://rancher.example.com/healthz
Symptoms:
- The CAPI Cluster object exists and is healthy, but Rancher never creates a management cluster entry for it.
- No clusters.provisioning.cattle.io object is created.
Cause: Turtles auto-import requires a specific label on the CAPI Cluster object. Without it, Turtles ignores the cluster.
Fix:
- Verify the label is present:
kubectl get cluster <name> -n <ns> -o jsonpath='{.metadata.labels.cluster-api\.cattle\.io/rancher-auto-import}'
- If missing, add it:
kubectl label cluster <name> -n <ns> cluster-api.cattle.io/rancher-auto-import=true
- Verify Turtles is running and watching the correct namespace:
kubectl get deploy rancher-turtles-controller-manager -n cattle-turtles-system
kubectl logs deploy/rancher-turtles-controller-manager -n cattle-turtles-system -f
- If using ClusterClass, ensure the label is included in the Cluster topology metadata, not just the template. CAPHV's ClusterClass generator includes this label by default when Rancher integration is enabled.
Symptoms:
- You updated the cacerts setting but existing imports are still failing with the old CA.
- New imports work but previously failed imports remain stuck.
Cause:
The Turtles controller caches the CA setting. After changing cacerts, a controller restart is required for the new value to take effect.
Fix:
- Restart the Turtles controller:
kubectl rollout restart deploy/rancher-turtles-controller-manager -n cattle-turtles-system
- For clusters that were already stuck, delete and re-create the CAPI import resources:
# Delete the stuck provisioning cluster (Turtles will recreate it)
kubectl delete clusters.provisioning.cattle.io <name> -n default
- Wait for Turtles to re-trigger the import. Monitor the Turtles logs:
kubectl logs deploy/rancher-turtles-controller-manager -n cattle-turtles-system -f
Symptoms:
- The VM object exists on Harvester but never transitions past Scheduling phase.
- kubectl get vm <name> -n <ns> shows the VM in Scheduling state.
- No VMI (VirtualMachineInstance) is created.
Cause:
- Insufficient resources on Harvester nodes (CPU, memory).
- The VM image referenced in the volume does not exist.
- Node affinity rules prevent scheduling on any available node.
Fix:
- Check Harvester node resources:
# On Harvester
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEM:.status.allocatable.memory
- Check VM events:
# On Harvester
kubectl describe vm <name> -n <ns>
Look for scheduling-related events at the bottom.
- Verify the VM image exists:
# On Harvester
kubectl get virtualmachineimages -n <ns>
- If using nodeAffinity in HarvesterMachineSpec, verify the labels match nodes on Harvester:
# On Harvester
kubectl get nodes --show-labels
Symptoms:
- The VM shows as Starting but never transitions to Running.
- The VMI exists but shows Scheduling or Pending.
- Events mention missing secrets or SSH keypairs.
Cause:
- The cloud-init secret referenced by the VM does not exist.
- The SSH keypair referenced in the HarvesterMachine does not exist on Harvester.
- PVCs backing the VM disks are not bound.
Fix:
- Check the VMI events:
# On Harvester
kubectl describe vmi <name> -n <ns>
- Verify the cloud-init secret exists:
# On Harvester
kubectl get secret <machine-name>-cloud-init -n <target-ns>
- Verify the SSH keypair exists on Harvester:
# On Harvester
kubectl get keypairs.harvesterhci.io -n <ns>
- Check PVC status:
# On Harvester
kubectl get pvc -n <target-ns> -l harvesterhci.io/creator=caphv
All PVCs should be in Bound state.
Symptoms:
- The VM is not created because the controller failed to create PVCs.
- Controller logs show: failed to create PVC or invalid image name.
- HarvesterMachine condition VMProvisioningReady is False with reason VMProvisioningFailed.
Cause:
- The imageName in the Volume spec does not match the format namespace/name. Image names with underscores (e.g., default/sles15-sp7-minimal-vm.x86_64-cloud-qu2) are valid as of v0.2.0 (the CheckNamespacedName regex was fixed to allow underscores).
- The Longhorn storage class for the image does not exist on Harvester.
- The volume spec references a storageClass that does not exist.
Fix:
- Verify the image exists on Harvester and note its exact name:
# On Harvester
kubectl get virtualmachineimages -A
- Verify the storage class exists:
# On Harvester
kubectl get storageclass
For image volumes, the expected storage class is longhorn-<imageName>.
- Fix the imageName in the HarvesterMachineTemplate to use namespace/name format:
volumes:
  - volumeType: image
    imageName: default/sles15-sp7-minimal-vm
    volumeSize: 40Gi
    bootOrder: 1
- Check controller logs for the exact error:
kubectl logs deploy/caphv-controller-manager -n caphv-system | grep -i "pvc\|volume\|image"
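Template values can be checked for the namespace/name shape before they are applied. The Python sketch below uses a hypothetical pattern chosen to accept the example image names in this guide, including underscores and dots; it is not the CheckNamespacedName regex from CAPHV, whose exact form is not reproduced here.

```python
import re

# Hypothetical pattern: <namespace>/<name>, with dots, dashes, and underscores
# allowed in the name part. The real CAPHV regex may differ.
NAMESPACED_NAME = re.compile(r"^([a-z0-9][a-z0-9.-]*)/([A-Za-z0-9][A-Za-z0-9._-]*)$")


def split_image_name(image_name):
    """Return (namespace, name) or raise ValueError if the format is wrong."""
    match = NAMESPACED_NAME.match(image_name)
    if not match:
        raise ValueError("imageName must be <namespace>/<name>: %r" % image_name)
    return match.group(1), match.group(2)
```

A bare name such as sles15-sp7-minimal-vm raises ValueError, which corresponds to the invalid image name case described above.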
Symptoms:
- The VM boots from the wrong disk (e.g., a blank data disk instead of the OS image disk).
- The VM enters a boot loop or PXE boot screen.
Cause:
The bootOrder field in the Volume spec determines which disk KubeVirt tries to boot from first. Lower numbers boot first. If bootOrder is not set (0), disks boot in the order they appear in the spec.
Fix:
- Check the current boot order on the VM:
# On Harvester
kubectl get vm <name> -n <ns> -o jsonpath='{.spec.template.spec.domain.devices.disks}' | python3 -m json.tool
- Set explicit bootOrder in the HarvesterMachineTemplate:
volumes:
  - volumeType: image
    imageName: default/sles15-sp7-minimal-vm
    volumeSize: 40Gi
    bootOrder: 1 # Boot from this disk first
  - volumeType: storageClass
    storageClass: longhorn
    volumeSize: 10Gi
    bootOrder: 2 # Data disk, do not boot from this
- After updating the template, existing machines are not affected. Delete and recreate machines to apply the new boot order, or scale down and back up for MachineDeployment-managed workers.
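The boot-order rule can be modelled to predict which disk a given volume list will boot from. A minimal Python sketch, assuming that volumes with an explicit bootOrder are tried before unset ones (lowest number first) and that spec order applies only when no volume sets it. The handling of mixed lists is an assumption, and the function name `pick_boot_volume` is illustrative.

```python
def pick_boot_volume(volumes):
    """Return the volume expected to boot first, or None for an empty list.

    Each volume is a dict; a missing or zero bootOrder counts as unset.
    """
    explicit = [v for v in volumes if v.get("bootOrder", 0) > 0]
    if explicit:
        return min(explicit, key=lambda v: v["bootOrder"])
    return volumes[0] if volumes else None
```

Running this over the volumes of a template before applying it makes a misplaced data disk visible early.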
Symptoms:
- The Machine object in CAPI stays in Provisioned state but never becomes Running.
- The node exists in the workload cluster but kubectl get nodes -o wide shows an empty PROVIDER-ID.
- The cloud-provider-harvester pod is in CrashLoopBackOff or Pending on the workload cluster.
Cause:
The cloud-provider-harvester typically sets the providerID on each node. However, it cannot schedule or function until CNI (Calico/Flannel) is running, and CNI cannot run until the node.cloudprovider.kubernetes.io/uninitialized taint is removed. This is a chicken-and-egg problem.
CAPHV v0.2.0+ solves this by setting the providerID and removing the taint directly from the management cluster via InitializeWorkloadNode().
Fix:
- Upgrade CAPHV to v0.2.0 or later. The controller automatically:
  - Sets spec.providerID on the workload node via a Kubernetes API patch.
  - Removes the node.cloudprovider.kubernetes.io/uninitialized taint.
- If you cannot upgrade, manually fix the node:
# On workload cluster
# Set providerID (use the Harvester VM's name as the ID)
kubectl patch node <node-name> --type=merge -p '{"spec":{"providerID":"harvester://<vm-name>"}}'
# Remove the taint
kubectl taint nodes <node-name> node.cloudprovider.kubernetes.io/uninitialized-
- Verify the cloud-provider-harvester deployment on the workload cluster has the correct bootstrap configuration:
  - hostNetwork: true and dnsPolicy: ClusterFirstWithHostNet (CNI not ready at boot).
  - Toleration for node.cloudprovider.kubernetes.io/uninitialized.
  - replicas: 1 (hostNetwork prevents port binding conflicts with multiple replicas).
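The two manual steps can be expressed as a single merge patch. The Python sketch below builds such a patch body from a node's current taints. It is an illustration only and not the implementation of InitializeWorkloadNode(); the function name `build_node_patch` is hypothetical, and the providerID format follows the manual command above.

```python
UNINITIALIZED_TAINT = "node.cloudprovider.kubernetes.io/uninitialized"


def build_node_patch(vm_name, current_taints):
    """Return a merge-patch body that sets providerID and drops the uninitialized taint.

    A JSON merge patch replaces the taints list as a whole, so every taint
    that should remain on the node is carried over explicitly.
    """
    remaining = [t for t in current_taints if t.get("key") != UNINITIALIZED_TAINT]
    return {"spec": {"providerID": "harvester://%s" % vm_name, "taints": remaining}}
```

Serialized with json.dumps, the result can be passed to kubectl patch node with --type=merge.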
Symptoms:
- The workload cluster node exists but no pods (including CNI) schedule on it.
- kubectl describe node <name> shows Taints: node.cloudprovider.kubernetes.io/uninitialized:NoSchedule.
- CNI pods are Pending.
Cause:
Same chicken-and-egg as above. When --cloud-provider=external is set, kubelet adds this taint at startup. It expects an external cloud provider to remove it. If the cloud provider cannot run (because CNI is not ready, which requires this taint to be removed), nothing progresses.
Fix: CAPHV v0.2.0+ handles this automatically. See the ProviderID Not Set section above.
Manual fix:
# On workload cluster
kubectl taint nodes <node-name> node.cloudprovider.kubernetes.io/uninitialized-
Symptoms:
- The VM is Running on Harvester.
- The HarvesterMachine object stays in Provisioning state with condition VMRunning=True but MachineCreated=False.
- Controller logs show: waiting for VM IP addresses to be reported.
Cause:
The CAPHV controller reads IP addresses from the VMI's status.interfaces field. These are populated by the qemu-guest-agent running inside the VM. If the guest agent is not installed, not running, or not yet started, no IPs are reported.
Fix:
- Check the VMI status on Harvester:
# On Harvester
kubectl get vmi <name> -n <ns> -o jsonpath='{.status.interfaces}' | python3 -m json.tool
- If interfaces is empty, SSH into the VM and check the guest agent:
systemctl status qemu-guest-agent
- CAPHV's cloud-init userdata normally installs the agent automatically (packages: [qemu-guest-agent]). If it is missing anyway, install it manually:
# On the VM
zypper install -y qemu-guest-agent
systemctl enable --now qemu-guest-agent
- Wait 30-60 seconds after the agent starts for KubeVirt to poll and update the VMI status. The controller reconciles every 30 seconds.
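Whether the guest agent has reported any address can be checked from the jsonpath output directly. A minimal Python sketch, assuming each entry of status.interfaces may carry an ipAddress field (the same field used by the VMI listing command in this guide); the function name `reported_ips` is illustrative.

```python
import json


def reported_ips(interfaces_json):
    """Return the IP addresses found in a VMI's status.interfaces JSON."""
    if not interfaces_json.strip():
        return []  # jsonpath prints nothing when the field is absent
    interfaces = json.loads(interfaces_json) or []
    return [iface["ipAddress"] for iface in interfaces if iface.get("ipAddress")]
```

An empty result while the VM is Running matches the condition in which the controller logs that it is waiting for VM IP addresses.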
Symptoms:
- A control plane machine was deleted (e.g., by MachineHealthCheck) and a new one was created, but the etcd cluster still has a member entry for the old node.
- etcdctl member list shows a member that is "unstarted" or has no name.
- The etcd cluster reports unhealthy because the stale member cannot be reached.
- New control plane nodes fail to join etcd because the cluster has an unresolvable member.
Cause:
When a control plane machine is deleted, the corresponding etcd member should be removed. CAPHV performs automatic etcd member cleanup during the machine deletion reconcile via RemoveEtcdMember(). If the cleanup fails (e.g., workload cluster unreachable, no healthy etcd pod available), the stale member persists.
RKE2's own control plane controller also handles etcd member removal in most cases. The CAPHV cleanup is a safety net.
Fix:
- Identify the stale member:
# On workload cluster (from any healthy CP node)
ETCDCTL_API=3 etcdctl \
  --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  --endpoints https://127.0.0.1:2379 \
  member list -w table
- Remove the stale member by its hex ID:
# On workload cluster
ETCDCTL_API=3 etcdctl \
  --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  --endpoints https://127.0.0.1:2379 \
  member remove <hex-member-id>
- Alternatively, use kubectl exec from the management cluster (this is what CAPHV does internally):
# Find a healthy etcd pod
kubectl get pods -n kube-system -l component=etcd,tier=control-plane --kubeconfig <workload-kubeconfig>
# List members
kubectl exec -n kube-system etcd-<healthy-node> --kubeconfig <workload-kubeconfig> -- \
  etcdctl --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  --endpoints https://127.0.0.1:2379 \
  member list -w json
- After removing the stale member, the replacement control plane node should be able to join the etcd cluster. If it is still stuck, restart the rke2-server service on the replacement node:
# On the replacement CP node
systemctl restart rke2-server
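When working from the member list -w json output, stale members can be picked out programmatically. The Python sketch below assumes the JSON has a top-level members array whose entries carry a decimal ID and, for started members, a name. It converts the ID to the hex form that member remove expects. The function name `stale_member_ids` is illustrative and this is not CAPHV's RemoveEtcdMember() code.

```python
import json


def stale_member_ids(member_list_json):
    """Return hex IDs of members without a name (unstarted or stale)."""
    members = json.loads(member_list_json).get("members", [])
    return [format(m["ID"], "x") for m in members if not m.get("name")]
```

Verify each returned ID against the table output before removing it, because a member that is still joining also appears without a name.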
List all CAPI and CAPHV resources:
kubectl get cluster,machine,harvestermachine,harvestermachinetemplate -A
Watch CAPHV controller logs:
kubectl logs deploy/caphv-controller-manager -n caphv-system -f
Increase controller log verbosity (add to the deployment args):
kubectl edit deploy/caphv-controller-manager -n caphv-system
# Add --v=5 to the container args
Check HarvesterMachine conditions:
kubectl describe harvestermachine <name> -n <ns>
List all CAPI clusters with status:
kubectl get cluster -A -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,PHASE:.status.phase,READY:.status.conditions[0].status
Check MachineHealthCheck status:
kubectl get machinehealthcheck -A
kubectl describe machinehealthcheck <name> -n <ns>
List all IPPools and their availability:
kubectl get ippool -A -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,AVAILABLE:.status.available
Check IP pool allocations (detailed):
kubectl get ippool <name> -n <ns> -o jsonpath='{.status.allocated}' | python3 -m json.tool
List VMs with their status:
kubectl get vm -A -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,STATUS:.status.printableStatus
List VMIs with IP addresses:
kubectl get vmi -A -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,PHASE:.status.phase,IPS:.status.interfaces[*].ipAddress
Check PVCs created by CAPHV:
kubectl get pvc -A -l harvesterhci.io/creator=caphv
Check cloud-init secret contents:
# Userdata
kubectl get secret <machine>-cloud-init -n <ns> -o jsonpath='{.data.userdata}' | base64 -d
# Networkdata (only present in static IP mode)
kubectl get secret <machine>-cloud-init -n <ns> -o jsonpath='{.data.networkdata}' | base64 -d
Check cloud-init status:
cloud-init status --long
Check cloud-init logs:
cat /var/log/cloud-init.log
cat /var/log/cloud-init-output.log
Check RKE2 service status:
journalctl -u rke2-server -f # Control plane nodes
journalctl -u rke2-agent -f # Worker nodes
Check node status and providerID:
kubectl get nodes -o custom-columns=NAME:.metadata.name,STATUS:.status.conditions[-1].type,PROVIDER-ID:.spec.providerID,TAINTS:.spec.taints[*].key
Check etcd cluster health:
ETCDCTL_API=3 etcdctl \
--cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
--cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
--key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
--endpoints https://127.0.0.1:2379 \
endpoint health
Check cattle-cluster-agent (Rancher import):
kubectl logs deploy/cattle-cluster-agent -n cattle-system -f
kubectl get configmap serverca -n cattle-system -o yaml
Check network configuration on a VM:
ip addr show
ip route show
cat /etc/resolv.conf
wicked ifstatus --verbose eth0 # SLES only